Scrum:Planning 20111129
Planning Meeting 11/29/2011
Problems
Problems with Storage
- Full text search at scale - use Riak?
- File system - cost, scale, access via web -> use object store, swift
- Efficient text store - divide data into text and objects
- Too big/unresilient for 1 location
- Federation & replication is hard - Swift sort of has repl, Riak does but $$
- "Backup" and "Archive"
- Mapping many:many between imgs + specimens
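The many:many mapping between images and specimens can be sketched as a pair of inverse indexes (IDs here are illustrative, not real iDigBio identifiers): one herbarium-sheet image may show several specimens, and one specimen may appear in several images.

```python
# Minimal sketch of the many:many image/specimen mapping with two inverse
# indexes, one per lookup direction (hypothetical IDs).
from collections import defaultdict

specimens_by_image = defaultdict(set)
images_by_specimen = defaultdict(set)

def link(image_id, specimen_id):
    """Record that an image depicts a specimen, in both directions."""
    specimens_by_image[image_id].add(specimen_id)
    images_by_specimen[specimen_id].add(image_id)

link("img-001", "spec-A")
link("img-001", "spec-B")   # one image, two specimens
link("img-002", "spec-A")   # one specimen, two images

print(sorted(specimens_by_image["img-001"]))  # ['spec-A', 'spec-B']
print(sorted(images_by_specimen["spec-A"]))   # ['img-001', 'img-002']
```

In a key/value store like Riak the same shape would be two buckets keyed by image ID and specimen ID respectively, updated together.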
Problems w/ Local Data Processing
- Iteration / MapReduce performance (Riak may support natively)
- API programming ease
- File system support
- Access control / Metering / Monitoring & Policy
- Appliance vs service vs vm
- Port existing tools to run on our system
- Download results vs update iDigBio
- Image processing
Problems with Data Exposure
- Large requests, e.g. results for "US" - metering, rate limiting
- Formats - JSON, XML, CSV - and hierarchical data
- Programmatic access efficiency / latency, for r in set do
- API bindings for used languages
- Usage tracking
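The "for r in set do" latency concern above is about clients making one round trip per record. A page-wise loop amortizes that cost; this sketch fakes the server side with a local function (`fetch_page` is a stand-in for a hypothetical search endpoint, not a real API).

```python
# Sketch: iterate a large result set in pages rather than one request per
# record, cutting round trips from O(n) to O(n / page_size).
def fetch_page(records, offset, limit):
    """Stand-in for a remote search call: return one page of results."""
    return records[offset:offset + limit]

def iterate_records(records, page_size=100):
    """Yield every record, fetching page_size records per 'request'."""
    offset = 0
    while True:
        page = fetch_page(records, offset, page_size)
        if not page:
            break
        yield from page
        offset += page_size

dataset = [{"id": i} for i in range(250)]
assert len(list(iterate_records(dataset))) == 250
```

Server-side metering and rate limiting would then count pages (requests), which is also what makes large "US"-sized result sets tractable to police.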
Problems with Portal
- Visualization depends on - geolocation - base mapping layers
- Full text/faceted search performance
- Taxon matching needs high quality name resolution service
- Comparison to existing portals
- Web design quality
- Typical software feature requests and bugs from users - bugs -> internal redmine (poor auth integration)
- Feedback, usage tracking
Problems w/ Peers and Partners
- How much data do peers get -
- Sharding and reassembly between peers - force full copy of metadata, shard objects - best to have one place with all images but allow peers to mirror sets
- Replication protocols - OAI-PMH
- multimaster updates
- data provenance and residency tracking
- Peer training and technology skills to run our stack - or packaging for simplicity
- Usage tracking of remote data access
- Peer storage of object versions
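The "full copy of metadata, shard objects" idea above needs a deterministic rule for which peer is the home of each object. A minimal sketch, assuming a fixed hypothetical peer list, is hash-modulo assignment:

```python
# Sketch: every node computes the same home peer for an object ID without
# coordination; metadata is replicated everywhere, objects live at (and may
# be mirrored from) their home peer. Peer names are hypothetical.
import hashlib

PEERS = ["peer-a", "peer-b", "peer-c"]

def home_peer(object_id, peers=PEERS):
    """Deterministically map an object (e.g. an image) to one peer."""
    digest = hashlib.sha1(object_id.encode("utf-8")).hexdigest()
    return peers[int(digest, 16) % len(peers)]

assert home_peer("img-0001") == home_peer("img-0001")  # stable
assert home_peer("img-0001") in PEERS
```

Plain modulo hashing reshuffles most assignments when the peer list changes; a production design would likely use consistent hashing (as Riak and Swift do internally) so that adding a peer moves only a fraction of the objects.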
Problems with Ingestion
(bad data)
- Field mapping - standardize on Darwin Core, Audubon Core, etc
- Taxa Name -> LSID?....
- Georeferencing, provided, data importation, quality check
- Outlier detection/correction
- Staging/preview area assist with above
- Whole data set versus updates -> frequency of updates, DWC archive, TAPIR
- Human-in-loop vs Bulk
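One of the quality checks listed above (georeferencing + outlier detection) can be sketched as a bounding-box test: flag coordinates that clearly cannot lie in the stated country. The box below is approximate and illustrative, not an authoritative gazetteer.

```python
# Sketch: flag georeference outliers against a per-country bounding box.
# Boxes are rough illustrations; a real check would use polygon data.
COUNTRY_BOUNDS = {
    # country: (min_lat, max_lat, min_lon, max_lon), approximate
    "US": (18.0, 72.0, -180.0, -66.0),
}

def georef_outlier(country, lat, lon):
    """Return True if the point clearly falls outside the stated country."""
    if country not in COUNTRY_BOUNDS:
        return False  # nothing to check against; pass through
    min_lat, max_lat, min_lon, max_lon = COUNTRY_BOUNDS[country]
    return not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon)

assert georef_outlier("US", 29.6, -82.3) is False  # Gainesville, FL: ok
assert georef_outlier("US", 29.6, 82.3) is True    # sign flip: flagged
```

Flagged records would land in the staging/preview area for human-in-loop correction rather than being rejected outright.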
(good data)
- Specimen / Occurrence / Image ID [Local] -> GUID/URI (assign LSID range to each provider) [Global]
- Provenance Tracking - Collection, TCN, Uploader, Residency
- Versioning - overwrite vs append
- Required field set - GBIF minimum, not images
- Accepted protocols - TAPIR, DWC Arc, OAI-PMH, native to app, CSV, XLS, SQL
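The local-ID-to-GUID mapping above needs to be stable across re-ingestion so the same specimen never mints two GUIDs. One way to sketch that (namespace URL and provider names here are hypothetical) is a deterministic name-based UUID scoped per provider:

```python
# Sketch: map a provider-local specimen/occurrence/image ID to a stable
# global URN. uuid5 is deterministic, so re-ingesting the same record
# yields the same GUID instead of a duplicate.
import uuid

def to_guid(provider_id, local_id):
    """Derive a global URN from (provider, local ID); names illustrative."""
    ns = uuid.uuid5(uuid.NAMESPACE_URL,
                    "http://idigbio.example/" + provider_id)
    return "urn:uuid:" + str(uuid.uuid5(ns, local_id))

g = to_guid("provider-42", "SPEC-0001")
assert g == to_guid("provider-42", "SPEC-0001")   # stable across runs
assert g != to_guid("provider-43", "SPEC-0001")   # scoped per provider
```

Assigning an LSID range per provider, as noted above, is the same idea with a different identifier scheme; either way the provider/local-ID pair is the unit of uniqueness.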
Tasks
Dec 11
- M- Fix c11node22
- A- Swift -> use 5 (6) nodes on c11
- A- Riak -> install on 5 (6) nodes on c11
Dec 17
- M+K+J- Sample dataset -> Ask Kate
- A+M- Select iDigBioCore from DWC + extensions
- M- Push sample data in + convert to GeoJSON
- A- Experiment/design some Riak queries - check performance, pick indexes
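The "convert to GeoJSON" task can be sketched as a flat-record-to-Feature transform; the field names follow Darwin Core (`decimalLatitude`, `decimalLongitude`), and the sample record is made up.

```python
# Sketch: turn one flat occurrence record into a GeoJSON Point feature.
# Note GeoJSON positions are [longitude, latitude], in that order.
import json

COORD_FIELDS = ("decimalLongitude", "decimalLatitude")

def record_to_geojson(rec):
    """Build a GeoJSON Feature; non-coordinate fields become properties."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [float(rec["decimalLongitude"]),
                            float(rec["decimalLatitude"])],
        },
        "properties": {k: v for k, v in rec.items()
                       if k not in COORD_FIELDS},
    }

sample = {"scientificName": "Quercus virginiana",
          "decimalLatitude": "29.65", "decimalLongitude": "-82.32"}
feature = record_to_geojson(sample)
assert feature["geometry"]["coordinates"] == [-82.32, 29.65]
print(json.dumps(feature))
```

A FeatureCollection of these is directly consumable by Polymaps for the Dec 24 mapping task.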
Dec 24
- A+M- Pick facets for search
- A- Faceted web search, output list
- A- GeoJSON + Polymaps
Future Sprints
- Does it scale?
- Get more (good) data
- Get more (bad) data
- 3rd party API (low level)
- API scale & access
- Peering