Data Problems: Difference between revisions

(12 intermediate revisions by 4 users not shown)
Line 1: Line 1:
The following are anecdotes contributed by users of iDigBio's data. They aim to be helpful in several ways:
The following are anecdotes contributed by users of iDigBio's data and data portal. They aim to be helpful in several ways:
#Anyone submitting data should read them and make adjustments and improvements in their own to avoid the issues.
#Anyone submitting data should read them and make adjustments and improvements in their own data to avoid the issues found by others.
#They can be a springboard for interested parties to address overall data quality issues.
#They can be a springboard for interested parties to address overall data quality issues.
#This is also the place to document portal inadequacies as you see them.
#This is also the place to document portal interaction difficulties.


2 iDigBio interest groups are aware of this documentation and feed it back to their respective groups. Progress and feedback from the developers is also noted here.
iDigBio interest groups are aware of this documentation and feed it back to their respective groups. Progress and feedback from the developers is also noted here.


Another useful feature to take note of is iDigBio's data quality flags, which mark common data quality issues and data corrections that may be performed on recordsets to improve the capabilities of iDigBio Search (see https://github.com/iDigBio/idigbio-search-api/wiki/Data-Quality-Flags). Data quality flags are identified for each of the ingested recordsets on their respective portal webpage.<br>


===Anecdotes===
===User Anecdotes===
{| class="wikitable sortable" border="1"
{| class="wikitable sortable" border="1"
|-
|-
! scope="col" width="80%"  class="sortable" | Anecdote  
! scope="col" width="80%"  class="unsortable" | Anecdote  
! scope="col" width="20%" class="unsortable" | Contact  
! scope="col" width="20%" class="sortable" | Contact  
|-
|-
|valign="top"|
|valign="top"|
Line 18: Line 19:
**Also, we would not be able to use label data in TraitBank if the occurrences are licensed.  While we recognize licenses at the data set level, we do not implement them at the level of individual records.  We have had discussions about this and came to the conclusion that like measurements and facts, occurrence records are unlikely to be protected by copyright, especially when they are presented in a commonly used standard like DwC. Of course, we won't know for sure until somebody files a lawsuit.  But we decided to err on the side of openness.  Is there any chance this issue could be brought up for discussion at iDigBio?
**Also, we would not be able to use label data in TraitBank if the occurrences are licensed.  While we recognize licenses at the data set level, we do not implement them at the level of individual records.  We have had discussions about this and came to the conclusion that like measurements and facts, occurrence records are unlikely to be protected by copyright, especially when they are presented in a commonly used standard like DwC. Of course, we won't know for sure until somebody files a lawsuit.  But we decided to err on the side of openness.  Is there any chance this issue could be brought up for discussion at iDigBio?
**We'll have a little more work to do before we're ready to import any of the iDigBio data.  I'll let you know if there is any progress on our end.
**We'll have a little more work to do before we're ready to import any of the iDigBio data.  I'll let you know if there is any progress on our end.
|valign="top"|K. Schultz, EOL
|valign="top"|K. Schultz, EOL (2014) https://www.idigbio.org/redmine/issues/1393
|-
|-
|valign="top"|
|valign="top"|
Line 24: Line 25:
*I found the data very difficult to work with for the pilot study on treehoppers. It took me over a week to clean it up and put like information together and standardize information so it could be used in an analysis - this includes dates, common names, scientific names, higher taxonomy. And, as Katja mentioned, if you search the portal on family name but the record doesn't have a higher taxonomic designation, you miss all those records and no one wants to search by hundreds of genus or species names one by one to make sure they are all there. Records should absolutely contain Order, Suborder, Family, Subfamily, Tribe (if appropriate) and genus names.  
*I found the data very difficult to work with for the pilot study on treehoppers. It took me over a week to clean it up and put like information together and standardize information so it could be used in an analysis - this includes dates, common names, scientific names, higher taxonomy. And, as Katja mentioned, if you search the portal on family name but the record doesn't have a higher taxonomic designation, you miss all those records and no one wants to search by hundreds of genus or species names one by one to make sure they are all there. Records should absolutely contain Order, Suborder, Family, Subfamily, Tribe (if appropriate) and genus names.  
*It seems that most people view these data as species page information. However, if you try to use it to do an analysis, the format doesn't work well.
*It seems that most people view these data as species page information. However, if you try to use it to do an analysis, the format doesn't work well.
|valign="top"| C. Johnson, AEC
|valign="top"| C. Johnson, AEC (2/2015) https://www.idigbio.org/redmine/issues/1394
|-
|-
|valign="top"|
|valign="top"|
Line 37: Line 38:
**Terms should be evaluated for continuity. The term “row number” contains a space.
**Terms should be evaluated for continuity. The term “row number” contains a space.
**Ideally would like a tsv as well as a csv download. (support for tsv export format is coming in next release of portal)
**Ideally would like a tsv as well as a csv download. (support for tsv export format is coming in next release of portal)
|valign="top"| K. Seltmann, R. Rabeler, TTD TCN
|valign="top"| K. Seltmann, R. Rabeler, TTD TCN (2/2015) https://www.idigbio.org/redmine/issues/1395
|-
|-
|valign="top"|
|valign="top"|
Line 51: Line 52:


-->  To me, one of the things that iDigBio should be concerned about is having the portal be easily usable.  If we want it to be the "one stop" for biodiversity data, we need to see what users can get from other portals and provide improvements to that level of info.  If it's easier to get the info by using a combination of other sources, folks still might do that.  In the examples I sent along on the screen shots, that's what I am trying to show -  what we present should be at least as good as what you can get elsewhere.  If you compare the results of the "label" view that you get in the iDigBio portal with that in the CPNWH portal, it's clear (at least to me....) that ours is inferior for the reasons that I pointed out.
-->  To me, one of the things that iDigBio should be concerned about is having the portal be easily usable.  If we want it to be the "one stop" for biodiversity data, we need to see what users can get from other portals and provide improvements to that level of info.  If it's easier to get the info by using a combination of other sources, folks still might do that.  In the examples I sent along on the screen shots, that's what I am trying to show -  what we present should be at least as good as what you can get elsewhere.  If you compare the results of the "label" view that you get in the iDigBio portal with that in the CPNWH portal, it's clear (at least to me....) that ours is inferior for the reasons that I pointed out.
|valign="top"| R. Rabeler, TTD TCN
[[Media:Portal_comments_020215.pdf|See example]].
 
|valign="top"| R. Rabeler, TTD TCN (2/2015) https://www.idigbio.org/redmine/issues/1396
|-
|valign="top"|
A database search produced a rich specimen record dataset (in this case for lichens and bryophytes) for a participant at the IPT workshop. The researcher wants only the distinct taxon names (and the count of specimens per distinct taxon name). The researcher reports that downloading the dataset must then be followed by "a lot of work" to get the distinct list of taxon names to share with a colleague.
* How do we link to instructions for this sort of task, either a) through our existing UI, if possible, or b) through an API example? In this use case, Alex and Matt verified that it is already possible to do this (sort of) through our existing API.
|valign="top"|Mac Alford, (2015), entered here by Deb.
|-
|valign="top"|Collections - It would be great if collections could get a sense of their collection's uniqueness – what do they have in their collections that no other collection in the portal has – either taxonomically or geographically.  It would be great if you could get a uniqueness factor and display a summary of the unique aspects of the collection.  Maybe even include preparation information – do I have tissues that nobody else has of a particular species or from a particular area?  This would be very handy in grant proposals and in justifying the existence of particularly small collections.
|valign="top"| Andy Bentley (3/2015)
|-
|-
|valign="top"|Researchers – It would be very handy if researchers could subscribe to a portal in order to get an alert when specimens of target species or geographic regions are added to the portal.  For research purposes, this would alert them to new material that may warrant their inspection and would facilitate loan traffic from collections that have newly cataloged material.
|valign="top"| Andy Bentley (3/2015)
|}
|}
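
For the distinct-taxon-name anecdote above, one possible approach is to post-process a downloaded CSV locally. This is a minimal sketch, not the API route that Alex and Matt verified. It assumes the download has a header row and that the taxon name is in a column called <code>dwc:scientificName</code>; that column name is an assumption, so check the header of your own file and pass a different name if needed. The function names are hypothetical.

```python
import csv
from collections import Counter


def count_taxa(csv_file, name_column="dwc:scientificName"):
    """Return a Counter mapping each distinct taxon name to its specimen count.

    name_column is an ASSUMED header name; adjust it to match your download.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        # Collapse repeated whitespace and lowercase the name so that
        # trivially different spellings are counted together.
        name = " ".join((row.get(name_column) or "").split()).lower()
        if name:  # skip records with no taxon name
            counts[name] += 1
    return counts


def write_taxon_list(counts, out_file):
    """Write the distinct taxon names and counts as a two-column CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["taxon", "specimen_count"])
    for name, n in sorted(counts.items()):
        writer.writerow([name, n])
```

To use it, open the downloaded file with <code>open(path, newline="", encoding="utf-8")</code>, pass the file object to <code>count_taxa</code>, and write the result with <code>write_taxon_list</code>. The normalisation only merges names that differ in case or spacing; synonyms and misspellings would still appear as separate taxa.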