Data Problems: Difference between revisions

No change in size, 9 February 2015
no edit summary
Line 16: Line 16:
|valign="top"|K. Schultz, EOL
|-
|valign="top"|
*It seems that institutions have their own "unique" fields that they haven't equated with DwC (or the existing iDigBio fields), so there are fields that could probably fit an existing field but don't. It would be useful to have a description, from you or the institution, of what each such field is, so the user can merge information from two fields and reduce the number of "variables" (which is what many of the fields become in an analysis). It would also be useful to request a basic formatting standard for any field that holds multiple pieces of information (such as using "|" as a separator), so that the field is easier to parse, merge, and standardize.
*I found the data very difficult to work with for the pilot study on treehoppers. It took me over a week to clean it up, group like information together, and standardize it so it could be used in an analysis; this included dates, common names, scientific names, and higher taxonomy. And, as Katja mentioned, if you search the portal on family name but a record doesn't have a higher taxonomic designation, you miss that record, and no one wants to search hundreds of genus or species names one by one to make sure they are all there. Records should absolutely contain Order, Suborder, Family, Subfamily, Tribe (if appropriate), and genus names.
*It seems that most people view these data as species page information. However, if you try to use them in an analysis, the format doesn't work well.
|valign="top"| C. Johnson, AEC
|-
|valign="top"|
*Download format and term definitions
**The columns in the download are not in a logical order. All identifier columns should be clustered together, locality information clustered together, collecting-event information clustered together, and so on. Within a cluster the data elements can be in a loose order, but they should be adjacent.
**Several terms in the download represent the same information but have names that differ only slightly (e.g. VerbatimEventDate, verbatimEventDate). These should be merged in the download file, or at least returned next to each other.
**There is no document that defines the terms. One should be provided. Further, those definitions should have URI identifiers so that individuals can reuse them with confidence (for example, by including them in a meta.xml).
*Portal behavior
**When searching the portal, certain fields should not require an exact match. These include the Collector and Locality fields. There are others, but these were the most limiting.
**Higher taxonomy should be included to improve the search, with family name being the most important. If it is not in the provider's dataset, it should be added automatically upon ingestion into iDigBio. Without the higher taxonomy, users will miss specimen records they are likely looking for.
*Minor issues
**Terms should be evaluated for consistency. For example, the term “row number” contains a space.
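
Two of the requests above (merging terms whose names differ only by case, and using "|" as a separator inside fields that hold several values) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not iDigBio code: the function names `merge_case_variant_columns` and `split_multi_value` and the sample rows are hypothetical, and keeping the first non-empty value is an assumed merge policy that a real download might not want.

```python
def merge_case_variant_columns(rows):
    """Merge columns whose names differ only by letter case.

    Assumption: the first spelling seen becomes the canonical name, and
    when a row has values under several spellings, the first non-empty
    value is kept.
    """
    canonical = {}  # lowercased name -> first spelling seen
    merged_rows = []
    for row in rows:
        merged = {}
        for name, value in row.items():
            key = canonical.setdefault(name.lower(), name)
            # Only overwrite when nothing non-empty has been stored yet.
            if not merged.get(key):
                merged[key] = value
        merged_rows.append(merged)
    return merged_rows


def split_multi_value(value, separator="|"):
    """Split a field holding several values joined by a separator."""
    return [part.strip() for part in value.split(separator) if part.strip()]


# Hypothetical sample rows, as they might look after reading a download.
rows = [
    {"VerbatimEventDate": "12 May 1998", "verbatimEventDate": ""},
    {"VerbatimEventDate": "", "verbatimEventDate": "June 2001"},
]
merged = merge_case_variant_columns(rows)
collectors = split_multi_value("A. Smith | B. Jones")
```

With these sample rows, each merged row ends up with a single `VerbatimEventDate` column. If two spellings ever carried different non-empty values, this policy would silently keep the first, so a real cleanup would probably want to flag such conflicts instead.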