Data Problems: Difference between revisions

*Your Darwin Core archives have all the information we need, but filtering out all the label images will be a challenge.  We may have to employ a content-based image retrieval algorithm for the collections that have both label only and organism images, and this may take a while to develop.


**I was surprised to find a Creative Commons license link in the dcterms:rights field of occurrence files. In the files I looked at (e.g., Recordset 69037495-438d-4dba-bf0f-4878073766f1), there is no dwc:rightsHolder entry in the occurrence file, so it appears that there is a license but the licensor is not named. If these occurrences really have license restrictions, this complicates things for us. Our data model treats the image + metadata as one media object, and we cannot accommodate different licenses within that object. If the media and occurrence licenses are always the same, it wouldn't be a problem, but in cases where they differ, we could not use the data from the occurrence file. This means descriptions and locality information could not be displayed alongside the image on EOL, and they would not be available through the EOL API, which considerably decreases the value of these images to our users.


**Also, we would not be able to use label data in TraitBank if the occurrences are licensed. While we recognize licenses at the data set level, we do not implement them at the level of individual records. We have had discussions about this and came to the conclusion that, like measurements and facts, occurrence records are unlikely to be protected by copyright, especially when they are presented in a commonly used standard like DwC. Of course, we won't know for sure until somebody files a lawsuit, but we decided to err on the side of openness. Is there any chance this issue could be brought up for discussion at iDigBio?


**We'll have a little more work to do before we're ready to import any of the iDigBio data.  I'll let you know if there is any progress on our end. (K. Schultz, EOL)


*It seems that institutions have their own "unique" fields that they haven't equated with DwC (or the existing iDigBio fields), so there are fields that probably could fit an existing field but don't. It might be useful to have a description from you or the institution of what each field is, so the user can merge info from two fields to reduce the number of "variables" (which many of the fields are in an analysis). It might also be useful to request a basic formatting standard for any field that holds multiple bits of information (like using "|" as a separator), so it is easier to parse, merge, and standardize.


*I found the data very difficult to work with for the pilot study on treehoppers. It took me over a week to clean it up, put like information together, and standardize it so it could be used in an analysis. This includes dates, common names, scientific names, and higher taxonomy. And, as Katja mentioned, if you search the portal on family name but a record doesn't have a higher taxonomic designation, you miss all those records, and no one wants to search by hundreds of genus or species names one by one to make sure they are all there. Records should absolutely contain Order, Suborder, Family, Subfamily, Tribe (if appropriate), and genus names.


*It seems that most people view these data as species page information. However, if you try to use it to do an analysis, the format doesn't work well. (C. Johnson, AEC)
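
To illustrate the "|" separator suggestion above, here is a minimal Python sketch of how a user might split a multi-value field and merge an institution-specific field into an existing one. The field name host_plant_notes and the sample values are hypothetical, and the choice to merge into habitat is only an assumption for the example, not a mapping that iDigBio or any provider defines.

```python
def split_multivalue(value, sep="|"):
    """Split a delimited field into trimmed, non-empty parts."""
    if value is None:
        return []
    return [part.strip() for part in value.split(sep) if part.strip()]


def merge_fields(record, source_fields, target_field, sep="|"):
    """Merge several fields into one, dropping repeated values but keeping order."""
    merged = []
    for name in source_fields:
        for part in split_multivalue(record.get(name), sep):
            if part not in merged:
                merged.append(part)
    # Keep every other field unchanged and write the merged value once.
    result = {k: v for k, v in record.items() if k not in source_fields}
    result[target_field] = f" {sep} ".join(merged)
    return result


# Hypothetical record: "host_plant_notes" stands in for an institution-specific field.
record = {
    "catalogNumber": "A-001",
    "habitat": "oak woodland | on Quercus",
    "host_plant_notes": "on Quercus|leaf litter",
}
merged = merge_fields(record, ["habitat", "host_plant_notes"], "habitat")
print(merged["habitat"])
```

With a consistent separator, the same two helpers work across providers; without one, each field needs its own parsing rule.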


*Download format and term definitions
**Higher taxonomy should be included to improve the search, with family name being the most important. If it is not in the dataset from the provider, it should be added automatically upon ingestion to iDigBio. Without the higher taxonomy, a user will miss specimen records they are likely looking for.
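
As a rough sketch of what adding higher taxonomy at ingestion could look like, the Python below fills in missing ranks from a genus lookup. The lookup table here is a tiny illustrative stand-in for whatever taxonomic backbone would actually be used, and the function name is hypothetical; this is not a description of how iDigBio ingestion works.

```python
# Illustrative stand-in for a real taxonomic backbone (two treehopper genera only).
GENUS_TO_HIGHER = {
    "Membracis": {"order": "Hemiptera", "family": "Membracidae"},
    "Umbonia": {"order": "Hemiptera", "family": "Membracidae"},
}


def fill_higher_taxonomy(record, lookup):
    """Return a copy of the record with empty higher ranks filled from the genus."""
    filled = dict(record)
    genus = (record.get("genus") or "").strip()
    ranks = lookup.get(genus)
    if not ranks:
        return filled  # unknown genus: leave the record as provided
    for rank, name in ranks.items():
        # Only fill blanks; never overwrite what the provider supplied.
        if not (filled.get(rank) or "").strip():
            filled[rank] = name
    return filled


provided = {"genus": "Umbonia", "family": "", "scientificName": "Umbonia crassicornis"}
ingested = fill_higher_taxonomy(provided, GENUS_TO_HIGHER)
print(ingested["family"], ingested["order"])
```

Filling only blank ranks keeps provider-supplied values intact, so a family search would pick up the record without changing what the institution submitted.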


*Minor issues
**Terms should be evaluated for consistency. For example, the term “row number” contains a space.
**Ideally would like a tsv as well as a csv download. (K. Seltmann, R. Rabeler, TTD TCN)
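
Until a tsv download exists, a user can work around both minor issues locally. The Python sketch below converts a csv download to tsv and replaces spaces in header terms with underscores. The function names and the sample data are made up for illustration, and the underscore convention is an assumption, not an iDigBio rule.

```python
import csv
import io


def normalize_term(name):
    """Collapse whitespace in a header term, e.g. 'row number' -> 'row_number'."""
    return "_".join(name.strip().split())


def csv_to_tsv(csv_text):
    """Convert CSV text to TSV text, normalizing the header row only."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for i, row in enumerate(reader):
        if i == 0:
            row = [normalize_term(cell) for cell in row]
        writer.writerow(row)
    return out.getvalue()


# Made-up sample standing in for a downloaded file.
sample = 'row number,scientificName\n1,"Umbonia crassicornis"\n'
print(csv_to_tsv(sample))
```

Using the csv module for both reading and writing means quoted values that contain commas or tabs are handled correctly, which a plain text replace of "," with a tab would not do.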