Example of trivial transformations on INHS fish dataset

From iDigBio
Revision as of 15:21, 3 January 2014 by Joanna (talk | contribs)

Introduction


The Illinois Natural History Survey (INHS) fish collection has graciously shared the 105,742 specimen records used in this transformation example. The records were extracted as a Comma-Separated Values (CSV) file from the INHS FileMaker Pro database, and every specimen record was provided with a Globally Unique Identifier (GUID). The GUID scheme chosen by the INHS collection managers was an HTTP-based URI of the form:

http://biocoll.inhs.illinois.edu/fish/INHS<catalogue_number>

Once the data was received, the first step was to verify that the identifiers (GUIDs) were unique. This check can be performed quickly using either the unique filter in Excel or the Unix 'uniq' command (which compares only adjacent lines, so the input must be sorted first). This dataset contained no duplicate identifiers.
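The same uniqueness check can also be scripted. The following is a minimal sketch in Python, not the procedure actually used for this dataset. The column name 'occurrenceID' and the sample rows (including the catalogue numbers) are assumptions made for illustration; the real export may name its GUID column differently.

```python
import csv
import io
from collections import Counter


def find_duplicate_guids(csv_file, guid_column="occurrenceID"):
    """Return {guid: count} for every GUID that occurs more than once.

    guid_column is a hypothetical column name; adjust it to the real header.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter(row[guid_column] for row in reader)
    return {guid: n for guid, n in counts.items() if n > 1}


# Tiny in-memory sample standing in for the real CSV export (made-up records).
sample = io.StringIO(
    "occurrenceID,family\n"
    "http://biocoll.inhs.illinois.edu/fish/INHS1,Cyprinidae\n"
    "http://biocoll.inhs.illinois.edu/fish/INHS2,Percidae\n"
    "http://biocoll.inhs.illinois.edu/fish/INHS1,Cyprinidae\n"
)
duplicates = find_duplicate_guids(sample)
print(duplicates)
```

An empty result means every identifier is unique. For the real file, an open file handle (for example `open(path, newline="")`) can be passed in place of the in-memory sample.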

Mapping terms to standard terms


The next step was to go through each field of the dataset and gather information about its meaning, so that each field could be mapped to the appropriate standard term. This exchange resulted in the following transformations:

  1. Mapped the 'latitude' and 'longitude' fields to 'verbatimLatitude' and 'verbatimLongitude' because not all lat/long values were in decimal format.
  2. Removed the 'GIS_Latitude_IL' and 'GIS_Longitude_IL' fields, since those coordinates have not been confirmed to be accurate.
  3. Mapped 'specimen_remarks' to 'dwc:occurrenceRemarks' and 'remarks' to 'dwc:locationRemarks' since the latter are comments specific to a collecting location and the former are specific to a species collected at a location.
  4. Since there are no appropriate DwC terms for water-based locations, and MISC makes use of a hierarchical/ranked data model (not a flat data model), new terms ('inhs:location_Stream', 'inhs:location_RiverMile', and 'inhs:location_Basin') were created in the INHS namespace. The stream is the more specific location where the specimen was collected, the river mile indicates the mileage along the stream, and the basin is the larger water body.
  5. Concatenated all the water-based locations (separated by comma) into 'dwc:waterBody'.
  6. Created a new term 'inhs:locationTrs' to store Township Range Section (TRS) information. This term will also be recommended for inclusion in MISC.
  7. Concatenated information from 'dwc:day', 'dwc:month', and 'dwc:year' whenever possible into 'dwc:eventDate' using the ISO 8601 format, including date ranges. Cases where an event date could not be generated included imprecise values such as 'Fall' or 'Spring'.
  8. Mapped the count of specimens that received special preparation into the MISC term 'idigbio:preparationCount'.
  9. Transformed and merged the acronyms used to indicate the various levels of species endangerment into the MISC term 'idigbio:endangeredStatus'. The acronym mappings used were: SE=State Endangered, FE=Federally Endangered, ST=State Threatened, FT=Federally Threatened, I=Introduced.
  10. Duplicated the field for the specimen GUID (dwc:occurrenceID) to also indicate the record GUID (idigbio:recordID).
  11. All other information had trivial mappings into DwC terms, namely 'dwc:institutionCode', 'dwc:catalogNumber', 'dwc:family', 'dwc:scientificName', 'dwc:identifiedBy', 'dwc:locality', 'dwc:verbatimLocality', 'dwc:county', 'dwc:state', 'dwc:country', 'dwc:day', 'dwc:month', 'dwc:year', 'dwc:recordedBy', 'dwc:individualCount', 'dwc:preparation', 'dwc:fieldNumber', 'dwc:typeStatus'.
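
Three of the transformations above (items 5, 7, and 9) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code used for the INHS dataset. The function names are hypothetical, as are the order of the water-body parts, the ', ' and '; ' separators, the delimiters accepted between acronyms, and the sample values. Date ranges are left out because the source does not describe how they were recorded.

```python
import datetime
import re

# Acronym expansions taken from item 9 of the list above.
STATUS_LABELS = {
    "SE": "State Endangered",
    "FE": "Federally Endangered",
    "ST": "State Threatened",
    "FT": "Federally Threatened",
    "I": "Introduced",
}


def make_event_date(year, month, day):
    """Build an ISO 8601 date (YYYY, YYYY-MM, or YYYY-MM-DD), or None.

    Returns None when any supplied part is not a number (e.g. 'Fall')
    or when the parts do not form a valid calendar date.
    """
    parts = [(p or "").strip() for p in (year, month, day)]
    # Drop trailing blanks so year-only and year-month dates are allowed.
    while parts and parts[-1] == "":
        parts.pop()
    if not parts or not all(p.isdecimal() for p in parts):
        return None
    nums = [int(p) for p in parts]
    try:
        # Validate against the calendar; missing month/day default to 1
        # for this check only.
        datetime.date(*(nums + [1, 1])[:3])
    except ValueError:
        return None
    return "-".join(f"{n:0{w}d}" for n, w in zip(nums, (4, 2, 2)))


def make_water_body(stream, river_mile, basin):
    """Join the non-empty water-based location parts with commas.

    The stream, river mile, basin order is an assumption.
    """
    parts = [(p or "").strip() for p in (stream, river_mile, basin)]
    return ", ".join(p for p in parts if p)


def make_endangered_status(raw):
    """Expand acronyms such as 'SE, FE'; unknown codes pass through as-is."""
    codes = [c for c in re.split(r"[,\s;/]+", (raw or "").strip()) if c]
    return "; ".join(STATUS_LABELS.get(c.upper(), c) for c in codes)


# Made-up sample values, for illustration only.
print(make_event_date("1985", "6", "14"))
print(make_event_date("1985", "Fall", ""))
print(make_water_body("Salt Creek", "", "Illinois River"))
print(make_endangered_status("SE, FE"))
```

Returning None for imprecise values keeps 'dwc:eventDate' empty for those records rather than guessing a date, while the original 'dwc:day', 'dwc:month', and 'dwc:year' values remain available.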

Go back to: Data Ingestion Guidance

Go back to: CYWG page