Understanding iDigBio's data downloads

Research Spotlight: December 2016

 

Contributed by Matthew Collins, Dan Stoner, and Alex Thompson

 

Data downloaded from iDigBio serve as a base for important biodiversity research, so it helps to understand how data are represented in the Darwin Core Archives (DwC-A) that you retrieve from our download system, either through the portal or through the download API.

 

The DwC-A downloads from iDigBio are zip files named with a 36-character version 4 Universally Unique IDentifier (UUID). These Globally Unique IDentifiers (GUIDs) are generated for each download and can be trusted to identify exactly one download file. The contents of a download are determined by the parameters of the search query used, the records in iDigBio at the time the query is run, and the code we wrote to generate the download. Because some or all of these may change over time, we generate a new identifier for every download.

 

 

Once a download file is generated, it is served from iDigBio’s storage cluster, the same place where iDigBio stores images and other important files. iDigBio has no plans at this time to remove generated download files from our storage cluster, so you can safely re-download a file or share a download link with your collaborators while you do your work. If you lose the link that was sent to you via email during the download process, you can use the UUID of the download to construct a URL to retrieve the file. Download URLs have this format:

 

http://s.idigbio.org/idigbio-downloads/<UUID>.zip

 

An example looks like:

 

http://s.idigbio.org/idigbio-downloads/7043a662-1707-4c25-8279-2eed45a3b...
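As a minimal sketch of rebuilding a link from a saved identifier, the Python below follows the URL format shown above. The function name `download_url` is made up for illustration, and the check for a version 4 UUID is an assumption based on the description of download names earlier in this post.

```python
import uuid

# Base of the download URL format described in this post.
BASE_URL = "http://s.idigbio.org/idigbio-downloads/"


def download_url(download_id: str) -> str:
    """Return the download URL for a download UUID (hypothetical helper)."""
    # uuid.UUID raises ValueError if the text is not a well-formed UUID.
    parsed = uuid.UUID(download_id)
    if parsed.version != 4:
        raise ValueError(f"expected a version 4 UUID, got {download_id!r}")
    # str(parsed) is the canonical lowercase 36-character form.
    return f"{BASE_URL}{parsed}.zip"


if __name__ == "__main__":
    # A freshly generated UUID stands in for a real download identifier.
    print(download_url(str(uuid.uuid4())))
```

Validating the identifier first means a mistyped or cut-off UUID fails loudly instead of producing a link that silently points at nothing.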

 

We recommend that researchers use services that have data archiving as their core mission, such as Data Dryad, Biodiversity Data Journal, or DataOne, to permanently archive their research data. iDigBio serves the roles of data aggregator and data distributor. For more information on archiving research data, including updates and augmentations you make, see our blog post Can iDigBio be my research data repository?

 

Looking at the contents of the zip file, the base file is meta.xml. This file describes what the other CSV files in the archive contain: both the concept represented by each row (occurrences or multimedia) and the values contained in each column. It is mostly used by machines and data linking processes to make sure the same concepts are matched between different data sources. For more information on meta.xml, see the DwC-A specification and the Global Biodiversity Information Facility’s (GBIF) DwC-A how-to document.
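To see what meta.xml says without opening it by hand, a short Python sketch like the one below can list each data file with its row type and column terms. It assumes the standard Darwin Core text layout (a core element plus optional extension elements in the `http://rs.tdwg.org/dwc/text/` namespace); the function name `describe_archive` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Darwin Core text guidelines for meta.xml elements.
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def describe_archive(meta_xml: bytes) -> dict:
    """Map each data file named in meta.xml to its row type and column terms."""
    root = ET.fromstring(meta_xml)
    tables = {}
    for table in root:
        # The core table and any extension tables share the same layout.
        if table.tag not in (DWC_TEXT_NS + "core", DWC_TEXT_NS + "extension"):
            continue
        location = table.find(f"{DWC_TEXT_NS}files/{DWC_TEXT_NS}location")
        if location is None or not location.text:
            continue
        fields = {}
        for field in table.findall(DWC_TEXT_NS + "field"):
            index = field.get("index")
            if index is not None:  # fields without an index hold default values
                fields[int(index)] = field.get("term")
        tables[location.text.strip()] = {
            "row_type": table.get("rowType"),
            "fields": fields,
        }
    return tables
```

The bytes to pass in can be read straight from a download with `zipfile.ZipFile(path).read("meta.xml")`, where `path` is wherever you saved the zip file.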

 

The data in the download is found in two pairs of files: occurrence.csv and occurrence_raw.csv, and multimedia.csv and multimedia_raw.csv. For most uses, the occurrence.csv and multimedia.csv files are the best ones to use.

 

To understand why, it helps to know what the *_raw.csv files contain. They have the data values from our providers exactly as the provider sends them to us. During the process of data ingestion, we treat all data as verbatim text. The contents of the *_raw.csv files are simply copies of this text with all the non-English language characters, punctuation, typos, placeholder values, and any other conventions that a provider might choose to use in their collections.

 

Why are these not the best data to use for research? Many data fields have constraints on them such as controlled vocabularies (taxonomy, countries), standardized representations (dates, geographic points), or valid data ranges (latitude and longitude). Much of the data that we receive does not fit these constraints. We pass many fields through a data interpretation process that adds or updates fields with the values that we believe the provider intended. Where it makes sense, we also convert the text into a typed representation such as numbers and dates so that things like searching by ranges of values will work. This interpreted data is what is used in our indexes, and it is what you search against by default in the portal, through the API, and with packages like ridigbio.
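To make the idea of a typed value with a valid range concrete, here is a small Python illustration. It is not iDigBio's interpretation code, only a hypothetical example of the kind of check described above: verbatim text becomes a number, and anything that cannot be a latitude is rejected.

```python
from typing import Optional


def parse_latitude(text: str) -> Optional[float]:
    """Convert verbatim text to a latitude, or None if it is not usable.

    Illustrative only; the actual interpretation rules are not shown here.
    """
    try:
        value = float(text.strip())
    except ValueError:
        # Placeholder values such as "unknown" cannot be parsed as numbers.
        return None
    # Latitudes outside -90..90 are invalid; this comparison also rejects NaN.
    if not -90.0 <= value <= 90.0:
        return None
    return value
```

With a typed value in hand, a range search such as "all records between 25 and 30 degrees north" becomes a numeric comparison rather than a text match.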

 

You can read more about the flags we use to indicate what data we have changed and how you can use this information in your data quality workflows in other posts.

 

The results of our standardizations are found in the occurrence.csv and multimedia.csv files. These contain our best guess at the information that our providers intended to convey. They also contain fewer fields because they represent what is in our high-speed searching index. iDigBio accepts any data field someone chooses to provide to us. This means providers can include data that is specific to their database or research projects, but it also means there are many fields that only one or a few providers send us. To cut back on what we index, we don’t include less common fields. We also try to limit the space used by our high-speed searching system, and it doesn’t make sense to search some of the larger data fields in any way other than full-text search.
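One way to start working with these files is to read them directly from the zip archive. The sketch below uses only the Python standard library; it assumes the CSV files are UTF-8 encoded with a header row, and the function name `iter_rows` is made up for this example.

```python
import csv
import io
import zipfile


def iter_rows(zip_path, member="occurrence.csv"):
    """Yield each row of one CSV file in a download as a dict of strings."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # Assumption: the file is UTF-8 text whose first line names the columns.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.DictReader(text)
```

For example, `sum(1 for _ in iter_rows(path))` counts the records in a download saved at `path` without unpacking it to disk, and passing `member="multimedia.csv"` reads the media records instead.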

 

Of course, the raw data is in the download as well, so you can see what changes we made, look at specialized fields providers sent us, or read through more verbose data fields.
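A rough way to see those changes is to line up the raw and interpreted rows and report the values that differ. The Python below is a sketch under assumptions: the rows are already loaded as dicts, both files share a key column, and the caller supplies the column names to compare. Check the headers in your own download rather than relying on any particular names.

```python
def changed_values(raw_rows, interpreted_rows, key, column_pairs):
    """List (record key, raw column, raw value, interpreted value) differences.

    key: name of a column present in both files that identifies a record.
    column_pairs: (raw column, interpreted column) names to compare.
    """
    raw_by_key = {row[key]: row for row in raw_rows}
    changes = []
    for row in interpreted_rows:
        raw = raw_by_key.get(row[key])
        if raw is None:
            continue  # no matching raw record to compare against
        for raw_col, interpreted_col in column_pairs:
            before = raw.get(raw_col, "")
            after = row.get(interpreted_col, "")
            if before != after:
                changes.append((row[key], raw_col, before, after))
    return changes
```

A difference here only shows that the two text values are not identical; it does not say why the value was changed.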

 

The citations.txt file is not described by the Darwin Core Archive specification. We added it to document how to cite the records included in the download. This file contains all the information you need to cite the data in the download according to GBIF’s citation suggestions and as described in our Terms of Use. This includes the number of records and the date accessed, the query (in iDigBio’s documented query format) that was used to select the records, and how many records came from each of our providers’ recordsets, listed by recordset identifier. These recordsets are ultimately where the citations should lead. It is critical that they are included in all citations so collections can track the use of their data and credit can be given to the work our providers do to make their data available.

 

While including the recordsets in your citations is sufficient for sharing credit for the work of collections, it does not make your analysis reproducible. In order for others to repeat your work (including yourself in the future!), you will need to either archive your complete dataset and cite its permanent identifier, or include in your publication a complete list of the occurrence ids and iDigBio record ids of all the specimens you used. A taxonomic treatment, which usually uses a very small number of specimens, is a great case where you should include the occurrence ids and iDigBio record ids of the specimens directly in your publication. Publishers like Pensoft provide online tools during their publication submission process that allow you to build links to online databases easily right at the time of submission.
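If you take the second route, a short script can pull the identifier pairs out of the rows you actually used. This Python sketch assumes the rows are already loaded as dicts; the two column names are parameters because the exact headers are not listed in this post, and the function name `id_table` is hypothetical.

```python
import csv
import io


def id_table(rows, occurrence_id_col, record_id_col):
    """Return CSV text listing each distinct (occurrence id, record id) pair."""
    pairs = sorted({(row[occurrence_id_col], row[record_id_col]) for row in rows})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Output header names are arbitrary labels chosen for this sketch.
    writer.writerow(["occurrence_id", "idigbio_record_id"])
    writer.writerows(pairs)
    return out.getvalue()
```

Sorting and de-duplicating the pairs gives a stable list that is easy to include as a supplementary file and to compare between versions of an analysis.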

 
For more information about our data processes and how to use our data, feel free to email help@idigbio.org.