Data Quality Toolkit 2024
Overview
This page aggregates common data quality issues found in collection management systems (CMS) and CMS-agnostic tools, along with potential solutions to those issues. Issues are grouped by data category, and each issue links to resources for identifying and fixing it.
This page was inspired by Bob Mesibov's Data Cleaner's Cookbook.
If you already know which tool or CMS you are using to clean your data, you can visit a tool- and CMS-specific toolkit: Arctos, Excel, OpenRefine, Specify, Symbiota, TaxonWorks.
Catalog Numbers and Other Identifiers
Duplicate Catalog Numbers
Problem: The same catalog number is used multiple times within your dataset. (Duplication may or may not be intentional, depending on your collection's policies, but it is generally best to avoid duplicate catalog numbers when possible.)
Solutions:
- Arctos
- Excel
- OpenRefine
- Specify
- Symbiota
- TaxonWorks
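Outside of the tools listed above, the same check can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from any of the linked toolkits. It assumes each record is a dictionary with a catalogNumber key, that values differing only by surrounding whitespace count as the same number, and that blank values should be ignored. The function name and sample records are hypothetical.

```python
from collections import Counter


def find_duplicate_catalog_numbers(records):
    """Return {catalog number: count} for numbers used more than once."""
    # Strip surrounding whitespace so "A-001 " and "A-001" are compared equal.
    values = [(r.get("catalogNumber") or "").strip() for r in records]
    # Blank catalog numbers are skipped rather than reported as duplicates.
    counts = Counter(v for v in values if v)
    return {number: n for number, n in counts.items() if n > 1}


# Hypothetical sample records for illustration only.
records = [
    {"catalogNumber": "A-001"},
    {"catalogNumber": "A-002"},
    {"catalogNumber": "A-001 "},
    {"catalogNumber": ""},
]
duplicates = find_duplicate_catalog_numbers(records)
print(duplicates)
```

Whether a reported duplicate is actually an error still depends on your collection's policies, so the output is best treated as a review list.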
Dates
Identified Date Earlier than Collected Date
Problem: The date the specimen was identified (dateIdentified) is earlier than the date the specimen was collected (eventDate).
Solutions:
- Arctos
- Excel
- OpenRefine
- Specify
- Symbiota
- TaxonWorks
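As a tool-independent illustration, the comparison can be sketched in Python. The sketch assumes both fields hold complete ISO 8601 dates (YYYY-MM-DD). Date ranges and other values that cannot be parsed as a single date are skipped rather than guessed at, so they would need separate review. The function names and sample records are hypothetical.

```python
from datetime import date


def parse_iso_date(value):
    """Return a date for a complete ISO date string, or None if unparseable."""
    try:
        return date.fromisoformat((value or "").strip())
    except ValueError:
        return None


def identified_before_collected(records):
    """Return records whose dateIdentified is earlier than their eventDate."""
    flagged = []
    for record in records:
        collected = parse_iso_date(record.get("eventDate"))
        identified = parse_iso_date(record.get("dateIdentified"))
        # Only compare when both dates parsed; otherwise skip the record.
        if collected and identified and identified < collected:
            flagged.append(record)
    return flagged


# Hypothetical sample records for illustration only.
records = [
    {"catalogNumber": "A-001", "eventDate": "2019-06-01", "dateIdentified": "2018-05-20"},
    {"catalogNumber": "A-002", "eventDate": "2019-06-01", "dateIdentified": "2020-01-15"},
    # A date range cannot be parsed as a single date, so this record is skipped.
    {"catalogNumber": "A-003", "eventDate": "2019-06-01/2019-06-05", "dateIdentified": "2018-01-01"},
]
flagged = [r["catalogNumber"] for r in identified_before_collected(records)]
print(flagged)
```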
Geography
Georeference Metadata with no Associated Georeference
Problem: Metadata fields regarding coordinates, such as coordinateUncertaintyInMeters, georeferenceProtocol, georeferenceSources, georeferencedBy, georeferenceRemarks, and geodeticDatum, are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferenceRemarks are used to indicate that a record was purposefully not georeferenced. However, the other metadata fields are rarely meaningful without associated coordinates (i.e., decimalLatitude, decimalLongitude, or verbatimCoordinates).
Solutions:
- Arctos
- Excel
- OpenRefine
- Specify
- Symbiota
- TaxonWorks
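The check described above can also be sketched in Python, independent of any CMS. This is an illustrative sketch under stated assumptions: records are dictionaries keyed by Darwin Core term names, and georeferencedBy and georeferenceRemarks are deliberately left out of the metadata list because, as noted above, they are sometimes filled in on purpose for records that were not georeferenced. The function names and sample records are hypothetical.

```python
COORDINATE_FIELDS = ("decimalLatitude", "decimalLongitude", "verbatimCoordinates")

# Assumption: georeferencedBy and georeferenceRemarks are excluded on purpose,
# since they may legitimately appear on records without coordinates.
METADATA_FIELDS = (
    "coordinateUncertaintyInMeters",
    "georeferenceProtocol",
    "georeferenceSources",
    "geodeticDatum",
)


def has_value(record, field):
    """True if the field is present and not blank (0 counts as a value)."""
    value = record.get(field)
    return value is not None and str(value).strip() != ""


def orphaned_georeference_metadata(records):
    """Return (record, filled metadata fields) pairs that lack coordinates."""
    flagged = []
    for record in records:
        has_coordinates = any(has_value(record, f) for f in COORDINATE_FIELDS)
        filled = [f for f in METADATA_FIELDS if has_value(record, f)]
        if filled and not has_coordinates:
            flagged.append((record, filled))
    return flagged


# Hypothetical sample records for illustration only.
records = [
    {"catalogNumber": "A-001", "geodeticDatum": "WGS84", "decimalLatitude": "", "decimalLongitude": ""},
    {"catalogNumber": "A-002", "geodeticDatum": "WGS84", "decimalLatitude": 35.1, "decimalLongitude": -106.6},
    {"catalogNumber": "A-003", "georeferenceRemarks": "locality too vague to georeference"},
]
flagged = [(r["catalogNumber"], fields) for r, fields in orphaned_georeference_metadata(records)]
print(flagged)
```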
Improperly Negated Latitudes/Longitudes
Problem: The sign of the latitude (decimalLatitude) or longitude (decimalLongitude) does not match the hemisphere of the given country. For example, all longitudes in the contiguous U.S. should be negative.
Solutions:
- Arctos
- Excel
- OpenRefine
- Specify
- Symbiota
- TaxonWorks
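One tool-independent way to sketch this check in Python is to compare each coordinate's sign against an expected sign per country. The lookup table below is a small hypothetical example keyed by the Darwin Core countryCode field, not an authoritative list. It only works for countries that sit mostly within one hemisphere, and even then there can be exceptions (for instance, the westernmost Aleutian Islands have positive longitudes), so flagged records should be reviewed rather than corrected automatically.

```python
# Hypothetical lookup: countryCode -> (expected latitude sign, expected longitude sign).
# Illustrative only; exceptions exist (e.g., the westernmost Aleutian Islands for "US").
EXPECTED_SIGNS = {
    "US": (1, -1),
    "CA": (1, -1),
    "AU": (-1, 1),
    "CL": (-1, -1),
}


def to_float(value):
    """Return the value as a float, or None if it is blank or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def sign_mismatches(records):
    """Return (record, fields with an unexpected sign) pairs."""
    flagged = []
    for record in records:
        code = (record.get("countryCode") or "").strip().upper()
        expected = EXPECTED_SIGNS.get(code)
        if expected is None:
            continue  # No expectation recorded for this country; skip it.
        problems = []
        fields = ("decimalLatitude", "decimalLongitude")
        for field, expected_sign in zip(fields, expected):
            value = to_float(record.get(field))
            # A negative product means the value's sign opposes the expected sign.
            if value is not None and value * expected_sign < 0:
                problems.append(field)
        if problems:
            flagged.append((record, problems))
    return flagged


# Hypothetical sample records for illustration only.
records = [
    {"catalogNumber": "A-001", "countryCode": "US", "decimalLatitude": "35.08", "decimalLongitude": "106.65"},
    {"catalogNumber": "A-002", "countryCode": "US", "decimalLatitude": "35.08", "decimalLongitude": "-106.65"},
    {"catalogNumber": "A-003", "countryCode": "AU", "decimalLatitude": "33.87", "decimalLongitude": "151.21"},
]
flagged = [(r["catalogNumber"], fields) for r, fields in sign_mismatches(records)]
print(flagged)
```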
Minimum and Maximum Elevation Values Mismatched
Problem: The minimum elevation (minimumElevationInMeters) has a greater value than the maximum elevation (maximumElevationInMeters).
Solutions:
- Arctos
- Excel
- OpenRefine
- Specify
- Symbiota
- TaxonWorks
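For a CMS-agnostic illustration, the comparison can be sketched in Python. The sketch assumes records are dictionaries keyed by Darwin Core term names and that elevations are stored as numbers or numeric strings. Records where either value is blank or non-numeric are skipped. The function names and sample records are hypothetical.

```python
def to_float(value):
    """Return the value as a float, or None if it is blank or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def elevation_mismatches(records):
    """Return records whose minimum elevation exceeds their maximum elevation."""
    flagged = []
    for record in records:
        low = to_float(record.get("minimumElevationInMeters"))
        high = to_float(record.get("maximumElevationInMeters"))
        # Compare numerically; comparing the raw strings would order "900" above "1200".
        if low is not None and high is not None and low > high:
            flagged.append(record)
    return flagged


# Hypothetical sample records for illustration only.
records = [
    {"catalogNumber": "A-001", "minimumElevationInMeters": "1200", "maximumElevationInMeters": "900"},
    {"catalogNumber": "A-002", "minimumElevationInMeters": "900", "maximumElevationInMeters": "1200"},
    {"catalogNumber": "A-003", "minimumElevationInMeters": "1500", "maximumElevationInMeters": ""},
]
flagged = [r["catalogNumber"] for r in elevation_mismatches(records)]
print(flagged)
```

In many cases the two values have simply been swapped, but that is worth confirming against the verbatim label data before editing.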
Missing Latitudes/Longitudes
Problem: A record has a latitude value, but not a longitude value, or vice versa.
Solutions:
- Arctos
- Excel
- OpenRefine
- Specify
- Symbiota
- TaxonWorks
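As a tool-independent illustration, this check amounts to flagging records where exactly one of the two coordinate fields is filled in. The Python sketch below assumes records are dictionaries keyed by Darwin Core term names and treats 0 as a valid coordinate value. The function names and sample records are hypothetical.

```python
def has_value(record, field):
    """True if the field is present and not blank (0 counts as a value)."""
    value = record.get(field)
    return value is not None and str(value).strip() != ""


def incomplete_coordinates(records):
    """Return records that have a latitude or a longitude, but not both."""
    flagged = []
    for record in records:
        has_latitude = has_value(record, "decimalLatitude")
        has_longitude = has_value(record, "decimalLongitude")
        # The two flags differ only when exactly one coordinate is present.
        if has_latitude != has_longitude:
            flagged.append(record)
    return flagged


# Hypothetical sample records for illustration only.
records = [
    {"catalogNumber": "A-001", "decimalLatitude": "35.08", "decimalLongitude": ""},
    {"catalogNumber": "A-002", "decimalLatitude": "35.08", "decimalLongitude": "-106.65"},
    {"catalogNumber": "A-003", "decimalLatitude": "", "decimalLongitude": ""},
    {"catalogNumber": "A-004", "decimalLongitude": 0},
]
flagged = [r["catalogNumber"] for r in incomplete_coordinates(records)]
print(flagged)
```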
Misspelled Geographic Unit Names
Problem: The geographic units (e.g., country, state/province, county) are misspelled, resulting in poor matching of geographic unit names to existing geographic lists.
Solutions:
- Arctos
- Excel
- OpenRefine
- Specify
- Symbiota
- TaxonWorks
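Outside of the tools above, one possible approach is approximate string matching against a reference list of accepted names. The Python sketch below uses difflib from the standard library. The reference list, the similarity cutoff of 0.8, and the function name are all assumptions chosen for illustration; in practice the reference list would come from whatever geographic authority your collection follows.

```python
from difflib import get_close_matches


def suggest_geographic_names(values, reference_names, cutoff=0.8):
    """Map each unmatched value to its closest reference name, or None."""
    suggestions = {}
    for value in values:
        if value in reference_names:
            continue  # Exact match; nothing to fix.
        # get_close_matches returns the best matches scoring at least `cutoff`.
        matches = get_close_matches(value, reference_names, n=1, cutoff=cutoff)
        suggestions[value] = matches[0] if matches else None
    return suggestions


# Hypothetical reference list and sample values for illustration only.
reference_names = ["United States", "Canada", "Mexico"]
values = ["United States", "Untied States", "Canada", "Atlantis"]
suggestions = suggest_geographic_names(values, reference_names)
print(suggestions)
```

A value mapped to None has no close match and needs manual review. Suggested matches should also be checked before they are applied, since two legitimate place names can be spelled very similarly.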
Taxonomy
Misspelled Taxonomic Names
Problem: Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.
Solutions:
- Arctos
- Excel
- OpenRefine
- Specify
- Symbiota
- TaxonWorks
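When no taxonomic database is at hand, a rough first pass is to look for rare names within your own dataset that closely resemble more frequently used names, similar in spirit to clustering values in a data cleaning tool. The Python sketch below is an illustration under stated assumptions: it takes a plain list of scientificName values, uses difflib from the standard library, and uses a similarity cutoff of 0.85 chosen arbitrarily for the example. It does not replace matching against a taxonomic authority, and the function name and sample names are hypothetical.

```python
from collections import Counter
from difflib import get_close_matches


def likely_misspellings(names, cutoff=0.85):
    """Map rarely used names to a similar, more frequently used name."""
    counts = Counter(n.strip() for n in names if n and n.strip())
    suggestions = {}
    for name, count in counts.items():
        # Only names that occur more often are considered as the "correct" form.
        more_common = [other for other, c in counts.items() if c > count]
        matches = get_close_matches(name, more_common, n=1, cutoff=cutoff)
        if matches:
            suggestions[name] = matches[0]
    return suggestions


# Hypothetical sample names for illustration only.
names = [
    "Quercus alba",
    "Quercus alba",
    "Quercus alba",
    "Quercus abla",
    "Quercus rubra",
    "Quercus rubra",
    "Acer rubrum",
]
suggestions = likely_misspellings(names)
print(suggestions)
```

Frequency is only a heuristic here: a rare name may be the correct one, and valid names in the same genus can differ by a single letter, so each suggestion needs review.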
Other Issues
Non-standardized basisOfRecord Values
Problem: Values in the basisOfRecord field do not match the recommended controlled vocabulary. While using standardized terms in this field is not strictly necessary, doing so improves the discoverability and interoperability of your data.
The currently accepted values for basisOfRecord include: MaterialEntity, PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation.
Note that even spacing, punctuation, and capitalization differences in these values (e.g., Preserved Specimen instead of PreservedSpecimen) are discouraged.
Solutions:
- Arctos
- Excel
- OpenRefine
- Specify
- Symbiota
- TaxonWorks
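As a tool-independent illustration, values can be checked against the accepted list given above, with a suggested correction when a value differs only in spacing, punctuation, or capitalization. The Python sketch below copies the accepted values from this page; the function name and the normalization rule (drop everything except letters, then compare case-insensitively) are assumptions made for the example.

```python
import re

# Accepted values as listed on this page.
ACCEPTED_VALUES = [
    "MaterialEntity",
    "PreservedSpecimen",
    "FossilSpecimen",
    "LivingSpecimen",
    "MaterialSample",
    "Event",
    "HumanObservation",
    "MachineObservation",
    "Taxon",
    "Occurrence",
    "MaterialCitation",
]

# Lowercased accepted value -> canonical spelling.
CANONICAL = {value.lower(): value for value in ACCEPTED_VALUES}


def check_basis_of_record(value):
    """Return (is_valid, suggestion) for a basisOfRecord value."""
    value = value or ""
    if value in ACCEPTED_VALUES:
        return True, value
    # Drop spaces, punctuation, and digits, then ignore case when matching.
    key = re.sub(r"[^a-z]", "", value.lower())
    return False, CANONICAL.get(key)


# Hypothetical sample values for illustration only.
results = {v: check_basis_of_record(v) for v in ["PreservedSpecimen", "Preserved Specimen", "Specimen"]}
print(results)
```

A suggestion of None means the value does not resemble any accepted term and needs to be mapped by hand.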