Data Quality Toolkit 2024
Latest revision as of 12:14, 25 May 2024
Overview
This page was created to aggregate common data quality issues and potential solutions to those issues in collection management systems and CMS-agnostic tools. Data quality issues are grouped into data categories, and links to resources for identifying and fixing the issues are provided.
This page was inspired by Bob Mesibov's Data Cleaner's Cookbook, GBIF's data quality flags, and iDigBio's data quality flags.
If you already know which tool or CMS you are using to clean your data, you can visit a tool- and CMS-specific toolkit: Arctos, Excel, Specify, Symbiota, or TaxonWorks (Data Quality Help and Hints). Additional command-line tools can be found in Bob Mesibov's Darwin Core Checker tool.
Catalog Numbers and Other Identifiers
Duplicate Catalog Numbers
Problem: The same catalog number is used multiple times within your dataset. (Duplication may or may not be intentional, depending on your collection's policies. It is generally best not to duplicate catalog numbers when possible.)
Solutions:
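As a CMS-agnostic illustration, the check can be sketched in Python. This is a minimal sketch, not a method from any of the toolkits named on this page: it assumes each record is a dictionary with a catalogNumber key, and the sample rows are made up.

```python
from collections import Counter

def find_duplicate_catalog_numbers(records):
    """Return catalog numbers that occur more than once, with their counts."""
    # Strip whitespace so "A-001 " and "A-001" count as the same value.
    counts = Counter((rec.get("catalogNumber") or "").strip() for rec in records)
    # Ignore blanks: a missing catalog number is a separate problem.
    return {num: n for num, n in counts.items() if num and n > 1}

# Made-up sample rows for illustration only.
records = [
    {"catalogNumber": "A-001"},
    {"catalogNumber": "A-002"},
    {"catalogNumber": "A-001 "},
    {"catalogNumber": ""},
]
print(find_duplicate_catalog_numbers(records))  # {'A-001': 2}
```

Whether a reported duplicate is an error still depends on your collection's policies, so the output is best treated as a review list.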
Dates
Date Hasn't Happened Yet
Problem: The date the specimen was identified, collected (often designated using the eventDate field), or georeferenced is in the future.
Solutions:
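A minimal Python sketch of this check follows. It assumes the date is a complete ISO 8601 value (YYYY-MM-DD); date ranges and partial dates would need extra parsing. The function name and the optional today parameter are illustrative.

```python
from datetime import date

def is_future_date(iso_date, today=None):
    """Return True if a YYYY-MM-DD date falls after today."""
    # Allow the reference date to be passed in so the check is repeatable.
    today = today or date.today()
    return date.fromisoformat(iso_date) > today

print(is_future_date("2999-01-01"))
```

The same function can be applied to eventDate, dateIdentified, or a georeferenced date, one field at a time.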
Date is Suspiciously Old
Problem: The date the specimen was identified, collected (often designated using the eventDate field), or georeferenced is outside the expected historical date range. The expected date range depends on the institution, but most collections are unlikely to hold specimens dated before 1600.
Solutions:
Identified Date Earlier than Collected Date
Problem: The date the specimen was identified (dateIdentified field) is earlier than the date the specimen was collected (eventDate).
Solutions:
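The comparison described above can be sketched in Python as follows. This is an illustrative sketch that assumes both fields hold complete ISO 8601 dates (YYYY-MM-DD) and that records are dictionaries keyed by field name; records missing either date are skipped rather than flagged.

```python
from datetime import date

def identified_before_collected(record):
    """Return True if dateIdentified is earlier than eventDate."""
    identified = record.get("dateIdentified")
    collected = record.get("eventDate")
    # Without both dates there is nothing to compare.
    if not identified or not collected:
        return False
    return date.fromisoformat(identified) < date.fromisoformat(collected)

print(identified_before_collected(
    {"dateIdentified": "1998-03-01", "eventDate": "1999-06-15"}
))  # True
```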
Year, Month, and Day Values Do Not Match Date
Problem: The event year, month, and day values do not match the provided event date. The event date is often the date of collection for preserved specimens.
Solutions:
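One way to test for this mismatch, sketched in Python. The sketch assumes the record is a dictionary with eventDate as a complete YYYY-MM-DD value and separate year, month, and day fields that are all populated; those field names are assumptions for illustration.

```python
from datetime import date

def ymd_matches_event_date(record):
    """Return True if year, month, and day agree with eventDate."""
    parsed = date.fromisoformat(record["eventDate"])
    # Convert to int so "7" and "07" compare equal.
    parts = (int(record["year"]), int(record["month"]), int(record["day"]))
    return parts == (parsed.year, parsed.month, parsed.day)

good = {"eventDate": "2019-07-04", "year": "2019", "month": "7", "day": "4"}
swapped = {"eventDate": "2019-07-04", "year": "2019", "month": "4", "day": "7"}
print(ymd_matches_event_date(good), ymd_matches_event_date(swapped))  # True False
```

Swapped month and day values, as in the second record, are a common cause of this mismatch.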
Geography
Coordinates are Zero
Problem: The provided latitude and longitude values are 0.
Solutions:
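A short Python sketch of this check, assuming records are dictionaries with decimalLatitude and decimalLongitude keys. Values that are missing or not numeric are left to other checks rather than flagged here.

```python
def coordinates_are_zero(record):
    """Return True if both latitude and longitude are exactly 0."""
    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        # Missing or non-numeric values are a different problem.
        return False
    return lat == 0.0 and lon == 0.0

print(coordinates_are_zero({"decimalLatitude": "0", "decimalLongitude": "0.000"}))  # True
```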
Coordinates Do Not Fall Within Named Geographic Unit
Problem: The provided coordinates do not fall within the geographic boundaries of the named country, state, and/or county.
Solutions:
Georeference Metadata with no Associated Georeference
Problem: Metadata fields regarding coordinates, such as coordinateUncertaintyInMeters, georeferenceProtocol, georeferenceSources, georeferencedBy, georeferenceRemarks, and geodeticDatum, are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferenceRemarks are used to indicate whether a record was purposefully not georeferenced. However, it is rare that the other metadata fields can be used without associated coordinates (i.e., decimalLatitude, decimalLongitude, or verbatimCoordinates).
Solutions:
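The rule above can be sketched in Python as follows. This is an illustrative sketch that assumes records are dictionaries keyed by the field names used on this page. georeferencedBy and georeferenceRemarks are deliberately left out of the metadata list, because they may legitimately appear without coordinates.

```python
# Metadata fields that rarely make sense without coordinates.
METADATA_FIELDS = (
    "coordinateUncertaintyInMeters",
    "georeferenceProtocol",
    "georeferenceSources",
    "geodeticDatum",
)
COORDINATE_FIELDS = ("decimalLatitude", "decimalLongitude", "verbatimCoordinates")

def _has_value(record, field):
    value = record.get(field)
    return value is not None and str(value).strip() != ""

def has_orphaned_georeference_metadata(record):
    """Return True if metadata is present but no coordinate field is."""
    has_metadata = any(_has_value(record, f) for f in METADATA_FIELDS)
    has_coordinates = any(_has_value(record, f) for f in COORDINATE_FIELDS)
    return has_metadata and not has_coordinates

print(has_orphaned_georeference_metadata({"geodeticDatum": "WGS84"}))  # True
```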
Elevation is Unlikely
Problem: Elevation values are either too high (above 17000 m) or too low (below -11000 m) to occur on Earth.
Solutions:
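A minimal Python sketch of this range check, using the thresholds given in the problem statement. The constant and function names are illustrative, and the input is assumed to be a number or numeric string in meters.

```python
MAX_ELEVATION_M = 17000   # upper threshold from the problem statement
MIN_ELEVATION_M = -11000  # lower threshold from the problem statement

def elevation_is_unlikely(value):
    """Return True if an elevation in meters falls outside the thresholds."""
    elevation = float(value)
    return elevation > MAX_ELEVATION_M or elevation < MIN_ELEVATION_M

print(elevation_is_unlikely("20000"), elevation_is_unlikely("1500"))  # True False
```

Values in feet that were stored in a meters field are one possible cause worth checking when a record is flagged.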
Improperly Negated Latitudes/Longitudes
Problem: The sign of the latitude (decimalLatitude) or longitude (decimalLongitude) does not match the sign/hemisphere of the given country. For example, all longitudes in the contiguous U.S. should be negative.
Solutions:
Invalid Coordinates
Problem: Coordinates deviate from accepted ranges or formats, such as decimalLatitude falling outside -90 to 90 or decimalLongitude falling outside -180 to 180. verbatimCoordinates must be valid coordinates in decimal degrees, degrees decimal minutes, or degrees minutes seconds.
Solutions:
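The decimal-degree range check can be sketched in Python as follows. This sketch covers only decimalLatitude and decimalLongitude; validating verbatimCoordinates formats would need a separate parser. The function name is illustrative.

```python
def coordinates_in_range(lat, lon):
    """Return True if lat is within -90..90 and lon within -180..180."""
    try:
        lat = float(lat)
        lon = float(lon)
    except (TypeError, ValueError):
        # Non-numeric input cannot be a valid decimal coordinate.
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180

print(coordinates_in_range("45.0", "-120.5"), coordinates_in_range("91", "0"))  # True False
```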
Lower Geography Values are Provided, but No Higher Geography
Problem: Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.
Solutions:
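A short Python sketch of this check. It assumes records are dictionaries with country, stateProvince, and county keys; those field names are assumptions for illustration and may differ in your system.

```python
LOWER_GEOGRAPHY_FIELDS = ("stateProvince", "county")

def missing_higher_geography(record):
    """Return True if a lower geography value exists but country is blank."""
    has_lower = any(
        (record.get(field) or "").strip() for field in LOWER_GEOGRAPHY_FIELDS
    )
    has_country = bool((record.get("country") or "").strip())
    return has_lower and not has_country

print(missing_higher_geography({"stateProvince": "Ontario", "country": ""}))  # True
```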
Minimum and Maximum Elevation Values Mismatched
Problem: The minimum elevation (minimumElevationInMeters) has a greater value than the maximum elevation (maximumElevationInMeters).
Solutions:
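This comparison can be sketched in Python, assuming records are dictionaries and both elevation fields hold numeric values when present. Records missing either value are skipped.

```python
def elevation_range_is_inverted(record):
    """Return True if the minimum elevation exceeds the maximum elevation."""
    low = record.get("minimumElevationInMeters")
    high = record.get("maximumElevationInMeters")
    # Skip records where either bound is missing.
    if low in (None, "") or high in (None, ""):
        return False
    return float(low) > float(high)

print(elevation_range_is_inverted(
    {"minimumElevationInMeters": "900", "maximumElevationInMeters": "850"}
))  # True
```

Converting to float matters here: compared as text, "900" would sort after "1200".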
Mismatched Country and CountryCode Values
Problem: The provided values for country and countryCode do not match.
Solutions:
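A minimal Python sketch of this comparison. The lookup table is a three-entry illustration, not a complete list; a real check would load the full set of ISO 3166-1 alpha-2 codes and the country-name spellings your collection uses.

```python
# Illustrative subset only; the name spellings are assumptions.
COUNTRY_CODES = {"Canada": "CA", "Mexico": "MX", "United States": "US"}

def country_matches_code(record):
    """Return True/False for a match, or None if the country is not in the table."""
    country = (record.get("country") or "").strip()
    code = (record.get("countryCode") or "").strip().upper()
    expected = COUNTRY_CODES.get(country)
    if expected is None:
        return None  # unknown country name: cannot judge
    return expected == code

print(country_matches_code({"country": "Canada", "countryCode": "US"}))  # False
```

Returning None for unknown names keeps misspelled countries from being reported as code mismatches.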
Mismatched Geographic Terms
Problem: A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.
Solutions:
Missing Geodetic Datum
Problem: The geodetic datum is a key part of a properly georeferenced specimen record, but it is usually left blank. Although the datum is commonly assumed to be WGS84, it should be recorded explicitly.
Solutions:
Missing Latitudes/Longitudes
Problem: A record has a latitude value, but not a longitude value, or vice versa.
Solutions:
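A short Python sketch of this check, assuming records are dictionaries with decimalLatitude and decimalLongitude keys. It flags records where exactly one of the two values is present.

```python
def _has_value(record, field):
    value = record.get(field)
    return value is not None and str(value).strip() != ""

def has_unpaired_coordinate(record):
    """Return True if latitude is present without longitude, or vice versa."""
    has_lat = _has_value(record, "decimalLatitude")
    has_lon = _has_value(record, "decimalLongitude")
    # True only when exactly one of the two is filled in.
    return has_lat != has_lon

print(has_unpaired_coordinate({"decimalLatitude": "12.5", "decimalLongitude": ""}))  # True
```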
Misspelled Geographic Unit Names
Problem: The geographic units (e.g., country, state/province, county) are misspelled, resulting in poor matching of geographic unit names to existing geographic lists.
Solutions:
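Fuzzy matching against a list of known names is one way to catch misspellings. The Python sketch below uses difflib.get_close_matches from the standard library; the three-name reference list and the 0.8 cutoff are assumptions for illustration, and a real check would use an authoritative gazetteer.

```python
import difflib

# Illustrative reference list only.
KNOWN_STATES = ["California", "Colorado", "Connecticut"]

def suggest_geographic_name(value, known_names, cutoff=0.8):
    """Return close matches for a name that is not in the known list."""
    if value in known_names:
        return []  # already spelled as expected
    return difflib.get_close_matches(value, known_names, n=1, cutoff=cutoff)

print(suggest_geographic_name("Califronia", KNOWN_STATES))  # ['California']
```

Suggestions are best reviewed by a person before being applied, since a close match is not always the intended name.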
Taxonomy
Misspelled or Invalid Taxonomic Names
Problem: Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.
Solutions:
Unknown Higher Taxonomy
Problem: Records identified to species may be missing higher taxonomic information (e.g., family or order).
Solutions:
Other Issues
Incorrect Character Encodings
Problem: Data inconsistencies arise when incorrect character encodings are used during data manipulation or transfer. This issue occurs when datasets are opened, downloaded, or imported across different software platforms, leading to misinterpretation and garbled text. For instance, special characters such as accents or symbols (e.g., the é in Carl Linné) may be rendered incorrectly, affecting the readability and accuracy of the data.
Solutions:
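To illustrate how this garbling arises and how it can sometimes be reversed, here is a Python sketch. It assumes one specific failure: UTF-8 text that was mistakenly decoded as Latin-1. Other encoding mix-ups need different handling, and the function name is illustrative.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit this failure pattern; leave it unchanged.
        return text

# Reproduce the problem: UTF-8 bytes read with the wrong encoding.
garbled = "Carl Linn\u00e9".encode("utf-8").decode("latin-1")
print(garbled)                   # Carl LinnÃ©
print(repair_mojibake(garbled))  # Carl Linné
```

Correctly encoded text passes through unchanged, because re-decoding it as UTF-8 fails and the original string is returned.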
Incorrect Line Endings
Problem: When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.
Solutions:
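Normalizing line endings can be sketched in Python as follows. This minimal sketch converts everything to LF; as an added assumption beyond the text above, it also converts lone CR characters.

```python
def normalize_line_endings(text):
    """Convert CRLF (and lone CR) line endings to LF."""
    # Replace CRLF first so the lone-CR pass does not double-convert.
    return text.replace("\r\n", "\n").replace("\r", "\n")

print(repr(normalize_line_endings("a\r\nb\r\nc\n")))  # 'a\nb\nc\n'
```

When reading a file to inspect its line endings in Python, opening it with newline="" keeps the original characters instead of translating them on read.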
Invalid Individual Count
Problem: individualCount values are not positive integers.
Solutions:
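A minimal Python sketch of this validation. It accepts a value only if it consists entirely of ASCII digits and is greater than zero; the function name is illustrative.

```python
def is_valid_individual_count(value):
    """Return True if the value is a positive whole number."""
    text = str(value).strip()
    # isascii() rules out characters such as superscript digits,
    # which isdigit() accepts but int() cannot parse.
    return text.isascii() and text.isdigit() and int(text) > 0

for sample in ["3", "0", "-2", "2.5", ""]:
    print(repr(sample), is_valid_individual_count(sample))
```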
Non-standardized BasisOfRecord Values
Problem: Values in the BasisOfRecord field do not match the recommended controlled vocabulary. While using standardized terms in this field is not strictly necessary, doing so does improve the discoverability and interoperability of your data.
The currently accepted values for BasisOfRecord include: MaterialEntity, PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation.
Note that even spacing, punctuation, and capitalization differences in these values (e.g., Preserved Specimen) are discouraged.
Solutions:
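Checking values against the list above can be sketched in Python. The sketch uses the accepted values exactly as listed on this page. It returns None for an exact match or a value that cannot be matched, and returns the accepted spelling when the value differs only by spacing, punctuation, or capitalization. The function name is illustrative.

```python
ACCEPTED_BASIS_OF_RECORD = [
    "MaterialEntity", "PreservedSpecimen", "FossilSpecimen", "LivingSpecimen",
    "MaterialSample", "Event", "HumanObservation", "MachineObservation",
    "Taxon", "Occurrence", "MaterialCitation",
]

def suggest_basis_of_record(value):
    """Return the accepted term a non-standard value most likely means."""
    if value in ACCEPTED_BASIS_OF_RECORD:
        return None  # already standardized
    # Drop spaces and punctuation, then compare case-insensitively.
    normalized = "".join(ch for ch in value if ch.isalnum()).lower()
    for term in ACCEPTED_BASIS_OF_RECORD:
        if term.lower() == normalized:
            return term
    return None  # no confident suggestion

print(suggest_basis_of_record("Preserved Specimen"))  # PreservedSpecimen
```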