Excel Data Quality Toolkit: Difference between revisions

m
 
(46 intermediate revisions by 4 users not shown)
Line 29: Line 29:
'''Problem:''' The date the specimen was [https://dwc.tdwg.org/terms/#dwc:dateIdentified identified], collected (often designated using the [https://dwc.tdwg.org/terms/#dwc:eventDate eventDate] field), or [https://dwc.tdwg.org/terms/#dwc:georeferencedDate georeferenced] is in the future.
'''Problem:''' The date the specimen was [https://dwc.tdwg.org/terms/#dwc:dateIdentified identified], collected (often designated using the [https://dwc.tdwg.org/terms/#dwc:eventDate eventDate] field), or [https://dwc.tdwg.org/terms/#dwc:georeferencedDate georeferenced] is in the future.


'''Solution:'''
'''Solution:''' Check that dates are before today with this formula where A1 is the date you want to test.
 
=IF(A1>TODAY(),"ERROR","OK")
 
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


=== Date is Suspiciously Old ===
=== Date is Suspiciously Old ===
Line 46: Line 50:
'''Problem:''' The date the specimen was identified (dateIdentified field) is earlier than the date the specimen was collected (eventDate).
'''Problem:''' The date the specimen was identified (dateIdentified field) is earlier than the date the specimen was collected (eventDate).


'''Solution:'''
'''Solution:''' Check that collection dates are before identification dates with this formula where A1 is the identification date and B1 is the collection date.
 
=IF(A1<B1,"ERROR","OK")
 
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


=== Year, Month, and Day Values Do Not Match Date ===
=== Year, Month, and Day Values Do Not Match Date ===
'''Problem:''' The event [https://dwc.tdwg.org/terms/#dwc:year year], [https://dwc.tdwg.org/terms/#dwc:month month], and [https://dwc.tdwg.org/terms/#dwc:day day] values do not match the provided [https://dwc.tdwg.org/terms/#dwc:eventDate event date]. The event date is often the date of collection for preserved specimens.
'''Problem:''' The event [https://dwc.tdwg.org/terms/#dwc:year year], [https://dwc.tdwg.org/terms/#dwc:month month], and [https://dwc.tdwg.org/terms/#dwc:day day] values do not match the provided [https://dwc.tdwg.org/terms/#dwc:eventDate event date]. The event date is often the date of collection for preserved specimens.


'''Solution:'''
'''Solution:''' Avoid this problem by letting Excel calculate year, month, and day for you using these formulas where A1 is the event date.
 
=YEAR(A1)
 
=MONTH(A1)
=DAY(A1)
 
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


== Geography ==
== Geography ==
Line 65: Line 81:
# At the bottom of the pop-up window, change the filter criterion "Choose One" to "Equals", enter "0" in the box to the right, and select "Apply Filter".
# At the bottom of the pop-up window, change the filter criterion "Choose One" to "Equals", enter "0" in the box to the right, and select "Apply Filter".
# If desired, steps 3-4 can be repeated on the column containing your longitude values (e.g. dwc:[https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude]).
# If desired, steps 3-4 can be repeated on the column containing your longitude values (e.g. dwc:[https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude]).
# Proceed to correct coordinate values as appropriate.


=== Coordinates Do Not Fall Within Named Geographic Unit ===
=== Coordinates Do Not Fall Within Named Geographic Unit ===
Line 71: Line 86:


'''Solution:'''
'''Solution:'''
Excel does not have an easy solution for this problem.


=== Georeference Metadata with no Associated Georeference ===
=== Georeference Metadata with no Associated Georeference ===
'''Problem:''' Metadata fields regarding coordinates, such as [https://dwc.tdwg.org/terms/#dwc:coordinateUncertaintyInMeters coordinateUncertaintyInMeters], [https://dwc.tdwg.org/terms/#dwc:georeferenceProtocol georeferenceProtocol], [https://dwc.tdwg.org/terms/#dwc:georeferenceSources georeferenceSources], [https://dwc.tdwg.org/terms/#dwc:georeferencedBy georeferencedBy], [https://dwc.tdwg.org/terms/#dwc:georeferenceRemarks georeferenceRemarks], and [https://dwc.tdwg.org/terms/#dwc:geodeticDatum geodeticDatum] are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferencedRemarks are used to indicate whether a record was purposefully not georeferenced. However, it is rare that the other metadata fields can be used without associated coordinates (i.e., [https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude], [ https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude], or [https://dwc.tdwg.org/terms/#dwc:verbatimCoordinates verbatimCoordinates]).
'''Problem:''' Metadata fields regarding coordinates, such as [https://dwc.tdwg.org/terms/#dwc:coordinateUncertaintyInMeters coordinateUncertaintyInMeters], [https://dwc.tdwg.org/terms/#dwc:georeferenceProtocol georeferenceProtocol], [https://dwc.tdwg.org/terms/#dwc:georeferenceSources georeferenceSources], [https://dwc.tdwg.org/terms/#dwc:georeferencedBy georeferencedBy], [https://dwc.tdwg.org/terms/#dwc:georeferenceRemarks georeferenceRemarks], and [https://dwc.tdwg.org/terms/#dwc:geodeticDatum geodeticDatum] are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferencedRemarks are used to indicate whether a record was purposefully not georeferenced. However, it is rare that the other metadata fields can be used without associated coordinates (i.e., [https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude], [ https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude], or [https://dwc.tdwg.org/terms/#dwc:verbatimCoordinates verbatimCoordinates]).


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Elevation is Unlikely ===
=== Elevation is Unlikely ===
Line 85: Line 101:
# From the Data toolbar, click "Filter".
# From the Data toolbar, click "Filter".
# Navigate to the column containing your elevation values (e.g. dwc:[https://dwc.tdwg.org/terms/#dwc:minimumElevationInMeters minimumElevationInMeters], dwc:[https://dwc.tdwg.org/terms/#dwc:maximumElevationInMeters maximumElevationInMeters]) and select the arrow icon next to the column header indicating that this column can now be filtered on.
# Navigate to the column containing your elevation values (e.g. dwc:[https://dwc.tdwg.org/terms/#dwc:minimumElevationInMeters minimumElevationInMeters], dwc:[https://dwc.tdwg.org/terms/#dwc:maximumElevationInMeters maximumElevationInMeters]) and select the arrow icon next to the column header indicating that this column can now be filtered on.
# At the bottom of the pop-up window, change the filter criterion "Choose One" to "Between", set "Less Than" = "0" and "Greater Than" = "-17000", and select "Apply Filter".
# At the bottom of the pop-up window, change the filter criterion "Choose One" to "Between", set "Less Than" = "-11000" and "Greater Than" = "17000", and select "Apply Filter".


=== Improperly Negated Latitudes/Longitudes ===
=== Improperly Negated Latitudes/Longitudes ===
'''Problem:''' The sign of the latitude ([https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude]) or longitude ([https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude]) does not match the sign/hemisphere of the given country. For example, all longitudes in the U.S. should be negative.
'''Problem:''' The sign of the latitude ([https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude]) or longitude ([https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude]) does not match the sign/hemisphere of the given country. For example, all longitudes in the U.S. should be negative.


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Invalid Coordinates ===
=== Invalid Coordinates ===
'''Problem:''' Coordinates deviate from accepted ranges or formats, like decimalLatitude and decimalLongitude exceeding -90 to 90 and -180 to 180, respectively. verbatimCoordinates have to be valid values for coordinates in decimal degrees, degrees decimal minutes, degrees minutes second.  
'''Problem:''' Coordinates deviate from accepted ranges or formats, like decimalLatitude and decimalLongitude exceeding -90 to 90 and -180 to 180, respectively. verbatimCoordinates have to be valid values for coordinates in decimal degrees, degrees decimal minutes, degrees minutes second.  


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Lower Geography Values are Provided, but No Higher Geography ===
=== Lower Geography Values are Provided, but No Higher Geography ===
'''Problem:''' Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.
'''Problem:''' Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.


'''Solution:'''
'''Solution:''' Check for missing higher levels of geography with this formula where A1 is the country, B1 is the stateProvince, and C1 is the county.
 
=IF(ISBLANK(C1),IF(ISBLANK(B1),IF(ISBLANK(A1),"no geography","OK"),IF(ISBLANK(A1),"missing country","OK")),IF(ISBLANK(B1),IF(ISBLANK(A1),"missing state and country","missing state"),IF(ISBLANK(A1),"no country","OK")))
 
{|class="wikitable"
|+ Sample Results
|-
! country !! stateProvince !! county !! check
|-
| || Texas || Kerr|| no country
|-
| USA || || Kerr|| missing state
|-
| || Texas || || missing country
|-
| USA || || || OK
|-
| || || Kerr || missing state and country
|-
| USA || Texas || Kerr || OK
|}
 
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


=== Minimum and Maximum Elevation Values Mismatched ===
=== Minimum and Maximum Elevation Values Mismatched ===
'''Problem:''' The minimum elevation ([https://dwc.tdwg.org/terms/#dwc:minimumElevationInMeters minimumElevationInMeters]) has a greater value than the maximum elevation ([https://dwc.tdwg.org/terms/#dwc:maximumElevationInMeters maximumElevationInMeters]).
'''Problem:''' The minimum elevation ([https://dwc.tdwg.org/terms/#dwc:minimumElevationInMeters minimumElevationInMeters]) has a greater value than the maximum elevation ([https://dwc.tdwg.org/terms/#dwc:maximumElevationInMeters maximumElevationInMeters]).


'''Solution:'''
'''Solution:''' Check that minimum elevation is below maximum elevation with this formula where A1 is the minimum elevation and B1 is the maximum elevation.
 
=IF(A1>B1,"ERROR","OK")
 
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


=== Mismatched Country and CountryCode Values ===
=== Mismatched Country and CountryCode Values ===
'''Problem:''' The provided value for [https://dwc.tdwg.org/terms/#dwc:country country] and [https://dwc.tdwg.org/terms/#dwc:countryCode countryCode] do not match.
'''Problem:''' The provided value for [https://dwc.tdwg.org/terms/#dwc:country country] and [https://dwc.tdwg.org/terms/#dwc:countryCode countryCode] do not match.


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Mismatched Geographic Terms ===
=== Mismatched Geographic Terms ===
'''Problem:''' A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.
'''Problem:''' A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Missing Geodetic Datum ===
=== Missing Geodetic Datum ===
Line 126: Line 168:
# Navigate to the column containing your geodetic datum values (e.g. dwc:[https://dwc.tdwg.org/terms/#dwc:geodeticDatum geodeticDatum]) and select the arrow icon next to the column header indicating that this column can now be filtered on.
# Navigate to the column containing your geodetic datum values (e.g. dwc:[https://dwc.tdwg.org/terms/#dwc:geodeticDatum geodeticDatum]) and select the arrow icon next to the column header indicating that this column can now be filtered on.
# At the bottom of the pop-up window, deselect "Select All" and then only check the "Blanks" box.
# At the bottom of the pop-up window, deselect "Select All" and then only check the "Blanks" box.
# Proceed to add geodetic datum values as appropriate where missing.


=== Missing Latitudes/Longitudes ===
=== Missing Latitudes/Longitudes ===
Line 153: Line 194:
'''Problem:''' The geographic units (e.g., [https://dwc.tdwg.org/terms/#dwc:country country], [https://dwc.tdwg.org/terms/#dwc:stateProvince state/province], [https://dwc.tdwg.org/terms/#dwc:county county]) are misspelled, resulting in poor matching of geographic unit names to existing geographic lists.
'''Problem:''' The geographic units (e.g., [https://dwc.tdwg.org/terms/#dwc:country country], [https://dwc.tdwg.org/terms/#dwc:stateProvince state/province], [https://dwc.tdwg.org/terms/#dwc:county county]) are misspelled, resulting in poor matching of geographic unit names to existing geographic lists.


'''Solution:'''
'''Solution:''' Avoid this issue with a pick list as data are entered. For a tutorial see [https://support.microsoft.com/en-us/office/create-a-drop-down-list-7693307a-59ef-400a-b769-c5402dce407b Create a drop-down list]
 
If data is already entered, check for non-standard values with this formula Where A1 is the geographic unit value being tested and the list to test against is on the geographic unit tab in cells A1 through A11. Any value in column A that does not match a value in the list to test against will result in #N/A.
 
'''WARNING!''' Note the '''$''' in front of the letters and numbers that comprise the list to test against. Leaving these off may result in false errors as the test will run against blank cells if the formula is copied past row 11 (or whatever the last row number is in the list to test against), so make sure those $ are in place!
 
=VLOOKUP(A1,'geographic unit'!$A$1:$A$11,1,FALSE)
 
This same process can be used to solve issues. Get a list of unique values in your geographic unit column by copying the entire column to column A in a new tab, highlight the copied data and from the main Excel menu select Data->Remove duplicates. In column B add the correct scientific name that should be used for every term in column A even if the two are the same.
 
Use a LOOKUP like the one above to get the correct scientific name for every row in your file.
 
=VLOOKUP(A1,'unique geographic unit'!$A$1:$B$11,2,FALSE)
 
Note that the unique values to check are in column A of the unique geographic unit tab and the correct replacement values are in column B of the unique geographic unit tab. Also note that the value being returned is from column B indicated as the 2nd column in the array.


== Taxonomy ==
== Taxonomy ==
Line 160: Line 215:
'''Problem:''' Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.
'''Problem:''' Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.


'''Solution:'''
'''Solution:''' Avoid this issue with a pick list as data are entered. For a tutorial see [https://support.microsoft.com/en-us/office/create-a-drop-down-list-7693307a-59ef-400a-b769-c5402dce407b Create a drop-down list]
 
If data is already entered, check for non-standard values with this formula Where A1 is the scientific name value being tested and the list to test against is on the taxa tab in cells A1 through A11. Any value in column A that does not match a value in the list to test against will result in #N/A.
 
'''WARNING!''' Note the '''$''' in front of the letters and numbers that comprise the list to test against. Leaving these off may result in false errors as the test will run against blank cells if the formula is copied past row 11 (or whatever the last row number is in the list to test against), so make sure those $ are in place!
 
=VLOOKUP(A1,'taxa'!$A$1:$A$11,1,FALSE)
 
This same process can be used to solve issues. Get a list of unique values in your scientific name column by copying the entire column to column A in a new tab, highlight the copied data and from the main Excel menu select Data->Remove duplicates. In column B add the correct scientific name that should be used for every term in column A even if the two are the same.
 
Use a LOOKUP like the one above to get the correct scientific name for every row in your file.
 
=VLOOKUP(A1,'unique taxa'!$A$1:$B$11,2,FALSE)
 
Note that the unique values to check are in column A of the unique taxa tab and the correct replacement values are in column B of the unique taxa tab. Also note that the value being returned is from column B indicated as the 2nd column in the array.


=== Unknown Higher Taxonomy ===
=== Unknown Higher Taxonomy ===
Line 166: Line 235:


'''Solution:'''
'''Solution:'''
# Open your dataset in Excel and highlight Row 1.
# From the Data toolbar, click "Filter".
# Navigate to the column containing your higher taxonomy values (e.g. dwc:order) and select the arrow icon next to the column header indicating that this column can now be filtered on.
# At the bottom of the pop-up window, deselect "Select All", scroll down, and then only check the "Blanks" box (you may need to scroll down in the pop-up window to locate it).
# Resolve missing higher taxonomy as you are able.


== Other Issues ==
== Other Issues ==
Line 180: Line 255:
'''Problem:''' When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.
'''Problem:''' When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Invalid Individual Count ===
=== Invalid Individual Count ===
Line 200: Line 275:
Note that even punctuation and capitalization differences in these values (e.g., Preserved Specimen) are discouraged.
Note that even punctuation and capitalization differences in these values (e.g., Preserved Specimen) are discouraged.


'''Solution:'''
'''Solution:''' Avoid this issue with a pick list as data are entered. For a tutorial see [https://support.microsoft.com/en-us/office/create-a-drop-down-list-7693307a-59ef-400a-b769-c5402dce407b Create a drop-down list]
 
If data is already entered, check for non-standard values with this formula Where A1 is the basisOfRecord value being tested and the list to test against is on the basis of record tab in cells A1 through A11. Any value in column A that does not match a value in the list to test against will result in #N/A.
 
'''WARNING!''' Note the '''$''' in front of the letters and numbers that comprise the list to test against. Leaving these off may result in false errors as the test will run against blank cells if the formula is copied past row 11 (or whatever the last row number is in the list to test against), so make sure those $ are in place!
 
=VLOOKUP(A1,'basis of record'!$A$1:$A$11,1,FALSE)
 
{|class="wikitable"
|+ Sample Results
|-
!A !! B
|-
| MaterialEntity || MaterialEntity
|-
| PreservedSpecimen || PreservedSpecimen
|-
| FossilSpecimen || FossilSpecimen
|-
| LivingSpecimen || LivingSpecimen
|-
| MaterialSample || MaterialSample
|-
| Preserved Specimen || #N/A
|-
| HumanObservation || HumanObservation
|-
| MachineObservation || MachineObservation
|-
| fossil || #N/A
|-
| tissue || #N/A
|-
| MaterialCitation | MaterialCitation
|}
 
 
This same process can be used to solve issues. Get a list of unique values in your basisOfRecord column by copying the entire column to column A in a new tab, highlight the copied data and from the main Excel menu select Data->Remove duplicates. In column B add the correct basisOfRecord term that should be used for every term in column A even if the two are the same.
 
Use a LOOKUP like the one above to get the correct basisOfRecord term for every row in your file.
 
=VLOOKUP(A1,'unique basis of record'!$A$1:$B$11,2,FALSE)
 
Note that the unique values to check are in column A of the unique basis of record tab and the correct replacement values are in column B of the unique basis of record tab. Also note that the value being returned is from column B indicated as the 2nd column in the array.
 
{|class="wikitable"
|+ Sample Results
|-
!A !! B
|-
| MaterialEntity || MaterialEntity
|-
| PreservedSpecimen || PreservedSpecimen
|-
| FossilSpecimen || FossilSpecimen
|-
| LivingSpecimen || LivingSpecimen
|-
| MaterialSample || MaterialSample
|-
| Preserved Specimen || PreservedSpecimen
|-
| HumanObservation || HumanObservation
|-
| MachineObservation || MachineObservation
|-
| fossil || FossilSpecimen
|-
| tissue || MaterialEntity
|-
| MaterialCitation || MaterialCitation
|}
 
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]
3

edits