Excel Data Quality Toolkit: Difference between revisions

Jump to navigation Jump to search
m
 
(24 intermediate revisions by 3 users not shown)
Line 32: Line 32:


=IF(A1>TODAY(),"ERROR","OK")
=IF(A1>TODAY(),"ERROR","OK")
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


=== Date is Suspiciously Old ===
=== Date is Suspiciously Old ===
Line 51: Line 53:


=IF(A1<B1,"ERROR","OK")
=IF(A1<B1,"ERROR","OK")
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


=== Year, Month, and Day Values Do Not Match Date ===
=== Year, Month, and Day Values Do Not Match Date ===
Line 62: Line 66:
   
   
=DAY(A1)
=DAY(A1)
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


== Geography ==
== Geography ==
Line 80: Line 86:


'''Solution:'''
'''Solution:'''
Excel does not have an easy solution for this problem.


=== Georeference Metadata with no Associated Georeference ===
=== Georeference Metadata with no Associated Georeference ===
'''Problem:''' Metadata fields regarding coordinates, such as [https://dwc.tdwg.org/terms/#dwc:coordinateUncertaintyInMeters coordinateUncertaintyInMeters], [https://dwc.tdwg.org/terms/#dwc:georeferenceProtocol georeferenceProtocol], [https://dwc.tdwg.org/terms/#dwc:georeferenceSources georeferenceSources], [https://dwc.tdwg.org/terms/#dwc:georeferencedBy georeferencedBy], [https://dwc.tdwg.org/terms/#dwc:georeferenceRemarks georeferenceRemarks], and [https://dwc.tdwg.org/terms/#dwc:geodeticDatum geodeticDatum] are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferencedRemarks are used to indicate whether a record was purposefully not georeferenced. However, it is rare that the other metadata fields can be used without associated coordinates (i.e., [https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude], [ https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude], or [https://dwc.tdwg.org/terms/#dwc:verbatimCoordinates verbatimCoordinates]).
'''Problem:''' Metadata fields regarding coordinates, such as [https://dwc.tdwg.org/terms/#dwc:coordinateUncertaintyInMeters coordinateUncertaintyInMeters], [https://dwc.tdwg.org/terms/#dwc:georeferenceProtocol georeferenceProtocol], [https://dwc.tdwg.org/terms/#dwc:georeferenceSources georeferenceSources], [https://dwc.tdwg.org/terms/#dwc:georeferencedBy georeferencedBy], [https://dwc.tdwg.org/terms/#dwc:georeferenceRemarks georeferenceRemarks], and [https://dwc.tdwg.org/terms/#dwc:geodeticDatum geodeticDatum] are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferencedRemarks are used to indicate whether a record was purposefully not georeferenced. However, it is rare that the other metadata fields can be used without associated coordinates (i.e., [https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude], [ https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude], or [https://dwc.tdwg.org/terms/#dwc:verbatimCoordinates verbatimCoordinates]).


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Elevation is Unlikely ===
=== Elevation is Unlikely ===
Line 99: Line 106:
'''Problem:''' The sign of the latitude ([https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude]) or longitude ([https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude]) does not match the sign/hemisphere of the given country. For example, all longitudes in the U.S. should be negative.
'''Problem:''' The sign of the latitude ([https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude]) or longitude ([https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude]) does not match the sign/hemisphere of the given country. For example, all longitudes in the U.S. should be negative.


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Invalid Coordinates ===
=== Invalid Coordinates ===
'''Problem:''' Coordinates deviate from accepted ranges or formats, like decimalLatitude and decimalLongitude exceeding -90 to 90 and -180 to 180, respectively. verbatimCoordinates have to be valid values for coordinates in decimal degrees, degrees decimal minutes, degrees minutes second.  
'''Problem:''' Coordinates deviate from accepted ranges or formats, like decimalLatitude and decimalLongitude exceeding -90 to 90 and -180 to 180, respectively. verbatimCoordinates have to be valid values for coordinates in decimal degrees, degrees decimal minutes, degrees minutes second.  


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Lower Geography Values are Provided, but No Higher Geography ===
=== Lower Geography Values are Provided, but No Higher Geography ===
'''Problem:''' Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.
'''Problem:''' Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.


'''Solution:'''
'''Solution:''' Check for missing higher levels of geography with this formula where A1 is the country, B1 is the stateProvince, and C1 is the county.
 
=IF(ISBLANK(C1),IF(ISBLANK(B1),IF(ISBLANK(A1),"no geography","OK"),IF(ISBLANK(A1),"missing country","OK")),IF(ISBLANK(B1),IF(ISBLANK(A1),"missing state and country","missing state"),IF(ISBLANK(A1),"no country","OK")))
 
{|class="wikitable"
|+ Sample Results
|-
! country !! stateProvince !! county !! check
|-
| || Texas || Kerr|| no country
|-
| USA || || Kerr|| missing state
|-
| || Texas || || missing country
|-
| USA || || || OK
|-
| || || Kerr || missing state and country
|-
| USA || Texas || Kerr || OK
|}
 
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


=== Minimum and Maximum Elevation Values Mismatched ===
=== Minimum and Maximum Elevation Values Mismatched ===
Line 117: Line 146:


=IF(A1>B1,"ERROR","OK")
=IF(A1>B1,"ERROR","OK")
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


=== Mismatched Country and CountryCode Values ===
=== Mismatched Country and CountryCode Values ===
'''Problem:''' The provided value for [https://dwc.tdwg.org/terms/#dwc:country country] and [https://dwc.tdwg.org/terms/#dwc:countryCode countryCode] do not match.
'''Problem:''' The provided value for [https://dwc.tdwg.org/terms/#dwc:country country] and [https://dwc.tdwg.org/terms/#dwc:countryCode countryCode] do not match.


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Mismatched Geographic Terms ===
=== Mismatched Geographic Terms ===
'''Problem:''' A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.
'''Problem:''' A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Missing Geodetic Datum ===
=== Missing Geodetic Datum ===
Line 163: Line 194:
'''Problem:''' The geographic units (e.g., [https://dwc.tdwg.org/terms/#dwc:country country], [https://dwc.tdwg.org/terms/#dwc:stateProvince state/province], [https://dwc.tdwg.org/terms/#dwc:county county]) are misspelled, resulting in poor matching of geographic unit names to existing geographic lists.
'''Problem:''' The geographic units (e.g., [https://dwc.tdwg.org/terms/#dwc:country country], [https://dwc.tdwg.org/terms/#dwc:stateProvince state/province], [https://dwc.tdwg.org/terms/#dwc:county county]) are misspelled, resulting in poor matching of geographic unit names to existing geographic lists.


'''Solution:'''
'''Solution:''' Avoid this issue with a pick list as data are entered. For a tutorial see [https://support.microsoft.com/en-us/office/create-a-drop-down-list-7693307a-59ef-400a-b769-c5402dce407b Create a drop-down list]
 
If data is already entered, check for non-standard values with this formula Where A1 is the geographic unit value being tested and the list to test against is on the geographic unit tab in cells A1 through A11. Any value in column A that does not match a value in the list to test against will result in #N/A.
 
'''WARNING!''' Note the '''$''' in front of the letters and numbers that comprise the list to test against. Leaving these off may result in false errors as the test will run against blank cells if the formula is copied past row 11 (or whatever the last row number is in the list to test against), so make sure those $ are in place!
 
=VLOOKUP(A1,'geographic unit'!$A$1:$A$11,1,FALSE)
 
This same process can be used to solve issues. Get a list of unique values in your geographic unit column by copying the entire column to column A in a new tab, highlight the copied data and from the main Excel menu select Data->Remove duplicates. In column B add the correct scientific name that should be used for every term in column A even if the two are the same.
 
Use a LOOKUP like the one above to get the correct scientific name for every row in your file.
 
=VLOOKUP(A1,'unique geographic unit'!$A$1:$B$11,2,FALSE)
 
Note that the unique values to check are in column A of the unique geographic unit tab and the correct replacement values are in column B of the unique geographic unit tab. Also note that the value being returned is from column B indicated as the 2nd column in the array.


== Taxonomy ==
== Taxonomy ==
Line 172: Line 217:
'''Solution:''' Avoid this issue with a pick list as data are entered. For a tutorial see [https://support.microsoft.com/en-us/office/create-a-drop-down-list-7693307a-59ef-400a-b769-c5402dce407b Create a drop-down list]
'''Solution:''' Avoid this issue with a pick list as data are entered. For a tutorial see [https://support.microsoft.com/en-us/office/create-a-drop-down-list-7693307a-59ef-400a-b769-c5402dce407b Create a drop-down list]


If data is already entered, check for non-standard values with this formula Where A1 is the Scientific name value being tested and the list to test against is on the taxa tab in cells A1 through A11. Any value in column A that does not match a value in the list to test against will result in #N/A.
If data is already entered, check for non-standard values with this formula Where A1 is the scientific name value being tested and the list to test against is on the taxa tab in cells A1 through A11. Any value in column A that does not match a value in the list to test against will result in #N/A.


'''WARNING!''' Note the '''$''' in front of the letters and numbers that comprise the list to test against. Leaving these off may result in false errors as the test will run against blank cells if the formula is copied past row 11 (or whatever the last row number is in the list to test against), so make sure those $ are in place!
'''WARNING!''' Note the '''$''' in front of the letters and numbers that comprise the list to test against. Leaving these off may result in false errors as the test will run against blank cells if the formula is copied past row 11 (or whatever the last row number is in the list to test against), so make sure those $ are in place!
Line 190: Line 235:


'''Solution:'''
'''Solution:'''
# Open your dataset in Excel and highlight Row 1.
# From the Data toolbar, click "Filter".
# Navigate to the column containing your higher taxonomy values (e.g. dwc:order) and select the arrow icon next to the column header indicating that this column can now be filtered on.
# At the bottom of the pop-up window, deselect "Select All", scroll down, and then only check the "Blanks" box (you may need to scroll down in the pop-up window to locate it).
# Resolve missing higher taxonomy as you are able.


== Other Issues ==
== Other Issues ==
Line 204: Line 255:
'''Problem:''' When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.
'''Problem:''' When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Invalid Individual Count ===
=== Invalid Individual Count ===
Line 296: Line 347:
| MaterialCitation || MaterialCitation
| MaterialCitation || MaterialCitation
|}
|}
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]
3

edits

Navigation menu