Excel Data Quality Toolkit: Difference between revisions

m
 
(19 intermediate revisions by 3 users not shown)
Line 32: Line 32:


=IF(A1>TODAY(),"ERROR","OK")
=IF(A1>TODAY(),"ERROR","OK")
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


=== Date is Suspiciously Old ===
=== Date is Suspiciously Old ===
Line 51: Line 53:


=IF(A1<B1,"ERROR","OK")
=IF(A1<B1,"ERROR","OK")
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


=== Year, Month, and Day Values Do Not Match Date ===
=== Year, Month, and Day Values Do Not Match Date ===
Line 62: Line 66:
   
   
=DAY(A1)
=DAY(A1)
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


== Geography ==
== Geography ==
Line 80: Line 86:


'''Solution:'''
'''Solution:'''
Excel does not have an easy solution for this problem.


=== Georeference Metadata with no Associated Georeference ===
=== Georeference Metadata with no Associated Georeference ===
'''Problem:''' Metadata fields regarding coordinates, such as [https://dwc.tdwg.org/terms/#dwc:coordinateUncertaintyInMeters coordinateUncertaintyInMeters], [https://dwc.tdwg.org/terms/#dwc:georeferenceProtocol georeferenceProtocol], [https://dwc.tdwg.org/terms/#dwc:georeferenceSources georeferenceSources], [https://dwc.tdwg.org/terms/#dwc:georeferencedBy georeferencedBy], [https://dwc.tdwg.org/terms/#dwc:georeferenceRemarks georeferenceRemarks], and [https://dwc.tdwg.org/terms/#dwc:geodeticDatum geodeticDatum] are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferencedRemarks are used to indicate whether a record was purposefully not georeferenced. However, it is rare that the other metadata fields can be used without associated coordinates (i.e., [https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude], [ https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude], or [https://dwc.tdwg.org/terms/#dwc:verbatimCoordinates verbatimCoordinates]).
'''Problem:''' Metadata fields regarding coordinates, such as [https://dwc.tdwg.org/terms/#dwc:coordinateUncertaintyInMeters coordinateUncertaintyInMeters], [https://dwc.tdwg.org/terms/#dwc:georeferenceProtocol georeferenceProtocol], [https://dwc.tdwg.org/terms/#dwc:georeferenceSources georeferenceSources], [https://dwc.tdwg.org/terms/#dwc:georeferencedBy georeferencedBy], [https://dwc.tdwg.org/terms/#dwc:georeferenceRemarks georeferenceRemarks], and [https://dwc.tdwg.org/terms/#dwc:geodeticDatum geodeticDatum] are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferencedRemarks are used to indicate whether a record was purposefully not georeferenced. However, it is rare that the other metadata fields can be used without associated coordinates (i.e., [https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude], [ https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude], or [https://dwc.tdwg.org/terms/#dwc:verbatimCoordinates verbatimCoordinates]).


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Elevation is Unlikely ===
=== Elevation is Unlikely ===
Line 99: Line 106:
'''Problem:''' The sign of the latitude ([https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude]) or longitude ([https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude]) does not match the sign/hemisphere of the given country. For example, all longitudes in the U.S. should be negative.
'''Problem:''' The sign of the latitude ([https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude]) or longitude ([https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude]) does not match the sign/hemisphere of the given country. For example, all longitudes in the U.S. should be negative.


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Invalid Coordinates ===
=== Invalid Coordinates ===
'''Problem:''' Coordinates deviate from accepted ranges or formats, like decimalLatitude and decimalLongitude exceeding -90 to 90 and -180 to 180, respectively. verbatimCoordinates have to be valid values for coordinates in decimal degrees, degrees decimal minutes, degrees minutes second.  
'''Problem:''' Coordinates deviate from accepted ranges or formats, like decimalLatitude and decimalLongitude exceeding -90 to 90 and -180 to 180, respectively. verbatimCoordinates have to be valid values for coordinates in decimal degrees, degrees decimal minutes, degrees minutes second.  


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Lower Geography Values are Provided, but No Higher Geography ===
=== Lower Geography Values are Provided, but No Higher Geography ===
'''Problem:''' Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.
'''Problem:''' Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.


'''Solution:'''
'''Solution:''' Check for missing higher levels of geography with this formula where A1 is the country, B1 is the stateProvince, and C1 is the county.


=IF(ISBLANK(C1),IF(ISBLANK(B1),IF(ISBLANK(A1),"no geography","OK"),IF(ISBLANK(A1),"missing country","OK")),IF(ISBLANK(B1),IF(ISBLANK(A1),"missing state and country","missing state"),IF(ISBLANK(A1),"no country","OK")))
=IF(ISBLANK(C1),IF(ISBLANK(B1),IF(ISBLANK(A1),"no geography","OK"),IF(ISBLANK(A1),"missing country","OK")),IF(ISBLANK(B1),IF(ISBLANK(A1),"missing state and country","missing state"),IF(ISBLANK(A1),"no country","OK")))
{|class="wikitable"
|+ Sample Results
|-
! country !! stateProvince !! county !! check
|-
| || Texas || Kerr|| no country
|-
| USA || || Kerr|| missing state
|-
| || Texas || || missing country
|-
| USA || || || OK
|-
| || || Kerr || missing state and country
|-
| USA || Texas || Kerr || OK
|}
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


=== Minimum and Maximum Elevation Values Mismatched ===
=== Minimum and Maximum Elevation Values Mismatched ===
Line 119: Line 146:


=IF(A1>B1,"ERROR","OK")
=IF(A1>B1,"ERROR","OK")
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]


=== Mismatched Country and CountryCode Values ===
=== Mismatched Country and CountryCode Values ===
'''Problem:''' The provided value for [https://dwc.tdwg.org/terms/#dwc:country country] and [https://dwc.tdwg.org/terms/#dwc:countryCode countryCode] do not match.
'''Problem:''' The provided value for [https://dwc.tdwg.org/terms/#dwc:country country] and [https://dwc.tdwg.org/terms/#dwc:countryCode countryCode] do not match.


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Mismatched Geographic Terms ===
=== Mismatched Geographic Terms ===
'''Problem:''' A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.
'''Problem:''' A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Missing Geodetic Datum ===
=== Missing Geodetic Datum ===
Line 206: Line 235:


'''Solution:'''
'''Solution:'''
# Open your dataset in Excel and highlight Row 1.
# From the Data toolbar, click "Filter".
# Navigate to the column containing your higher taxonomy values (e.g. dwc:order) and select the arrow icon next to the column header indicating that this column can now be filtered on.
# At the bottom of the pop-up window, deselect "Select All", scroll down, and then only check the "Blanks" box (you may need to scroll down in the pop-up window to locate it).
# Resolve missing higher taxonomy as you are able.


== Other Issues ==
== Other Issues ==
Line 220: Line 255:
'''Problem:''' When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.
'''Problem:''' When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.


'''Solution:'''
'''Solution:''' Excel does not have an easy solution for this problem.


=== Invalid Individual Count ===
=== Invalid Individual Count ===
Line 312: Line 347:
| MaterialCitation || MaterialCitation
| MaterialCitation || MaterialCitation
|}
|}
[https://github.com/Jegelewicz/data_quality/blob/main/Excel_examples/2024%20Excel%20data%20quality%20checks.xlsx Sample file]
3

edits