Hackathon Challenge: Difference between revisions

Gold Parsed NY01075791_lg.csv converts the "u" in "Mull" to an umlaut, yielding "Müll". This reflects the original label, but not the Gold OCR file NY01075791_lg.txt, which has "Mull". The same applies to NY01075792_lg.csv and several others in the series.
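
One way to keep this kind of mismatch from being counted as a parsing error is to compare values with diacritics folded away. The following is only a minimal sketch in Python; the function names are hypothetical and are not part of any hackathon tooling.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return text with combining marks removed, e.g. 'Müll' becomes 'Mull'."""
    # NFKD splits a precomposed character such as 'ü' into 'u' plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_modulo_diacritics(parsed_value: str, ocr_value: str) -> bool:
    """Hypothetical helper: True if the two values differ only in diacritics."""
    return fold_diacritics(parsed_value) == fold_diacritics(ocr_value)
```

With this, the parsed value "Müll" and the OCR value "Mull" compare as equal, while a genuinely different string still does not.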


""Gold Parsed CSV Files""
There are more errors in the gold CSV files. (Qianjin)


NY01075759_lg verbatimEventDate (1998-04-19), it should be 19 April 1998
NY01075760_lg no datasetName
NY01075765_lg verbatimEventDate (Feb. 1898), it should be verbatimEventDate ( Feb 1898.)
NY01075766_lg decimalLatitude (White Horse Beach, between Manomet Pt. and Rocky Pt., Plymouth area), it should be locality or habitat; no catalogNumber
NY01075767_lg verbatimEventDate format
NY01075767_lg verbatimEventDate (July 1979), it should be (Jul-79)
NY01075768_lg country (canada), it should be (ca.)
NY01075770_lg habitat (on Acmaea digitalis Eschsch. Host determined by A. R. Grant) and identifiedBy (A. R. Grant.)
NY01075770_lg habitat (Host determined by A. R. Grant)
NY01075771_lg verbatimCoordinates mixed with verbatimLocality
NY01075779_lg habitat concatenation
NY01075780_lg NEW YOUR BOTANICAL GARDEN
NY01075789_lg catalogNumber (NY01075789) in the csv file; but it is (01075789) in the text file.
NY01075797_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.
NY01075805_lg stateProvince (South Carolina) in the csv file; but it is (S.C.) in the text file.
NY01075812_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.
NY01075816_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.
NY01075817_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.
NY01075818_lg no scientificName
NY01075819_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.
NY01075820_lg recordedBy ( William Russell Buck) in the csv file; but it is (William R. Buck) in the text file.
NY01075821_lg scientificName (null)
NY01075821_lg no scientificName
NY01075822_lg no scientificName
NY01075823_lg identifiedBy
TENN-L-0000001_lg verbatimLocality mixed with verbatimElevation
TENN-L-0000010_lg verbatimLocality contains (Exposure W,) but habitat contains (Exposure W).
TENN-L-0000012_lg verbatimLocality (apria -s ) in the text file; but it is (apricas) in the csv file.
TENN-L-0000014_lg identifiedBy (H. Kashiwadani) in the csv file; but it is identifiedBy (S. Kurokawa and H. Kashiwadani) in the text file.
TENN-L-0000015_lg verbatimInstitution (TENNESSEE (TENN))
TENN-L-0000016_lg verbatimInstitution (HERBARIUM OF THE UNIVERSITY OF TENNESSEE)
TENN-L-0000017_lg verbatimInstitution (University of Tennessee (TENN))
TENN-L-0000018_lg verbatimInstitution (University of Tennessee (TENN))
TENN-L-0000019_lg identifiedBy (Alt.Set.) in the csv file; verbatimEventDate (8 Aug 1954) is mixed with dateIdentified (8 Aug 1954)
TENN-L-0000021_lg verbatimInstitution ((TENN))
TENN-L-0000022_lg verbatimEventDate (23 July 1955) is mixed with dateIdentified (23 July 1955)
TENN-L-0000033_lg no catalogNumber in OCRed text file
TENN-L-0000036_lg verbatimEventDate (format)
TENN-L-0000045_lg recordNumber (null)
TENN-L-0000045_lg stateProvince (Mont.) in the text file; but it is (Montana) in the csv file.
TENN-L-0000048_lg verbatimLocality (near) is mixed with habitat (near)
TENN-L-0000050_lg stateProvince (Mont.) in the text file; but it is (Montana) in the csv file. verbatimElevation (Alt.: 6000 ft) in the csv file.
TENN-L-0000052_lg identifiedBy (Alt.: About 3500 ft.)
TENN-L-0000053_lg identifiedBy is on the 2nd line; dateIdentified is on the 2nd line.
TENN-L-0000054_lg identifiedBy (!A. skoepa) in the text file; but it is (A. skoepa) in the csv file.
TENN-L-0000056_lg oliff occurs in habitat but it is cliff in text file; dateIdentified (format)
TENN-L-0000063_lg verbatimLocality contains scientific name
TENN-L-0000063_lg verbatimScientificName (Amherst)
TENN-L-0000064_lg recordedBy (H. A. Sierk) is mixed with identifiedBy (H. A. Sierk); verbatimEventDate (August 1, 1957) is mixed with dateIdentified (August 1, 1957)
TENN-L-0000065_lg recordedBy (A. J. Sharp) is mixed with identifiedBy (A. J. Sharp); verbatimEventDate (31 July, 1955) is mixed with dateIdentified (31 July, 1955)
TENN-L-0000068_lg verbatimLocality (edge of road near gorge); habitat (bark, edge of road)
TENN-L-0000072_lg verbatimCoordinates contains null in the csv file; (Lat. 40° N) is in text file.
TENN-L-0000076_lg stateProvince (Minn,) in the text file; but it is (Minnesota) in the csv file.
TENN-L-0000077_lg identifiedBy (Date) in the csv file
TENN-L-0000077_lg datasetName (Michigan FLORA OF) in the text file; but it is (FLORA OF Michigan) in the csv file.
TENN-L-0000083_lg no recordNumber in the csv file; dateIdentified (format)
TENN-L-0000083_lg verbatimEventDate (August 1 1957) is mixed with dateIdentified (August 1 1957)
TENN-L-0000084_lg scientificName (null)
TENN-L-0000089_lg verbatimCoordinates (Lat.40 N.) in the text file; but no verbatimCoordinates in the csv file
TENN-L-0000090_lg stateProvince (AK) in the csv file; but it is (ALASKA) in the text file.
WIS-L-0011728_lg stateProvince (AK) in the text file; but it is (ALASKA) in the csv file.
WIS-L-0011730_lg stateProvince (AK) in the text file; but it is (ALASKA) in the csv file. habitat (Site: ) in the csv file.
WIS-L-0012026_lg no datasetName
WIS-L-0012038_lg no verbatimCoordinates
WIS-L-0012040_lg locality (Cen- tral Brooks) in the text file; but it is (Central Brooks) in the csv file.
WIS-L-0012041_lg no datasetName in the csv file; no scientificName in the csv file; verbatimEventDate (format) in the csv file; dateIdentified (format) in the csv file
WIS-L-0012045_lg verbatimCoordinates concatenation
WIS-L-0012051_lg dateIdentified (format)
WIS-L-0012055_lg verbatimEventDate (format)
WIS-L-0012055_lg verbatimEventDate (19 July 2003) in the text file; but it is (2003-July-19) in the csv file
WIS-L-0012056_lg dateIdentified (format)
WIS-L-0012057_lg no datasetName
WIS-L-0012064_lg verbatimCoordinates concatenation
WIS-L-0012073_lg identifiedBy (By P. Y. Wong) in the csv file
WIS-L-0012074_lg county (null)
WIS-L-0012077_lg verbatimLocality contains verbatimCoordinates
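
Many of the entries above are cases where a value in the gold csv file does not appear verbatim in the gold OCR text file (for example "South Carolina" vs "S.C."). A check along these lines could flag such fields automatically. This is only a rough sketch in Python: the function names are hypothetical, the record is assumed to be a mapping from field name to value, and whitespace is collapsed before comparing.

```python
import re


def normalize_ws(text: str) -> str:
    """Collapse runs of whitespace (including newlines) to single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def fields_missing_from_ocr(record: dict, ocr_text: str) -> list:
    """Hypothetical check: list the fields whose value is not found verbatim in the OCR text.

    Empty values are skipped, so a missing or empty field is not reported here.
    """
    haystack = normalize_ws(ocr_text)
    missing = []
    for field, value in record.items():
        needle = normalize_ws(value or "")
        if needle and needle not in haystack:
            missing.append(field)
    return missing


# Made-up OCR text and record, using values quoted in the list above.
sample_ocr = "S.C.\nWilliam R. Buck"
sample_record = {"stateProvince": "South Carolina", "recordedBy": "William R. Buck"}
flagged = fields_missing_from_ocr(sample_record, sample_ocr)  # ["stateProvince"]
```

A check like this would only find verbatim mismatches; fields placed in the wrong column or mixed with another field would still need manual review.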

