Dataset Errata

<br> Inconsistencies and errors in TENN Lichen Gold Parsed dateIdentified. Examples:


-- TENN-L-0000015_lg.csv has dateIdentified in the wrong format, neither verbatim nor standard DarwinCore format. Verbatim would be: Nov. 12, 1939; DarwinCore would be: 1939-11-12; listed is: 1939-November-12. (Bryan: I think Excel may have imposed its own format and changed the records. If the column type is set to "text", Excel will not transform the values to a new format.)


-- TENN-L-0000017_lg.csv omits dateIdentified, though it is on the label as 3 Feb. 1963. (Bryan: Agreed. Should be fixed to match the label.)


-- TENN-L-0000019_lg.csv has 1954-Aug-8, but on the label it is "8 Aug 1954", again neither verbatim nor DarwinCore (1954-08-08). (Daryl) (Bryan: Agreed. Should be fixed to match the label.)
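A check like the following could catch values that are neither verbatim nor in the YYYY-MM-DD form used in the examples above. This is only a minimal sketch: the function name is hypothetical, and it tests just the single full-date form, whereas ISO 8601 also allows partial dates and ranges that this check would reject.

```python
import re
from datetime import datetime

# Exactly four digits, two digits, two digits, separated by hyphens.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def is_iso_date(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        # strptime rejects impossible dates such as 1954-02-30.
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

# Values taken from the examples above.
for listed in ["1939-11-12", "1939-November-12", "1954-Aug-8", "1954-08-08"]:
    print(listed, is_iso_date(listed))
```

A value that fails this check is not necessarily wrong for a verbatim field; it only shows that the value is not in the standardized form.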


Gold Parsed NY01075760_lg.csv replaces the comma with a space, and replaces an apostrophe (') with a double quote (") in verbatimCoordinates: 38°42'20"N, 83°08'25'W is rendered as 38°42'20""N 83°08'25""W. (Note also that each double quote is replaced with two double quotes. This may be necessary to preserve the quote-delimited, comma-separated fields, but could cause some problems when uploading to a database. Not presented here as an error, but we should be aware of possible implications.) (Bryan: Agreed. Should be fixed to match the label, except that I think the doubled double quote is needed for CSV readers to identify fields. I am not sure.)
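On the doubled double quote: in the common CSV convention (RFC 4180), a double quote inside a quoted field is escaped by writing it twice, and a conforming reader turns it back into a single quote. The sketch below illustrates this with Python's standard csv module. The catalog number in the first column is a hypothetical placeholder, and whether a particular database importer undoes the doubling in the same way is an assumption that would need to be checked separately.

```python
import csv
import io

# Coordinates as quoted from the label in the entry above.
label_value = "38°42'20\"N, 83°08'25'W"

# Write one row; the first field is a hypothetical identifier.
buffer = io.StringIO()
csv.writer(buffer).writerow(["01075760", label_value])
raw_line = buffer.getvalue()
print(raw_line)  # the field is wrapped in quotes and its inner " is doubled

# Read the row back; the reader restores the single double quote.
parsed_row = next(csv.reader(io.StringIO(raw_line)))
print(parsed_row[1] == label_value)
```

Under this convention the doubling is part of the file encoding rather than of the field value, so a comparison against the label should be made on the parsed value, not on the raw line.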


Gold Parsed NY01075764_lg.csv has a similar problem where a single space is replaced with a double space in verbatimCoordinates. (Bryan: Agreed. Should be fixed to match the label as best as possible. If it is not clear, follow the OCR file.)


Inconsistencies in several Gold Parsed labels regarding whether to include the period at the end of a field as part of the field. Example: verbatimCoordinates in NY01075782_lg.csv includes the period at the end. NY01075780_lg.csv does not include the period. (Bryan: It could go either way, but I think for consistency throughout we should keep the period at the end of anything. Whenever there are sentences with a period we keep them, as in "One mile east of Dodge City."; we would not think of removing the period. In gold we should treat it as verbatim. If we do platinum it could be removed.)


Gold Parsed NY01075761_lg.txt corrects a Gold OCR error by adding the 1 to the end of 0107576. The field should be corrected in the Gold OCR, but until that is done, the parsing should be verbatim (see below under Gold OCR Errors). (Bryan: Hmmm. Gold should be as if the OCR engine read the label with no mistakes. Silver would leave it as 0107576, but in Gold we should put what was on the label. That would include the "1".)


Gold Parsed NY01075766_lg.csv omits the catalogNumber, though it is present in the NY01075766_lg.txt file. (Bryan: Agreed. Should be fixed to match the label.)


Gold Parsed NY01075789_lg.csv adds "NY" as a prefix to the catalogNumber, though it is not present in the .txt file. (Bryan: Agreed. Should be fixed to match the label.)


Gold Parsed NY01075760_lg.csv omits dataset, which should be "Lichens of Ohio". (Bryan: Agreed. Should be fixed to match the label.)


Gold Parsed NY01075775_lg.csv omits "Boulder" from municipality. (Bryan: Agreed. Should be fixed to match the label.)


Gold Parsed NY01075761_lg.csv lists "Peru" as municipality, but NY01075779_lg.csv lists "Town of Peru", though they both appear identical ("Town of Peru") on the labels. Probably should be "Peru" on both...? (Bryan: This is complex, but in this case I think it should be "Town of Peru". The bald "Peru" could be read as an error, with the misplacement of "Country". Likely the label author was worried about the same thing and included "Town of". In both cases include what is on the label.)


Gold Parsed TENN-L-0000029_lg.csv and TENN-L-0000035_lg.csv both list municipality as "NORTH AMERICA". (Bryan: Agreed. Should be assigned to "country".)


Gold Parsed TENN-L-0000083_lg.csv lists municipality as "Ontario", but this is not on the label. (Bryan: Agreed. Should be fixed to match the label.)


Gold Parsed WIS-L-0011732_lg.csv (and many other lichen gold parsed labels) removes a space from verbatimLatitude and from verbatimLongitude, changing this: 60° 33.579'N into this: 60°33.579'N. The space removal is inconsistent: it occurs on some labels but not on others. (Bryan: Agreed. Should be fixed to match the label. I think if the OCR had been perfect the space would not be in the OCR file, so it is a tough call.)
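Differences of this kind (a removed space here, a doubled space in NY01075764_lg.csv) are easy to miss by eye. A small helper along the following lines could flag field values that differ from the label or OCR text only in whitespace. It is a sketch with hypothetical function names, not part of any existing tooling for this dataset.

```python
import re

def remove_whitespace(text: str) -> str:
    """Drop every whitespace character from text."""
    return re.sub(r"\s+", "", text)

def whitespace_only_difference(csv_value: str, label_value: str) -> bool:
    """True if the two strings differ, but only in their whitespace."""
    if csv_value == label_value:
        return False
    return remove_whitespace(csv_value) == remove_whitespace(label_value)

# Values from the WIS-L-0011732 entry above.
print(whitespace_only_difference("60°33.579'N", "60° 33.579'N"))
```

Rows flagged this way would still need a manual decision about which spacing is correct, as the comment above notes.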


Gold Parsed NY01075791_lg.csv converts the "u" in "Mull" to an umlaut, yielding "Müll". This actually reflects the original label, but not the Gold OCR NY01075791_lg.txt file, which has "Mull". Same for NY01075792_lg.csv, and several others in the series. (Bryan: The OCR messed up. Gold should fix OCR errors, so the umlaut should stay.)


'''Gold Parsed CSV Files''' There are more errors in the Gold CSV files. (Qianjin)
'''(Bryan: I agree with Qianjin's edits except as noted below)'''


NY01075759_lg verbatimEventDate (1998-04-19), it should be 19 April 1998  
NY01075767_lg verbatimEventDate (July 1979), it should be (Jul-79)  


NY01075768_lg country (canada), it should be (ca.)


NY01075770_lg habitat (on Acmaea digitalis Eschsch. Host determined by A. R. Grant) and identifiedBy (A. R. Grant.)  
NY01075771_lg verbatimCoordinates mixed with verbatimLocality  


NY01075779_lg habitat concatenation (Bryan: "on Protoblastenia rupestris" appears before the location and habitat section of the label. However, habitat says "dolomite rock along lake shore and adjacent Thuja forest; on Protoblastenia rupestris". There was a period after "forest". The period was removed and a ";" added. Then the "on Protoblastenia rupestris" from the earlier part of the label was concatenated.)


NY01075780_lg NEW YOUR BOTANICAL GARDEN (Bryan: The label said "GARDEN". The OCR said "CARDEN". Silver should be "CARDEN"; Gold should be "GARDEN".)


NY01075789_lg catalogNumber (NY01075789) in the csv file; but it is (01075789) in the text file.  
NY01075822_lg no scientificName  


NY01075823_lg identifiedBy (Bryan: ?? I do not see the problem)


TENN-L-0000001_lg verbatimLocality mixed with verbatimElevation  
TENN-L-0000010_lg verbatimLocality contains (Exposure W,) but habitat contains (Exposure W).  


TENN-L-0000012_lg verbatimLocality (apria -s ) in the text file; but it is (apricas) in the csv file. (Bryan: On the label the word is hyphenated and split across lines. So, if we ignore the "New Line", the word is "apria -cas" and not what Qianjin listed. So, following the rule for gold of making perfect OCR, character by character, the csv should say "apria -cas".)


TENN-L-0000014_lg identifiedBy (H. Kashiwadani) in the csv file; but it is identifiedBy (S. Kurokawa and H. Kashiwadani) in the text file.  


<br> '''Silver Parsed CSV Files'''  
'''(Bryan: I do not get most of these. There should be OCR errors in silver. We do need to stay true to the OCR output.)'''


"Silver Parsed CSV Files" There were some errors in the Silver CSV dataset. (Steven C.)  
NY01075761_lg misspelling in verbatimScientificName  


NY01075762_lg misspelling in habitat; misspelling in verbatimLocality (Bryan: I think the problem is that habitat is concatenated with the substrate: "Abies [7a[5amzfem—Betu[a papyrzfem forest over granite adjacent to waterfall, parasite on Peltigera scabrosa". The parasite part is from another part of the label. It should be on a new row.)


NY01075764_lg misspelling in units for verbatimElevation  
