Dataset Errata: Difference between revisions

== Errors noted in various files  ==


New Errors 2/27/13, D. Lafferty: Label NY01075759_lg.txt has the authority (part of verbatimScientificName) as "Kocourková & F. Berger". Gold Parsed NY01075759_lg.csv has "Kocourkova & F. Berger", without the accent on the "a". (Or should we convert foreign characters to English characters?) (Bryan: All "special" characters should be preserved by using UTF-8.)
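A rough sketch of how accent loss like the "Kocourková" case could be detected automatically. The helper names are made up for illustration, and it assumes both files are read as UTF-8 (for example with `open(path, encoding="utf-8")`).

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining marks removed (e.g. 'á' becomes 'a')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def differs_only_by_accents(label_value: str, parsed_value: str) -> bool:
    """True when the values differ, but match once accents are stripped."""
    return (label_value != parsed_value
            and strip_accents(label_value) == strip_accents(parsed_value))


# Values quoted in the note above.
label = "Kocourková & F. Berger"
parsed = "Kocourkova & F. Berger"
print(differs_only_by_accents(label, parsed))
```

A check like this only flags candidates; someone still has to confirm against the label image which spelling is correct.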


'''Gold Parsing Errors'''  


Many of the Lichen Gold labels have verbatimLatitude and verbatimLongitude, but the Gold Parsed files do not have the calculated decimalLatitude and decimalLongitude. This seems especially true for the New York labels. (Daryl) (Bryan: I think the decimal values were "bonus", but I could be wrong. If we choose to do this later, it might be easier to pre-fill as many fields as we can using your algorithm.)
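For reference, a minimal sketch of deriving a decimal value from a verbatim coordinate. This is not the algorithm Bryan refers to. It assumes the degree sign, apostrophe, and double quote are typed exactly as shown, and the function name is hypothetical.

```python
import re

_NUM = r"(\d+(?:\.\d+)?)"
_DMS = re.compile(
    _NUM + r"\s*°\s*"                  # degrees
    + r"(?:" + _NUM + r"\s*'\s*)?"     # optional minutes
    + r"(?:" + _NUM + r'\s*"\s*)?'     # optional seconds
    + r"([NSEW])"                      # hemisphere letter
)


def dms_to_decimal(verbatim: str) -> float:
    """Convert e.g. 60° 33.579'N to signed decimal degrees."""
    match = _DMS.fullmatch(verbatim.strip())
    if match is None:
        raise ValueError(f"unrecognized coordinate: {verbatim!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = float(degrees) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    # South and West are negative by convention.
    return -value if hemisphere in "SW" else value


print(round(dms_to_decimal("60° 33.579'N"), 5))
```

Labels with typos in the coordinate (such as an apostrophe where a double quote belongs) raise an error here rather than being silently guessed at.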


This is open to debate, but I think Elevation should be a pure numeric field, assumed to be in meters. Therefore, it should not be expressed as "750 m", but rather as "750". verbatimElevation, of course, should retain the "m" if it was present on the label. (Note that Darwin Core apparently does not have a field called "elevation", but rather minimumElevationInMeters and maximumElevationInMeters, both numeric fields.) Not sure if this is something to change on the labels, but worth being aware of. I think parsing programs should generate the Darwin Core fields. (Daryl) (Bryan: Odd to not have "elevation". I agree with the use of verbatimElevation. If "elevation" is filled, it is numeric.)
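A small sketch of the split proposed here: keep the verbatim string and derive a numeric value in meters from it. The function name is hypothetical, and the sketch only accepts values that are bare numbers or numbers followed by a meters unit. Anything else is left for manual review.

```python
import re
from typing import Optional

# A number, optionally followed by a meters unit and a trailing period.
_ELEVATION = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:m|meters|metres)?\.?\s*$",
                        re.IGNORECASE)


def elevation_in_meters(verbatim_elevation: str) -> Optional[float]:
    """Return the numeric elevation, or None when the text is not a plain meters value."""
    match = _ELEVATION.match(verbatim_elevation)
    return float(match.group(1)) if match else None


print(elevation_in_meters("750 m"))   # verbatimElevation stays "750 m"
print(elevation_in_meters("750"))
print(elevation_in_meters("ca. 2500 ft"))
```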


Inconsistency in the Gold Parsed labels for Country. If a US State is listed as the state, the label doesn't always say the name of the country, though it is obviously the USA. Some Gold parsed results leave it blank, some fill it in with "USA", or "United States", though neither of these are on the label. I think it is valid to fill it in, but it should be consistent. (Daryl) (Bryan: I think for Gold the field should not be filled in if it is not on the label.)


Many Gold Parse Tennessee lichen labels have country errors. Examples:  


-- Gold Parsed TENN-L-0000001_lg.csv lists country as "USA", but on the .txt label, it is "U.S.A." (with periods). Same with Gold Parsed TENN-L-0000035_lg.csv and others. (Daryl) (Bryan: Agreed. Should be fixed to match the label.)


-- Gold Parsed TENN-L-0000005_lg.csv leaves country blank, but the label shows it as "USA". Again, maybe this is OK, but it should be consistent. (Daryl) (Bryan: Agreed. Should be fixed to match the OCR label.)
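If the rule is that a Gold field may only contain text found on the label, a simple consistency check could look like the sketch below. The function name and the sample label text are invented for illustration. Real label text would come from the corresponding _lg.txt file.

```python
def fields_not_on_label(parsed_row: dict, label_text: str) -> list:
    """List the fields whose non-empty value does not occur verbatim in the label text."""
    return [field for field, value in parsed_row.items()
            if value and value not in label_text]


# Invented label text and parsed row, shaped like the TENN examples above.
label_text = "Lichens of Tennessee, U.S.A. Sevier County."
parsed_row = {"country": "USA", "county": "Sevier County", "municipality": ""}
print(fields_not_on_label(parsed_row, label_text))
```

This flags "USA" because the label reads "U.S.A.". It cannot catch the opposite problem, where a value on the label is missing from the parsed file.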


Inconsistency and errors in TENN Lichen Gold Parsed dateIdentified. Examples:


-- TENN-L-0000015_lg.csv has dateIdentified in the wrong format, neither verbatim nor standard DarwinCore format: Verbatim would be: Nov. 12, 1939, DarwinCore would be: 1939-11-12, Listed is: 1939-November-12. (Bryan: I think Excel may have imposed its own format and changed the records. If the column type is set to "text", Excel will not transform it to a new format.)


-- TENN-L-0000017_lg.csv omits dateIdentified, though it is on the label as 3 Feb. 1963. (Bryan: Agreed. Should be fixed to match the label.)


-- TENN-L-0000019_lg.csv has 1954-Aug-8, but on the label it is "8 Aug 1954", again neither verbatim nor DarwinCore (1954-08-08). (Daryl) (Bryan: Agreed. Should be fixed to match the label.)
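To illustrate the difference between the verbatim and DarwinCore forms in the examples above, here is a sketch that converts the label styles seen so far into YYYY-MM-DD. The format list only covers those styles, the function name is hypothetical, and month-name parsing assumes an English locale.

```python
from datetime import datetime
from typing import Optional

# Label styles seen in the examples above; extend as new styles turn up.
_LABEL_FORMATS = ("%b. %d, %Y", "%d %b. %Y", "%d %b %Y")


def to_iso_date(verbatim_date: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known label style matches."""
    for fmt in _LABEL_FORMATS:
        try:
            return datetime.strptime(verbatim_date.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


for text in ("Nov. 12, 1939", "3 Feb. 1963", "8 Aug 1954", "1939-November-12"):
    print(text, "->", to_iso_date(text))
```

The mixed form 1939-November-12 is deliberately not accepted, so it shows up as unparsed rather than being treated as valid.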


Gold Parsed NY01075760_lg.csv replaces the comma with a space, and replaces an apostrophe (') with a double quote (") in verbatimCoordinates: 38°42'20"N, 83°08'25'W is rendered as 38°42'20""N 83°08'25""W. (Note also that the double quote is replaced with two double quotes. This may be necessary to preserve the quote-delimited, comma-separated fields, but could cause some problems when uploading to a database. Not presented here as an error, but we should be aware of possible implications.) (Bryan: Agreed. Should be fixed to match the label, except for the doubled double quote, which I think is needed for CSV readers to identify fields. I am not sure.)
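On the doubled double quotes: Python's standard csv module, used here only as one example of a CSV reader and writer, doubles embedded quotes on output and restores them on input. The sketch below round-trips the label value quoted above. The helper names are made up.

```python
import csv
import io


def to_csv_line(fields: list) -> str:
    """Serialize one row with the csv module's default quoting rules."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue()


def from_csv_line(line: str) -> list:
    """Parse one CSV line back into its fields."""
    return next(csv.reader(io.StringIO(line)))


label_value = "38°42'20\"N, 83°08'25'W"   # as quoted from the label above
line = to_csv_line([label_value])
print(line.strip())                          # embedded " is written as ""
print(from_csv_line(line) == [label_value])  # the reader restores the original
```

So the doubling seen in the raw file is an escaping detail of the format. It should disappear once the file is loaded with a CSV-aware reader, while the comma and apostrophe changes are real differences from the label.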


Gold Parsed NY01075764_lg.csv has a similar problem where a single space is replaced with a double space in verbatimCoordinates. (Bryan: Agreed. Should be fixed to match the label as best as possible. If it is not clear, follow the OCR file.)


Inconsistencies in several Gold Parsed labels regarding whether to include the period at the end of a field as part of the field. Example: verbatimCoordinates in NY01075782_lg.csv includes the period at the end. NY01075780_lg.csv does not include the period. (Bryan: It could go either way, but I think for consistency throughout we should keep the period at the end of anything. Whenever there are sentences with a period we keep them, as in "One mile east of Dodge City."; we would not think of removing the period. In gold we should treat it as verbatim. If we do platinum, it could be removed.)


Gold Parsed NY01075761_lg.txt corrects a Gold OCR error by adding the 1 to the end of 0107576. The field should be corrected in the Gold OCR, but until done so, the parsing should be verbatim (see below under Gold OCR Errors). (Bryan: Hmmm. Gold should be as if the OCR engine read the label with no mistakes. Silver would leave it as 0107576, but in Gold we should put what was on the label. That would include the "1".)


Gold Parsed NY01075766_lg.csv omits the catalogNumber, though it is present in the NY01075766_lg.txt file. (Bryan: Agreed. Should be fixed to match the label.)


Gold Parsed NY01075789_lg.csv adds "NY" as a prefix to the catalogNumber, though it is not present on the .txt file. (Bryan: Agreed. Should be fixed to match the label.)


Gold Parsed NY01075760_lg.csv omits dataset, which should be "Lichens of Ohio". (Bryan: Agreed. Should be fixed to match the label.)


Gold Parsed NY01075775_lg.csv omits "Boulder" from municipality. (Bryan: Agreed. Should be fixed to match the label.)


Gold Parsed NY01075761_lg.csv lists "Peru" as municipality, but NY01075779_lg.csv lists "Town of Peru", though they both appear identical ("Town of Peru") on the labels. Probably should be "Peru" on both...? (Bryan: This is complex, but in this case I think it should be "Town of Peru". The bare "Peru" could be read as an error, a misplaced country. Likely the label author was worried about the same thing and included "Town of". In both cases, include what is on the label.)


Gold Parsed TENN-L-0000029_lg.csv and TENN-L-0000035_lg.csv both list municipality as "NORTH AMERICA". (Bryan: Agreed. Should be assigned to "country".)


Gold Parsed TENN-L-0000083_lg.csv lists municipality as "Ontario", but this is not on the label. (Bryan: Agreed. Should be fixed to match the label.)


Gold Parsed WIS-L-0011732_lg.csv (and many other lichen gold parsed labels) removes a space from verbatimLatitude and from verbatimLongitude, changing this: 60° 33.579'N into this: 60°33.579'N. The space removal is inconsistent: it happens on some labels but not on others. (Bryan: Agreed. Should be fixed to match the label. I think if the OCR had been perfect the space would not be in the OCR file, so it is a tough call.)


Gold Parsed NY01075791_lg.csv converts the "u" in "Mull" to an umlaut, yielding "Müll". This actually reflects the original label, but not the Gold OCR NY01075791_lg.txt file, which has "Mull". Same for NY01075792_lg.csv, and several others in the series. (Bryan: The OCR messed up. Gold should fix OCR errors, so the umlaut should stay.)


'''Gold Parsed CSV Files''' There are more errors in gold csv files. (Qianjin)  