Dataset Errata

== Errors noted in various files  ==


New Errors 2/27/13, D. Lafferty: Label NY01075759_lg.txt has the authority (part of verbatimScientificName) as "Kocourková & F. Berger". Gold Parsed NY01075759_lg.csv has "Kocourkova & F. Berger", without the accent on the "a". (Or should we convert foreign characters to English characters?) (Bryan: All "special" characters should be preserved by using UTF-8.) (Ed: Accented "á" fixed.)
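
A minimal sketch of how this kind of mismatch could be flagged automatically, assuming both the label text and the Gold Parsed value have already been read as UTF-8 (for example with open(path, encoding="utf-8")). The function names are hypothetical and not part of any existing parsing tool.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining marks removed (e.g. 'á' becomes 'a')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def differs_only_by_accents(label_value: str, gold_value: str) -> bool:
    """True when the values differ, but match once accents are stripped."""
    label_nfc = unicodedata.normalize("NFC", label_value)
    gold_nfc = unicodedata.normalize("NFC", gold_value)
    if label_nfc == gold_nfc:
        return False
    return strip_accents(label_nfc) == strip_accents(gold_nfc)


# Values quoted from the erratum above.
label = "Kocourková & F. Berger"
gold = "Kocourkova & F. Berger"
print(differs_only_by_accents(label, gold))
```

Normalizing to NFC before comparing avoids false alarms when one file stores "á" as a single code point and the other as "a" plus a combining accent.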


'''Gold Parsing Errors'''  


Many of the Lichen Gold labels have verbatimLatitude and verbatimLongitude, but the Gold Parsed files do not have the calculated decimalLatitude and decimalLongitude. This seems especially true for the New York labels. (Daryl) (Bryan: I think the decimal values were a "bonus", but I could be wrong. If we choose to do this later, it might be easier to pre-fill as many fields as we can using your algorithm.) (Ed: Verbatim fields contain verbatim results. No lichen labels have DwC-compliant decimal coordinates. Likewise, no labels have DwC-compliant event dates, so you probably want to use only the verbatim fields for stats.)
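
If decimal coordinates are pre-filled later, the conversion might look roughly like the sketch below. It assumes the verbatim value is written as degrees, optional minutes, optional seconds, and a hemisphere letter; the actual label formats vary and are not specified here, so the pattern and the function name dms_to_decimal are illustrative assumptions only.

```python
import re
from typing import Optional

# Assumed verbatim shape: 35°30'N or 83°15'36"W (minutes and seconds optional).
_DMS = re.compile(
    r"^\s*(?P<deg>\d{1,3})\s*°\s*"
    r"(?:(?P<min>\d{1,2}(?:\.\d+)?)\s*'\s*)?"
    r"(?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*\"\s*)?"
    r"(?P<hem>[NSEWnsew])\s*$"
)


def dms_to_decimal(verbatim: str) -> Optional[float]:
    """Convert a degrees/minutes/seconds string to signed decimal degrees.

    Returns None when the string does not match the assumed pattern, so the
    verbatim field can be kept as-is and the decimal field left empty.
    """
    match = _DMS.match(verbatim)
    if match is None:
        return None
    degrees = float(match.group("deg"))
    minutes = float(match.group("min") or 0)
    seconds = float(match.group("sec") or 0)
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative in decimalLatitude / decimalLongitude.
    if match.group("hem").upper() in ("S", "W"):
        value = -value
    return round(value, 6)


print(dms_to_decimal("35°30'N"))
print(dms_to_decimal("83°15'36\"W"))
```

Returning None rather than guessing keeps unparseable labels out of the calculated fields, which fits Ed's point that only the verbatim fields are reliable for stats at present.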


This is open to debate, but I think Elevation should be a pure numeric field, assumed to be in meters. Therefore, it should not be expressed as "750 m", but rather as "750". verbatimElevation, of course, should retain the "m" if it was present on the label. (Note that Darwin Core apparently does not have a field called "elevation", but rather minimumElevationInMeters and maximumElevationInMeters, both numeric fields.) Not sure if this is something to change on the labels, but worth being aware of. I think parsing programs should generate the Darwin Core fields. (Daryl) (Bryan: Odd to not have "elevation". I agree with the use of verbatimElevation. If "elevation" is filled, it is numeric.)
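
As a rough illustration of the split between verbatimElevation and the numeric Darwin Core fields, the sketch below pulls a minimum and maximum in meters out of a verbatim string. It assumes the label gives meters (with or without an "m") and leaves anything else, such as feet, unparsed. The function name parse_elevation_m is hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed shapes: "750", "750 m", "700-800 m"; units other than meters are rejected.
_ELEVATION = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:-\s*(\d+(?:\.\d+)?))?\s*(?:m|meters|metres)?\.?\s*$",
    re.IGNORECASE,
)


def parse_elevation_m(verbatim: str) -> Optional[Tuple[float, float]]:
    """Return (minimumElevationInMeters, maximumElevationInMeters) or None."""
    match = _ELEVATION.match(verbatim)
    if match is None:
        return None
    low = float(match.group(1))
    # A single value is used for both minimum and maximum.
    high = float(match.group(2)) if match.group(2) else low
    return (low, high)


print(parse_elevation_m("750 m"))
print(parse_elevation_m("2500 ft"))
```

The verbatim string itself would still be stored unchanged in verbatimElevation; only the numeric fields are derived.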
Many Gold Parsed Tennessee lichen labels have country errors. Examples:


-- Gold Parsed TENN-L-0000001_lg.csv lists country as "USA", but on the .txt label it is "U.S.A." (with periods). Same with Gold Parsed TENN-L-0000035_lg.csv and others. (Daryl) (Bryan: Agreed. Should be fixed to match the label.) (Ed: Fixed. Note that TENN-L-0000035_lg.txt has "U. S. A. " with spaces, so that format was preserved.)


-- Gold Parsed TENN-L-0000005_lg.csv leaves country blank, but the label shows it as "USA". Again, maybe this is OK, but it should be consistent. (Daryl) (Bryan: Agreed. Should be fixed to match the OCR label.) (Ed: Fixed; country had the county value.)


Inconsistency and errors in TENN Lichen Gold Parsed dateIdentified. Examples:


-- TENN-L-0000015_lg.csv has dateIdentified in the wrong format, neither verbatim nor standard DarwinCore format. Verbatim would be: Nov. 12, 1939; DarwinCore would be: 1939-11-12; listed is: 1939-November-12. (Bryan: I think Excel may have imposed its own format and changed the records. If the column type is set to "text", Excel will not transform it to a new format.) (Ed: The verbatimEventDate (not dateIdentified) is in the correct format, but if you open the file in Excel it will convert the display to match the program's defaults.)


-- TENN-L-0000017_lg.csv omits dateIdentified, though it is on the label as 3 Feb. 1963. (Bryan: Agreed. Should be fixed to match the label.) (Ed: That is the collection date, not the dateIdentified. Format is correct in verbatimDate.)


-- TENN-L-0000019_lg.csv has 1954-Aug-8, but on the label it is "8 Aug 1954", again neither verbatim nor DarwinCore (1954-08-08). (Daryl) (Bryan: Agreed. Should be fixed to match the label.)
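
The date examples above can be checked with a small sketch like the following, which tries to turn a verbatim label date into the DarwinCore (ISO 8601) form and returns None for anything it does not recognize, such as the mixed "1939-November-12" style. The list of accepted formats is an assumption based only on the three labels quoted here, and month-name parsing with strptime assumes an English locale. The function name verbatim_to_iso is hypothetical.

```python
from datetime import datetime
from typing import Optional

# Formats assumed from the labels quoted above (periods are stripped first).
_FORMATS = ("%b %d, %Y", "%d %b %Y", "%B %d, %Y", "%d %B %Y")


def verbatim_to_iso(verbatim: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if the format is not recognized."""
    # Drop abbreviation periods ("Nov." becomes "Nov") and collapse whitespace.
    cleaned = " ".join(verbatim.replace(".", " ").split())
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


for text in ("Nov. 12, 1939", "3 Feb. 1963", "8 Aug 1954", "1939-November-12"):
    print(text, "->", verbatim_to_iso(text))
```

A None result is a useful flag in itself: a value that is neither a recognizable verbatim date nor already in ISO form is likely one that a spreadsheet program has reformatted.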