Forums:
Please review the proposed iDigBio policy related to creating and managing GUIDs. Community comment is requested and encouraged. Edit: The comment period will remain open until 3/2/2012.
Should identifiers be unique and persistent?
There seems to be no stated recommendation that identifiers be either unique identifiers or persistent identifiers in the sense of the Definitions.
Definitions of uniqueness and persistence
The definitions offered of unique identifier and of persistent identifier ambiguously overlap by virtue of the text "associated with a single object" in the definition of persistence. Probably the definition of persistence would be less problematic if only the point of the second sentence were made. Perhaps the definition would be simply "A persistent identifier is one which is never assigned to a different object, whether or not the object to which it is assigned persists."
------
In the definition of unique identifier:
(a) what would be an example of an ambiguous name?
(b) what would be an example of a duplicated name?
If there are no such examples, there is no point to mentioning those attributes. The definition misses an opportunity to clarify the main thing that confuses neophytes, which is that uniqueness doesn't mean that an object must have only one identifier. Perhaps a better first clause might be "an identifier that names at most one object at any point in time."
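The suggested wording can be illustrated with a small sketch. The Python below is only an illustration under assumed data: the identifier strings and object labels are made up, and `non_unique_identifiers` is a hypothetical helper, not something defined by the policy. It flags an identifier that names more than one object, while allowing one object to carry several identifiers.

```python
from collections import defaultdict


def non_unique_identifiers(assignments):
    """Return the identifiers that name more than one object.

    `assignments` is an iterable of (identifier, object_label) pairs
    describing a snapshot at one point in time.
    """
    objects_by_id = defaultdict(set)
    for identifier, obj in assignments:
        objects_by_id[identifier].add(obj)
    return {i: objs for i, objs in objects_by_id.items() if len(objs) > 1}


# Hypothetical data: one specimen carrying two identifiers is acceptable ...
ok = [("id:A", "specimen-1"), ("id:B", "specimen-1")]
# ... but one identifier naming two objects violates uniqueness.
bad = [("id:A", "specimen-1"), ("id:A", "specimen-2")]

print(non_unique_identifiers(ok))
print(non_unique_identifiers(bad))
```

Under this reading, uniqueness constrains the identifier-to-object direction only; nothing stops an object from having many names.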
specimen vs specimen record
"The primary digital catalog record of a specimen may be identified with the specimen’s identifier or may have its own identifier."
According to the definition of uniqueness offered, if a catalog record is also identified with the specimen's identifier, it is not unique, since the record and the specimen are not the same object. The sentence "Each distinct digital object should have its own identifier." does admit the possibility that it is not unique in the sense of the definition, but then it will not be an IETF/W3C compliant URI, which is probably not the intent.
It is easy to construct a use-case where making the primary specimen record identifier be that of the specimen may raise issues. Consider a specimen on loan to another institution. In that case, the institution holding the specimen record will not be the same as that holding the specimen. In turn, if dereferencing provenance is important, that might impose a burden on the dereferencing service to provide data about both the record and the specimen, which it may be unwilling or unable to do.
Most of the section "What to Identify" conflates specimens with specimen records. I think this is a bad idea and unnecessary.
Resolution and parseable URIs
The first paragraph of the document almost treads on the wide misuse confusing URI resolution with URI dereferencing. As described in the Appendix, the proxy server is indeed a resolution service in the IETF and W3C sense, but an insufficiently informed reader of the Guidelines might come to believe that the URI string itself holds information for retrieving the information without actually going through a resolution service. A resolution (in the IETF/W3C sense) that is based on parsing the URI string may be correct at some point in time, but not in the future (e.g. if the dereferencing service changes, requiring a resolution service to return a different URL). Hence the form of the http URI matters not one whit. But some stakeholders may behave as though it does, and it is probably difficult to convince people that this is a bad idea if the string appears meaningful to a human.
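To make the distinction concrete, here is a minimal sketch of a resolution service that treats the identifier as an opaque key. Everything in it is assumed for illustration: the `Resolver` class, the example.org identifier, and both target URLs are hypothetical and do not describe the iDigBio proxy server.

```python
class Resolver:
    """Toy resolution service: maps an opaque identifier to its current URL."""

    def __init__(self):
        self._table = {}

    def register(self, identifier, url):
        # Registering again replaces the earlier target.
        self._table[identifier] = url

    def resolve(self, identifier):
        # Whole-string lookup; the identifier itself is never parsed.
        return self._table[identifier]


resolver = Resolver()
guid = "http://example.org/herb/12345"  # hypothetical identifier

resolver.register(guid, "http://old-host.example.org/records/12345")
# The dereferencing service moves; only the table entry changes.
resolver.register(guid, "http://new-host.example.net/specimens/12345")

print(resolver.resolve(guid))
```

A client that had instead guessed the location by parsing `guid` would still be pointing at the old host after the move.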
Guids vs. multiple digital images
If I am understanding the proposal correctly, a URI should reference uniquely either an object or a digital representation of the object. Given the format of the proposed URI, how does one distinguish them if you are using the barcode as the object name for both? What if you have digital images of the same object stored in two formats (e.g., DNG and JPG) - how do you distinguish them?
Guids vs. multiple digital images
You can't distinguish them, which is why under this model such an identifier would not be a URI as defined by IETF STD 66, aka RFC 3986, as specified in https://datatracker.ietf.org/doc/rfc3986/. Separate URIs are needed for the physical object and each digital representation of it, which includes metadata records in a specimen database. dc:references is available in Darwin Core and would serve to associate a URI for the physical object with one for a digital representation. One problem with this is that Dublin Core terms have no formal semantics, and it is not exactly clear what to make of dc:references if one wishes to have several different kinds of things referenced by, e.g., the URI for a physical object.
The format issue is even more delicate for a number of reasons, not the least of which is that some formats usually use lossy compression (jpg normally does), so there is no robust way to even do a reliable pixel-wise comparison between two formats, at least one of which is lossy. (For example, compressing a lossless format file into jpg, then decompressing it, then recompressing it will in general not lead to the same jpg bytewise.) These arguments favor lossless compression, but that is another story.
Some of these issues are treated in http://species-id.net/wiki/Audubon_Core_Term_List , hopefully soon to be opened for public comment as a proposed TDWG standard.
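As a rough illustration of the point about separate URIs, the sketch below gives the physical specimen, its database record, and each image format its own identifier, and links them with a references relation. All URIs are invented, the triple layout is an assumption for illustration, and `referenced_by` is a hypothetical helper.

```python
# Hypothetical URI for the physical specimen.
SPECIMEN = "http://example.org/specimen/12345"

# (subject, predicate, object) triples; every representation has its own URI.
triples = [
    (SPECIMEN, "dcterms:references", "http://example.org/record/12345"),
    (SPECIMEN, "dcterms:references", "http://example.org/image/12345/dng"),
    (SPECIMEN, "dcterms:references", "http://example.org/image/12345/jpg"),
]


def referenced_by(subject, triples):
    """List the URIs that `subject` points to via dcterms:references."""
    return [o for s, p, o in triples if s == subject and p == "dcterms:references"]


for uri in referenced_by(SPECIMEN, triples):
    print(uri)
```

The sketch also shows the weakness noted above: the relation alone does not say which target is a record and which is an image, so that would need additional metadata.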
Take guidance from the TDWG DarwinCore standard
The TDWG DarwinCore defines a term occurrenceID http://rs.tdwg.org/dwc/terms/index.htm#occurrenceID. The definition of this term specifies that it is an identifier for the occurrence, not a digital representation of the occurrence. The commentary for the term also suggests a scheme for identifying specimens that don't have a GUID with the scheme: urn:catalog:[institutionCode]:[collectionCode]:[catalogNumber]. There is another term, drawn from DublinCore, dcterms:references http://rs.tdwg.org/dwc/terms/index.htm#dcterms:references, which could be used to reference a digital representation of an Occurrence.
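A short sketch of how the urn:catalog scheme and dcterms:references might fit together in one record. The codes, catalog number, and URL are hypothetical, `catalog_urn` is an illustrative helper, and the check that rejects empty components or components containing a colon is my own assumption rather than part of Darwin Core.

```python
def catalog_urn(institution_code, collection_code, catalog_number):
    """Build urn:catalog:[institutionCode]:[collectionCode]:[catalogNumber]."""
    parts = [institution_code, collection_code, catalog_number]
    for part in parts:
        # Assumed guard: an empty part or an embedded colon would make
        # the resulting string ambiguous.
        if not part or ":" in part:
            raise ValueError(f"invalid component: {part!r}")
    return "urn:catalog:" + ":".join(parts)


# Hypothetical record: the identifier names the occurrence itself, while
# dcterms:references points at a digital representation of it.
occurrence = {
    "occurrenceID": catalog_urn("XYZ", "Herps", "12345"),
    "dcterms:references": "http://example.org/records/12345",
}

print(occurrence["occurrenceID"])
```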
Catalog numbers come with a guarantee of non-uniqueness
Elaborating on the point made in the document that a darwin core triplet of institution, catalog number series, and catalog number is non-unique: It is very likely that catalog numbers assigned to collection objects from workflow processes that involved cataloging into ledgers have cases that can be seen as permanently non-unique. We know that cataloging into ledgers has produced runs of duplicate catalog number assignments (more than one different collection object being assigned the same catalog number), often in blocks of pairs of ledger pages where the starting number on a new page was copied from the page before the previous page, rather than the previous page. Anecdotally, I've encountered cases where two different specimens sharing the same number were type specimens of different taxa, and those numbers have been published. In the presence of an institutional policy that prohibits the recataloging of types, such specimens would permanently bear the same non-unique number.
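A collection could check its own data for this before relying on the triplet as an identifier. The following is a minimal sketch, assuming records are available as dictionaries keyed by Darwin Core term names; the sample records and the `duplicated_triplets` helper are hypothetical.

```python
from collections import Counter


def duplicated_triplets(records):
    """Return each (institution, collection, catalog number) triplet that is
    carried by more than one record, together with its count."""
    counts = Counter(
        (r["institutionCode"], r["collectionCode"], r["catalogNumber"])
        for r in records
    )
    return {triplet: n for triplet, n in counts.items() if n > 1}


# Hypothetical records: two different specimens were ledgered as number 1001.
records = [
    {"institutionCode": "XYZ", "collectionCode": "Herps", "catalogNumber": "1001"},
    {"institutionCode": "XYZ", "collectionCode": "Herps", "catalogNumber": "1001"},
    {"institutionCode": "XYZ", "collectionCode": "Herps", "catalogNumber": "1002"},
]

print(duplicated_triplets(records))
```

Where recataloging is prohibited, the duplicates found this way would need an identifier that does not depend on the catalog number at all.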
URI is not necessarily a URL
I'd like to elaborate on the following quote from the Guidelines for Managing Unique Resource Identifiers document, paragraph 3.
Several conversations in the collections community indicate there is perhaps still some confusion regarding URIs and URLs. In reality, a URL is a special type of URI. A URL is a type of URI in that it must uniquely identify something AND when you click on it (or enter it in a browser), you get something back (a web page, a downloadable file, some metadata...). So, a URL is both a URI (a unique identifier) and a locator (actionable and resolvable).
A URI is (hopefully) unique and persistent, but does not need to be actionable or resolvable (not a locator). Nothing is necessarily expected to happen when one "clicks" on this. A URI is an identifier; it is not required to be a locator in the sense of a URL. In the suggested format, the organization chooses and registers a domain name string as desired to create a unique URI pattern when put together with the other elements (prefix + domain + collection identifier + object name). The entire string represents the unique identifier for a given object. Another way to say this is, the string is atomic.
GUID selection
I am assuming that iDigBio will provide assistance to providers in selecting and registering URI schemes with both iDigBio and IANA? I can see this as being a difficulty in collections with only limited IT experience and little or no web presence for their digital objects.
Registration of schemes
iDigBio is committed to providing assistance to providers and will create and maintain a website for registering identifier templates.
As to URI schemes: I don't think that we will be creating any new URI schemes, but rather using the existing HTTP or urn schemes.
General comments on proposed guidance
Some general comments from my colleagues about the proposed GUIDs guidance:
Agreement with need for services
I am in agreement with the described need for directory services and change management. Putting Identification on specimens does not solve problems of changes in URLs, changes in specimen properties, changes in specimen ownership or location, or the inability of an organization to commit to identification. We have to help people understand the advantages and responsibilities of identification and help organizations commit to proper identification policies and behaviors.
The DOI (digital object identifier) system has been designed and implemented to identify publications and other digital objects. It does not support the large number of identifiers that must be created and maintained for the hundreds of millions of objects that will be in the iDigBio portal. The success of the ADBC project will depend on the correct application of identifiers and the ability to invoke services that produce information about the objects. The DOI system provides a good example of how a community can make identification and resolution work, but does not directly support our needs.
Greg Riccardi
RFC Closed
Thank you to all who reviewed and provided feedback regarding the iDigBio GUID Policy document. The final document has been posted to the iDigBio website. Other Frequently Asked Questions and explanations have been moved to a GUID wiki page on the iDigBio website for review, update and additional community comment. We expect this appendix document to mature as questions related to GUID implementations are raised by the community. We welcome your notes and interaction on the GUID wiki page.