Persistent Identifier (PID) demonstrator

Persistent identifiers demonstrator for Towards a National Collection - HeritagePIDs.

Version 2 rethink

Version 1 (described below) stored the annotations in a triple store, which requires RDF and SPARQL. That lets us embed the problem in a much bigger space (e.g., the biodiversity knowledge graph), but it makes life more complicated, especially if we want to add examples easily.

Version 2 tackles this by dropping the triple store and just using a local table of examples, based on a spreadsheet.
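As a rough illustration of the "local table of examples" idea, the sketch below reads tab-separated rows and filters them by a column value. The column names (`entity`, `publication`, `text`) and the sample row are assumptions made for this example, not the headers of the actual spreadsheet, and the project itself is written in PHP rather than Python.

```python
import csv
import io

# Hypothetical sample data: the real spreadsheet may use different headers.
SAMPLE_TSV = (
    "entity\tpublication\ttext\n"
    "http://data.rbge.org.uk/herb/E00001237\thttps://example.org/paper/1\tE00001237\n"
)


def load_examples(handle):
    """Read tab-separated rows into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def examples_for(examples, column, value):
    """Return every row whose given column equals the given value."""
    return [row for row in examples if row.get(column) == value]


examples = load_examples(io.StringIO(SAMPLE_TSV))
matches = examples_for(examples, "entity", "http://data.rbge.org.uk/herb/E00001237")
```

Adding an example then means adding a row to the spreadsheet, with no RDF or SPARQL involved.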

Version 1

Notes

By default the W3C annotation model seems the obvious candidate, but also look at simpler things such as bioschemas.org:

{
  "mainEntity": "URI for entity",
  "text": "text corresponding to entity (e.g., highlighted in markup)",
  "position": "location in publication",
  "subjectOf": "publication"
}
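To show how the simple record relates to the W3C model, here is a minimal sketch that maps one onto a Web Annotation-style structure. The function name `to_web_annotation` is hypothetical. The `position` field is left out because its format is not specified here.

```python
def to_web_annotation(record):
    """Map the simple record onto a W3C Web Annotation-style dict.

    The entity becomes the body, the publication becomes the target source,
    and the highlighted text becomes a TextQuoteSelector.
    """
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": record["mainEntity"],
        "target": {
            "source": record["subjectOf"],
            "selector": {
                "type": "TextQuoteSelector",
                "exact": record["text"],
            },
        },
    }


# Placeholder values, for illustration only.
simple = {
    "mainEntity": "http://specimens.kew.org/herbarium/K001116483",
    "text": "K001116483",
    "position": "unspecified",
    "subjectOf": "https://example.org/paper/1",
}
annotation = to_web_annotation(simple)
```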

Configuration

When developing locally, put “secrets” in env.php (which is not in the GitHub repo). When running on Heroku (or elsewhere), add the secret values as environment variables:

Key	Value
HYPOTHESIS_API_TOKEN	6879-06f…
BLAZEGRAPH_URL	http://167…
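The project reads these values in PHP. Purely as an illustration of the pattern, the Python sketch below loads the two keys from the environment and fails early if either is missing. The helper name `load_config` is an assumption.

```python
import os

# The two keys listed in the table above.
REQUIRED_KEYS = ("HYPOTHESIS_API_TOKEN", "BLAZEGRAPH_URL")


def load_config(environ=os.environ):
    """Return the required settings, raising if any are missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not environ.get(key)]
    if missing:
        raise RuntimeError("Missing configuration: " + ", ".join(missing))
    return {key: environ[key] for key in REQUIRED_KEYS}


# Usage with an explicit mapping, so the sketch runs without real secrets.
config = load_config({
    "HYPOTHESIS_API_TOKEN": "placeholder-token",
    "BLAZEGRAPH_URL": "http://localhost:9999",
})
```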

Annotations

To discuss: we are concerned with a subset of annotations, namely those where both body and target are external entities that have stable(ish) identifiers.

Displaying annotations

There are at least two problems to solve here. The first is displaying the actual annotations in situ in the document being annotated, as well as enabling them to be created and edited. This is the problem hypothes.is solves.

The second problem arises if we are using annotations to represent links between two entities, such as a specimen and a publication that mentions that specimen. Ideally we should be able to display the annotation on the web pages for BOTH entities.

One way to do this is to have a bookmarklet that injects HTML into the web page for an entity and displays annotations linked to that entity.

Tasks

Multiple representations

Annotations can be attached to multiple representations of the same thing, and hypothes.is doesn’t always record the DOI of a paper. Therefore we will need to call a service to convert a URL to an identifier.
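The easy case is a URL that already contains the DOI. The sketch below handles only that case. A URL with no embedded DOI would still need a lookup service, which is not shown. The function name and the regular expression are assumptions for this example.

```python
import re
from urllib.parse import unquote, urlparse

# A permissive pattern for DOIs embedded in a URL path (an assumption, not a spec).
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def doi_from_url(url):
    """Return the DOI embedded in the URL path, or None if there is none."""
    path = unquote(urlparse(url).path)
    match = DOI_PATTERN.search(path)
    # DOIs are case-insensitive, so normalise to lower case for comparison.
    return match.group(0).lower() if match else None


doi = doi_from_url("http://dx.doi.org/10.5281/zenodo.34966")
```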

Convert text to PID

Selected text needs to be parsed and converted to an identifier, so we need a service such as one that converts a specimen code to a PID.
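A minimal sketch of such a conversion, using the RBGE and Kew URL prefixes listed in the table of PIDs later in this README. The barcode patterns (`E` plus eight digits, `K` plus nine digits) are inferred from the single example given for each institution, so treat them as assumptions.

```python
import re

# (pattern, PID prefix) pairs. The digit counts are guesses based on one
# example barcode per institution.
RULES = [
    (re.compile(r"\bE\d{8}\b"), "http://data.rbge.org.uk/herb/"),
    (re.compile(r"\bK\d{9}\b"), "http://specimens.kew.org/herbarium/"),
]


def text_to_pids(text):
    """Return a PID for every specimen code recognised in the text."""
    pids = []
    for pattern, prefix in RULES:
        for code in pattern.findall(text):
            pids.append(prefix + code)
    return pids


pids = text_to_pids("Specimens examined: E00001237 and K001116483.")
```

Codes that identify a specimen only in context (for example, a collector number) would need more than pattern matching.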

Add annotations to triple store

Fetch annotations from annotation feed

Hypothes.is feed is

https://hypothes.is/stream.rss?user=username
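A small sketch of working with that feed: building the URL for a user and pulling the title and link out of each item. The sample XML is made up for this example and assumes ordinary RSS 2.0 `item`, `title` and `link` elements. The real feed may carry more fields. No network request is made here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def feed_url(username):
    """Build the feed URL for a hypothes.is user."""
    return "https://hypothes.is/stream.rss?" + urlencode({"user": username})


def parse_feed(rss_text):
    """Return the title and link of every item in an RSS document."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


# Made-up feed content for illustration.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item><title>Example annotation</title><link>https://example.org/a/1</link></item>
</channel></rss>"""

items = parse_feed(SAMPLE_RSS)
```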

Fetch annotations related to content

We need to be able to fetch annotations using source and body identifiers. For example, given a paper that is the source for one or more annotations, what are those annotations? Given a specimen that is the body of an annotation, what papers refer to that specimen?
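Both questions amount to looking up the same annotations from either end. Here is a minimal in-memory sketch, assuming each annotation is a dict with a `target` (the paper) and a `body` (the specimen). The class name and dict keys are hypothetical, and a real store would answer these queries with SPARQL or a table lookup.

```python
from collections import defaultdict


class AnnotationIndex:
    """Index annotations by both target and body identifier."""

    def __init__(self):
        self._by_target = defaultdict(list)
        self._by_body = defaultdict(list)

    def add(self, annotation):
        self._by_target[annotation["target"]].append(annotation)
        self._by_body[annotation["body"]].append(annotation)

    def annotations_for_target(self, target):
        """Given a paper, return the annotations made on it."""
        return list(self._by_target.get(target, []))

    def targets_for_body(self, body):
        """Given a specimen, return the papers that refer to it."""
        return sorted({a["target"] for a in self._by_body.get(body, [])})


index = AnnotationIndex()
index.add({"target": "https://example.org/paper/1",
           "body": "http://data.rbge.org.uk/herb/E00001237"})
index.add({"target": "https://example.org/paper/2",
           "body": "http://data.rbge.org.uk/herb/E00001237"})
```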

Fetch annotations as user changes view in document

User can scroll through a document, so we need ways to track that movement so we can display relevant annotations. For example, the basic unit of BHL is a scanned physical book, such as a journal volume. Each page has its own unique identifier (the BHL pageID), which makes it natural to link annotations to that PageID. However, the page being viewed by the user can change as they scroll through the document, so we need a mechanism for determining which page the reader is currently viewing so that we can display the appropriate annotation.

In the bookmarklet I use the MutationObserver interface to track whether the page being viewed has changed, then query the annotation store for any annotations for that page.

(What do we do with PDFs?)

Demonstration examples

Could use the NHM specimens example, where we have data (with a DOI: http://dx.doi.org/10.5281/zenodo.34966). We need to represent links between specimens as annotations, and think about how the annotation gives credit to its source. See https://www.w3.org/TR/annotation-model/#lifecycle-information for ideas.

Problems

NHM URLs have changed since Ross and Aime did their work. For example, NHM switched to HTTPS and replaced /specimen/ with /object/, meaning the specimen URLs no longer resolved. This has been fixed; see NaturalHistoryMuseum/ckanext-nhm#477. However, some URLs without versions (/\d+ appended to the end of the URL) can return 404: https://twitter.com/jrdhumphries/status/1278650609667911680
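When working with older datasets it can help to normalise legacy URLs before looking them up. The sketch below applies the two changes described above and checks for a trailing version. It assumes the host was `data.nhm.ac.uk` both before and after the change, and the function names are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

NHM_HOST = "data.nhm.ac.uk"  # assumed to be unchanged between old and new URLs


def normalise_nhm_url(url):
    """Rewrite an old-style NHM URL to HTTPS with /object/ in the path."""
    parts = urlsplit(url)
    if parts.netloc != NHM_HOST:
        return url  # leave other hosts untouched
    path = parts.path.replace("/specimen/", "/object/", 1)
    return urlunsplit(("https", parts.netloc, path, parts.query, parts.fragment))


def has_version(url):
    """Return True if the URL ends with a numeric version segment (/\\d+)."""
    return re.search(r"/\d+$", urlsplit(url).path) is not None


old = "http://data.nhm.ac.uk/specimen/31a84c68-6295-4e5b-aa0a-5c2844f1fb50"
new = normalise_nhm_url(old)
```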

Examples of PIDs and data sources

Institution	Data type	PID	Example	RDF	URL in meta
NHM	specimen	https://data.nhm.ac.uk/object/ + UUID	31a84c68-6295-4e5b-aa0a-5c2844f1fb50	yes (extension)	yes (previously no)
RBGE	specimen	http://data.rbge.org.uk/herb/ + barcode	E00001237	yes (content negotiation)	no
KEW	specimen	http://specimens.kew.org/herbarium/ + barcode	K001116483	yes (content negotiation) broken	no

See Implementers for list of CETAF sites.

NHM

NHM serves RDF as triples, Turtle, and JSON-LD, e.g. https://data.nhm.ac.uk/object/31a84c68-6295-4e5b-aa0a-5c2844f1fb50.n3 and https://data.nhm.ac.uk/object/31a84c68-6295-4e5b-aa0a-5c2844f1fb50.ttl. The RDF is flat Darwin Core.

NHM also serves JSON-LD, e.g. https://data.nhm.ac.uk/object/f62e09d5-1ee4-4c54-86b5-26f3cd9c8750

NHM added SEO-friendly tags, see NaturalHistoryMuseum/ckanext-nhm#476

RBGE

RBGE serves RDF (flat Darwin Core) that includes links to IIIF.

Kew

Kew serves RDF (flat Darwin Core). It is broken.

<dc:relation>
  <rdf:Description rdf:about="http://www.kew.org/herbcatimg/686057.jpg">
    <dc:identifier rdf:resource="http://www.kew.org/herbcatimg/686057.jpg"/>
    <dc:type rdf:resource="http://purl.org/dc/dcmitype/Image"/>
    <dc:subject rdf:resource="http://specimens.kew.org/herbarium/K001116483"/>
    <dc:format>image/jpeg</dc:format>
    <dc:description xml:lang="en">Image of herbarium specimen</dc:description>
    <dc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
  </rdf:Description>
</dc:relation>
<dwc:associatedMedia rdf:resource="http://www.kew.org/herbcatimg/686057.jpg"/>

National Gallery

https://data.ng-london.org.uk/0F6J-0001-0000-0000.json

NG6195

Science Museum

https://collection.sciencemuseumgroup.org.uk/api/objects/co8084947

https://collection.sciencemuseumgroup.org.uk/objects/co8084947/stephensons-rocket-steam-locomotive

Related projects

DiSSCo e.g. https://github.com/sharifX/pid-specimen-genbank

Recommendations for PID providers

These recommendations are aimed at making projects like this doable.

Make identifier discoverable within web page for item

The web page for an entity should include the persistent identifier in a machine-readable way. For example, academic publishers typically include the DOI for an article in a meta tag (see also Twitter thread https://twitter.com/rdmpage/status/1273274118293553155 )

Include persistent identifier in HEAD of web page
Use standard tag, e.g. canonical link, og:url, etc.
Ideally PID should be resolvable by both browser and machine, e.g. by supporting content negotiation
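To illustrate why these tags matter to a consumer, the sketch below scans a page for a canonical link, an `og:url` meta tag, or a `citation_doi` meta tag. The sample HTML is made up for this example and is not the markup of any real collection page. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PidFinder(HTMLParser):
    """Collect candidate identifiers from link and meta tags."""

    def __init__(self):
        super().__init__()
        self.candidates = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.candidates.setdefault("canonical", attrs.get("href"))
        elif tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key in ("og:url", "citation_doi"):
                self.candidates.setdefault(key, attrs.get("content"))


def find_pids(html):
    """Return a dict of the identifier-bearing tags found in the HTML."""
    finder = PidFinder()
    finder.feed(html)
    return finder.candidates


# Made-up page for illustration.
SAMPLE_HTML = """<html><head>
<link rel="canonical" href="https://data.nhm.ac.uk/object/31a84c68-6295-4e5b-aa0a-5c2844f1fb50">
<meta property="og:url" content="https://data.nhm.ac.uk/object/31a84c68-6295-4e5b-aa0a-5c2844f1fb50">
</head><body></body></html>"""

found = find_pids(SAMPLE_HTML)
```

If a page exposes none of these tags, a bookmarklet has to fall back on guessing the identifier from the URL, which is fragile.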

Use consistent data descriptions

For example, do museums serving RDF all use the same approach?

Reaction

If I was bitter I'd launch into a rant about how come people who diddle with semantic data fail to appreciate that the things which support their particular interests have to be balanced with other priorities by people maintaining production systems on a skeleton staff ;) @vsmithuk

