Export data as Darwin Core #256

Closed
59 tasks done
peterdesmet opened this issue Nov 21, 2022 · 28 comments

@peterdesmet
Member

peterdesmet commented Nov 21, 2022

Add function to export data as Darwin Core

Human Observations

  • type
  • license: provided by user, CC-BY / CC0
  • rightsHolder: provided by user
  • datasetID: URL including imis_dataset_id
  • institutionCode: VLIZ
  • collectionCode: ETN
  • datasetName: imis title
  • basisOfRecord
  • dataGeneralizations: NULL
  • occurrenceID
  • sex
  • lifeStage
  • occurrenceStatus
  • organismID
  • organismName
  • eventID
  • parentEventID
  • eventDate
  • samplingProtocol
  • samplingEffort
  • eventRemarks
  • locationID: NULL
  • locality
  • decimalLatitude
  • decimalLongitude
  • geodeticDatum
  • coordinateUncertaintyInMeters: 30
  • scientificNameID
  • scientificName
  • kingdom: Animalia

Detections

  • type
  • license
  • rightsHolder
  • datasetID
  • institutionCode
  • collectionCode
  • datasetName
  • basisOfRecord
  • dataGeneralizations: subsampled by hour: first of 3 record(s)
  • occurrenceID: detection id_pk
  • sex: continued from deployment (cf. Movebank)
  • lifeStage: not set, as it can change through lifetime
  • occurrenceStatus
  • organismID
  • organismName
  • eventID
  • parentEventID
  • eventDate
  • samplingProtocol: acoustic telemetry
  • eventRemarks: NULL
  • locationID: deployment_station_name
  • locality: location_name
  • decimalLatitude
  • decimalLongitude
  • geodeticDatum
  • coordinateUncertaintyInMeters: set to 1000m
  • scientificNameID
  • scientificName
  • kingdom
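
The dataGeneralizations value for detections implies that records are thinned to one per animal per hour, with the size of each hourly group noted. A minimal Python sketch of that idea, assuming a hypothetical record layout with `animal_id` and `datetime` keys (the function name is also made up for illustration and is not the package's implementation):

```python
from datetime import datetime

def subsample_by_hour(detections):
    """Keep the earliest detection per animal per hour and note the group size.

    `detections` is a list of dicts with the hypothetical keys `animal_id`
    and `datetime`. Records are sorted first, so the result does not rely on
    the input already being in chronological order.
    """
    groups = {}
    for det in sorted(detections, key=lambda d: (d["animal_id"], d["datetime"])):
        hour = det["datetime"].replace(minute=0, second=0, microsecond=0)
        groups.setdefault((det["animal_id"], hour), []).append(det)

    result = []
    for records in groups.values():
        first = dict(records[0])  # copy, so the input records stay untouched
        first["dataGeneralizations"] = (
            f"subsampled by hour: first of {len(records)} record(s)"
        )
        result.append(first)
    return result

# Made-up example data: three detections in one hour, one in the next.
detections = [
    {"animal_id": 304, "datetime": datetime(2015, 4, 1, 10, 40)},
    {"animal_id": 304, "datetime": datetime(2015, 4, 1, 10, 5)},
    {"animal_id": 304, "datetime": datetime(2015, 4, 1, 10, 55)},
    {"animal_id": 304, "datetime": datetime(2015, 4, 1, 11, 2)},
]
subsampled = subsample_by_hour(detections)
```

Sorting inside the function is a deliberate choice here: it makes "first" mean "earliest in the hour" regardless of how the source returns rows.
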
@peterdesmet peterdesmet added the enhancement New feature or request label Nov 21, 2022
@peterdesmet peterdesmet self-assigned this Nov 21, 2022
@peterdesmet
Member Author

peterdesmet commented Nov 21, 2022

@jreubens @jonasmortelmansvliz, some questions:

  1. Is there a standardized value to be used for license? Is it always CC0 (recommended) or CC-BY?
  2. Is it possible to retrieve the organization associated with an animal project? That would then be the rightsHolder
  3. Should an event be created for surgery (in addition to capture, release, recapture)? I believe we decided not to.
  4. We could create an event for recapture, but it won't have a lat, long or location (only date). Is this worth it?
  5. What is the standard coordinateUncertaintyInMeters for the human observations (capture, surgery, release, recapture)?
  6. What identifier should be used for the tag? Only the internal tag_id is unique across all tags, but it might not be immutable. Serial number or acoustic tag id might be better, but these are not unique across all tags. Note that we also use the internal identifier for the animal.

@peterdesmet
Member Author

I notice that the coordinating_organization as stored in the DB can be really long:

Vlaamse overheid; Beleidsdomein Omgeving; Instituut voor Natuur- en Bosonderzoek

vs

INBO

I think I'll include rights_holder as a user-provided value.

@jreubens
Collaborator

@peterdesmet regarding the questions:

  1. We use CC-BY as standard.
  2. Not sure how to implement this: a project is always linked to a group, but it can be linked to several groups, so the relationship is not one-to-one. Better to link it to an organization through the PI?
  3. Does this have added value?
  4. If it doesn't have coordinates, it doesn't have added value. Should we make recapture lat and long compulsory? (The problem is that we don't always know them.)
  5. Not certain what you mean here. I suppose standard GPS uncertainties.
  6. I would go for serial number. This should be unique per device (but a device can indeed have 2 tag IDs, linked to 2 sensors).

@peterdesmet
Member Author

Thanks!

  1. Ok, I will use CC-BY by default, but allow it to be set to CC0
  2. I will have the user provide a rightsHolder as part of the function (none by default)
  3. I can remove it if not useful
  4. I think recapture coordinates would be good to have in the database. I think I'll keep the record for now (even without coordinates), so users are aware that the animal was recaptured (the date is useful too)
  5. Ok, then I will use 30m (standard GPS), assuming all coordinates were captured with GPS
  6. Ok, I will investigate serial number
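
The choices for license and rightsHolder in the list above amount to two user-facing arguments with defaults. A rough Python sketch of that behaviour, where the function name `record_level_terms` and its argument names are hypothetical and only illustrate the defaults discussed here:

```python
ALLOWED_LICENSES = ("CC-BY", "CC0")

def record_level_terms(rights_holder=None, license="CC-BY"):
    """Return the record-level Darwin Core terms that the user can influence.

    Hypothetical helper: CC-BY is the default license, CC0 is the only
    alternative, and rightsHolder stays empty unless the user provides it.
    """
    if license not in ALLOWED_LICENSES:
        raise ValueError(
            f"license must be one of {ALLOWED_LICENSES}, got {license!r}"
        )
    return {
        "license": license,
        "rightsHolder": rights_holder,
        "institutionCode": "VLIZ",
        "collectionCode": "ETN",
    }

defaults = record_level_terms()
custom = record_level_terms(rights_holder="INBO", license="CC0")
```

Asking the user for a short rightsHolder such as INBO sidesteps the very long coordinating_organization strings stored in the database.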

Can archival tags be associated with an animal, or only acoustic tags?

@peterdesmet
Member Author

peterdesmet commented Nov 22, 2022

  6. Serial number is likely the best approach. Note that currently 3,291 of the 22,780 rows in the tag_device table have a duplicate serial_number (see "Some tag_device have the same serial_number: is this by design?" #173), but I notice comments such as "not ok, double serial_numbers with different attributes", so I'm guessing this will be resolved at some point?
  7. Is the full animal project title stored somewhere in the database?
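
For reference, a duplicate check of the kind mentioned for the tag_device table could look like the following Python sketch. The rows are made up and only the `serial_number` column name comes from this thread; note that some duplicates may be by design, since a device can carry two tag IDs linked to two sensors.

```python
from collections import Counter

def duplicated_serial_numbers(tag_devices):
    """Return {serial_number: row count} for serial numbers used more than once."""
    counts = Counter(row["serial_number"] for row in tag_devices)
    return {serial: n for serial, n in counts.items() if n > 1}

# Made-up rows standing in for the tag_device table.
tag_devices = [
    {"tag_device_id": 1, "serial_number": "1187449"},
    {"tag_device_id": 2, "serial_number": "1187449"},
    {"tag_device_id": 3, "serial_number": "1187450"},
]
duplicates = duplicated_serial_numbers(tag_devices)
rows_with_duplicates = sum(duplicates.values())
```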

@peterdesmet
Member Author

  8. @aubrivliz are acoustic.detections_limited records returned in chronological order (i.e. sorted on datetime within animal)?

@jreubens
Collaborator

Yes, archival tags are also associated with an animal. Each has a serial number, which is linked to the animal ID.
6. I'll check this in more detail with the data centre.
7. Yes, at project level we have 'code' and 'title': code is the short name, title the full name. @peterdesmet Can you access project metadata?

@peterdesmet
Member Author

  7. In common.projects we have name (2014 Demer) and code (2014_demer). The information I'm looking for is the title (below) as provided in IMIS. Is that information also accessible through the ETN DB, or do I need to query it from IMIS?

2014_DEMER - Acoustic telemetry data for four fish species in the Demer river (Belgium)

  1. For the archival tags, I'll have to test with an example to see if I get the correct data. Do you know one by heart?

@PieterjanVerhelst
Collaborator

@peterdesmet do you mean an example of a tag serial number? You can take sensor ID A16031 which is from an eel tagged in 2018.

@peterdesmet
Member Author

@PieterjanVerhelst thanks, I see that is a G5 pop-off tag with pressure and temperature sensors. In Darwin Core we express occurrences, i.e. observations/detections of an organism at a place and time. Am I correct in understanding that is not what archival tags record, unless they are a combined archival-acoustic tag?

@PieterjanVerhelst
Collaborator

@peterdesmet the archival tags indeed register pressure and temperature. They do not log positions. Positions are obtained based on the logged pressure and temperature data through a modelling method called 'geolocation'. So if you want tracking data obtained from the archival tags into Darwin Core, this will be processed data. Note that geolocation modelling requires certain assumptions and links to specific databases, so when these are changed, a slightly different trajectory can be obtained. Or in other words: there is some error on the position. It is not as accurate as acoustic telemetry.

@peterdesmet
Member Author

Great, in that case I am keeping processed position data from archival data out of scope for Darwin Core.

@jreubens
Collaborator

Regarding 7: this information goes through IMIS.
In the ETN project metadata, there is a link to the IMIS record.

@peterdesmet
Member Author

peterdesmet commented Nov 29, 2022

All fields are now mapped in #257. Remaining questions @jonpye @jreubens @jonasmortelmansvliz

  • samplingProtocol: acoustic detection or acoustic telemetry (so as not to confuse it with sound recording)? Decision: acoustic telemetry. @jonpye prefers it, and so do I.
  • parentEventID / eventID: currently consists of animal_id + tag_serial_number. Should the manufacturer be included with the serial number to avoid collisions, so 304_1187449 -> 304_VEMCO_1187449? Decision: not necessary, see Export data as Darwin Core #256 (comment)
  • coordinateUncertaintyInMeters for human observations: these positions of capture, release, ... are likely taken by GPS. Decision: assume coordinate precision of 0.001 degree (157m) and recording by GPS (30m)
  • coordinateUncertaintyInMeters for acoustic detections: here we have the uncertainty associated with the GPS coordinates of the receiver and the distance of the fish to the receiver. This can depend on many conditions. What would be a good fixed (upper) value for this uncertainty: 300m, 500m, 1000m? Decision: assume coordinate precision of 0.001 degree (157m), recording by GPS (30m) and a detection range of around 800m, together ≈ 1000m, see Export data as Darwin Core #256 (comment)
  • type is currently set to Event (current DwC recommendation). Is there a preference for a more specific EventType? Decision: none of the values suggested at "Agree on way to name events in Event Core (Type vocabulary)" iobis/env-data#4 (comment) fit, so sticking with Event
  • lifeStage can currently contain verbatim values, such as FV, smolt. Should this be left as is or standardized to https://registry.gbif.org/vocabulary/LifeStage/concepts? Decision: controlled, see Map lifeStage to controlled vocab in DwC #262
  • organismID: should this be 304 or etn:304? Decision: 304, see Export data as Darwin Core #256 (comment)
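
The uncertainty figures in the two coordinateUncertaintyInMeters items can be checked with a little arithmetic. The Python sketch below is an illustration only: it assumes that the 157m figure is the diagonal of a 0.001 x 0.001 degree cell at the equator (one common way to arrive at that number) and that the sources of uncertainty are simply summed and rounded up to the next 100m. Neither assumption is stated in this thread.

```python
import math

# Approximate length of one degree of latitude or longitude at the equator.
METERS_PER_DEGREE = 111_320

def precision_uncertainty_m(precision_deg):
    """Diagonal of a precision_deg x precision_deg cell at the equator, in meters."""
    return math.sqrt(2) * precision_deg * METERS_PER_DEGREE

def uncertainty_budget_m(precision_deg=0.001, gps_m=30, detection_range_m=0):
    """Sum coordinate precision, GPS error and detection range, in meters."""
    return precision_uncertainty_m(precision_deg) + gps_m + detection_range_m

# Coordinate precision plus GPS error only.
human_observations = uncertainty_budget_m()
# Same, plus an assumed detection range of about 800m.
detections = uncertainty_budget_m(detection_range_m=800)
# Round up to the next 100m to get a round upper bound.
detections_rounded = math.ceil(detections / 100) * 100
```

Under these assumptions the detection budget comes to roughly 987m, which rounds up to the 1000m discussed for acoustic detections.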

@peterdesmet
Member Author

peterdesmet commented Nov 29, 2022

@jdpye comments:

If tag serial number is just numeric, there's a very good chance two manufacturers will use (or have already used) the same ### for a tag. If characters are cheap, we could use a short string to identify them: vemco-12343465, thelm-213512, lotek-2435321 etc.

Here are all the possible manufacturers:

VEMCO
THELMA BIOTEL
CHELONIA
OCEANINSTRUMENTS
LOTEK
RTSYS
B&K
SEICHE
SONOTRONICS
CEFAS TECHNOLOGY LIMITED
INNOVASEA
DESERT STAR
AANDERAA
STAR-ODDI

Are there some we should shorten?

@peterdesmet
Member Author

@jdpye, here's how the identifiers would differ if we include the manufacturer:

Without manufacturer:

parentEventID | eventID             | samplingProtocol
------------- | ------------------- | ------------------
304_1187449   | 304_1187449_capture | capture
304_1187449   | 304_1187449_release | release
304_1187449   | 21676626            | acoustic telemetry
304_1187449   | 20744955            | acoustic telemetry

With manufacturer:

parentEventID       | eventID                   | samplingProtocol
------------------- | ------------------------- | ------------------
304_vemco:1187449   | 304_vemco:1187449_capture | capture
304_vemco:1187449   | 304_vemco:1187449_release | release
304_vemco:1187449   | 21676626                  | acoustic telemetry
304_vemco:1187449   | 20744955                  | acoustic telemetry

Note that within a dataset the identifiers are unique in both cases. Adding the manufacturer would only solve the use case where we want to combine multiple datasets in which an animal_id + serial_number combination looks identical but actually refers to tags from different manufacturers. I would argue that such namespace clashes are likely rare and can be solved by taking the source (e.g. datasetID or collectionCode=ETN) into account.
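
To make the two identifier schemes concrete, here is a small Python sketch that builds parentEventID and the capture/release eventIDs from animal_id and tag serial number, with the manufacturer prefix as an option. The function name `event_ids` is hypothetical; the identifier patterns follow the examples in this comment.

```python
def event_ids(animal_id, tag_serial_number, manufacturer=None):
    """Build parentEventID and the capture/release eventIDs for one animal+tag.

    With `manufacturer` set, the serial number is prefixed with the lowercased
    manufacturer name, e.g. "vemco:1187449".
    """
    tag = str(tag_serial_number)
    if manufacturer:
        tag = f"{manufacturer.lower()}:{tag}"
    parent = f"{animal_id}_{tag}"
    return {
        "parentEventID": parent,
        "capture": f"{parent}_capture",
        "release": f"{parent}_release",
    }

without_manufacturer = event_ids(304, 1187449)
with_manufacturer = event_ids(304, 1187449, manufacturer="VEMCO")
```

Detection events keep the detection identifier as eventID in both schemes, so only the parent and the human observation events change.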

@jdpye

jdpye commented Nov 30, 2022

oops, I see these now, I'm @jdpye here and @jonpye on Google-side.

I would say 1000m (or more) from a center point is reasonable. If you are looking for a flat upper bound for coordinateUncertaintyInMeters, the accepted range of a high-powered tag in open water is around that. We place gates 800m apart to cover most cases of degraded signal transmission. This is a highly variable situation, with range testing potentially able to provide a better answer, so with the caveat that we're going to try and do better if the research programme can tell us a better number, we can stick with 1000m as a generic 'acoustic telemetry' upper bound. This paper that describes an open-water experiment has a nice figure and a more nuanced view of how reasonably likely detections at each distance would be before taking into account other factors. https://animalbiotelemetry.biomedcentral.com/articles/10.1186/s40317-017-0142-y

In river systems and turbid waters you get a lot less detectability and a lot of dependency on environmental conditions: https://link.springer.com/article/10.1007/s10750-021-04556-3

So, how much of this can we even capture with a single (high?) number? And where we can do better, a paragraph in the metadata about how the range estimation was done at each station could be included. There are examples in https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13322 of how to extrapolate conditions that drive variability once you can characterize them from a few exemplar stations, for example, and this could be one approach researchers use to quantify their predicted station range.

My take is, if they give us expected ranges, report those, and if they don't, shoot high on the detection variability, 1000m or even higher. And describe in the metadata what the source of the coordinateUncertainty was, a round high estimate based on the technologies in play, or a specific characterized estimate based on what the researchers were able to calculate.
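
The rule in the last paragraph (report the researchers' expected range where there is one, otherwise fall back to a high generic value and say so in the metadata) could be sketched in Python as follows. The function name, the `expected_range_m` mapping and the station names are hypothetical; only the 1000m fallback comes from this discussion.

```python
DEFAULT_DETECTION_UNCERTAINTY_M = 1000

def detection_uncertainty(station_name, expected_range_m=None):
    """Return (coordinateUncertaintyInMeters, source description) for a station.

    `expected_range_m` is an optional, hypothetical mapping of station name to
    the detection range in meters estimated by the researchers.
    """
    expected_range_m = expected_range_m or {}
    if station_name in expected_range_m:
        return (
            expected_range_m[station_name],
            "station-specific range estimate provided by the researchers",
        )
    return (
        DEFAULT_DETECTION_UNCERTAINTY_M,
        "generic upper bound for acoustic telemetry",
    )

# Made-up station names and range estimate.
ranges = {"station-A": 350}
known = detection_uncertainty("station-A", ranges)
unknown = detection_uncertainty("station-B", ranges)
```

Returning the source alongside the value keeps the information needed for the metadata paragraph next to the number it explains.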

@peterdesmet
Member Author

Thanks @jdpye, I'll have the write_dwc() function use 1000m as the default upper limit and refer to your comment. Users can always alter the output of write_dwc() to be more precise.

Any feedback on default coordinateUncertainty for human obs (capture, release, ...) and regarding those identifiers?

@jdpye

jdpye commented Nov 30, 2022

I don't have any info that would overrule the standard commercial GPS precision being ~30m, so I'd be happy to roll with that.
coordinatePrecision might be an interesting companion to this: if data creators are recording only a few decimal places of location for their release/deployment points, would we hash that out in coordinatePrecision instead?

@diniangela

For the standard vocabularies for lifestage, we at OTN hold a link to the NERC vocabulary: http://vocab.nerc.ac.uk/collection/S11/current/

@peterdesmet
Member Author

peterdesmet commented Nov 30, 2022

coordinateUncertainty for human obs
@jdpye We could go for 30m, but apparently I opted for 1000m for Movebank data, since the origin of the coordinates is often not known. I think the same is true here. 1000m is likely to capture most uncertainty, but thoughts welcome (and then I will change both). I wouldn't add coordinatePrecision, since that is often unknown as well (and you can't always reliably tell from the coordinate decimals).

coordinateUncertainty for detections
@jdpye Do you think it is worth adding this as a parameter to the function, with 1000m as default?

lifestage
@diniangela ok, creating a separate issue for mapping: #262

@jdpye

jdpye commented Nov 30, 2022

I like the idea of having a parameter for the function, but I can see the scope creeping. Would the user provide a different blanket value for the whole export, or a conditional blanket value, or be required to sub in their own per-instrument or per-event column of data?

The default is what nearly all existing data will be published with so we all definitely have to be happy with the 1000m being an acceptable signal that 'we don't know better'.

@peterdesmet
Member Author

peterdesmet commented Dec 1, 2022

@jdpye you're right, the user would just provide a blanket value, so I won't add a parameter. I'll set the value for the detections at 1000m (with the user always having the option to improve upon that before publishing).

@peterdesmet
Member Author

Another item @jdpye and I discussed on Slack is whether organismID should be prepended with a namespace to avoid name clashes. E.g. within a system like ETN, the identifier 304 is unique, but that is likely not the case when mixed with other datasets on e.g. OBIS/GBIF. In the absence of a central registry of animal identifiers (like WikiData), one idea is to express the organismID as etn:304 to make it globally unique.

I'm personally not a fan of this idea for two reasons:

  1. Users (or scripts) are likely to search for the original identifier (304) to find records in OBIS/GBIF. If the namespace is included (etn:304) in the published data, they won't find any results.
  2. Users are likely familiar with getting multiple datasets for a certain filter. They can then filter on datasetID=x or collectionCode=ETN to get the intended scope anyway.

I would therefore suggest including the original identifier as is in the published data.
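
The two reasons above can be illustrated with a toy lookup in Python. The records, the dataset identifiers and the `find` helper are all made up for the example; the point is only how searching behaves with and without a namespace prefix.

```python
def find(records, **filters):
    """Return the records whose fields equal all of the given filter values."""
    return [
        r for r in records
        if all(r.get(key) == value for key, value in filters.items())
    ]

# Hypothetical aggregated records from two publishers that both use "304".
published_as_is = [
    {"datasetID": "etn-dataset", "collectionCode": "ETN", "organismID": "304"},
    {"datasetID": "other-dataset", "collectionCode": "OTHER", "organismID": "304"},
]
# The same ETN record, published with a namespace prefix instead.
published_prefixed = [
    {"datasetID": "etn-dataset", "collectionCode": "ETN", "organismID": "etn:304"},
]

all_matches = find(published_as_is, organismID="304")
etn_matches = find(published_as_is, organismID="304", collectionCode="ETN")
prefixed_matches = find(published_prefixed, organismID="304")
```

Searching for the original identifier finds both records, adding collectionCode narrows it to the ETN one, and the prefixed variant is not found at all by a search for 304.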

/cc @timrobertson100 @tucotuco @sarahcd

@timrobertson100

Thanks @peterdesmet

I think it's reasonable to say that GBIF would currently (always?) have to assume the identifiers are local and make use of approaches that use datasetID + organismID - simply because you know it can't be enforced across such a wide variety of publishers.

If you cared to join across ETN datasets, then having a prefix may help (i.e. your etn:304 example), but that may still have collisions with outsiders and most likely there would be alternative ways to make the join (e.g. something along the lines of organismID=304 AND network=ETN).

Aside: For identifier schemes like DOI, ARK, LSID etc. it may be more reasonable to assert relationships across publishers as it's more likely they do indeed mean the same thing, but that isn't what you're considering here.

Does that help?

@peterdesmet
Member Author

@timrobertson100 thanks, yes. It aligns with my thinking that using the original (non-prefixed) organismID is probably the best approach.

If you cared to join across ETN datasets, then having a prefix may help ...

We already make sure that the original (non-prefixed) organismID is unique within ETN. I.e. there is only one 304 across all datasets/studies in ETN, because it's the internal identifier assigned by the database (human assigned identifiers being unreliable). Adding etn: here doesn't change anything.

@peterdesmet
Member Author

All questions/issues (#256 (comment)) are resolved, closing issue.

peterdesmet added a commit that referenced this issue Dec 5, 2022
@sarahcd

sarahcd commented Dec 6, 2022

Belatedly, I agree with reasoning from @peterdesmet and @timrobertson100
