Export data as Darwin Core #256

peterdesmet · 2022-11-21T17:26:10Z

peterdesmet · 2022-11-21T17:27:15Z

@jreubens @jonasmortelmansvliz, some questions:

1. Is there a standardized value to be used for license? Is it always CC0 (recommended) or CC-BY?
2. Is it possible to retrieve the organization associated with an animal project? That would then be the rightsHolder.
3. Should an event be created for surgery (in addition to capture, release, recapture)? I believe we decided not to.
4. We could create an event for recapture, but it won't have a lat, long or location (only date). Is this worth it?
5. What is the standard coordinateUncertaintyInMeters for the human observations (capture, surgery, release, recapture)?
6. What identifier should be used for the tag? Only the internal tag_id is unique across all tags, but it might not be immutable. Serial number or acoustic tag id might be better, but are not unique for all tags. Note that we also use the internal identifier for the animal.

peterdesmet · 2022-11-21T17:54:55Z

I notice that the coordinating_organization as stored in the DB can be really long:

Vlaamse overheid; Beleidsdomein Omgeving; Instituut voor Natuur- en Bosonderzoek

vs

INBO

I think I'll include rights_holder as a user-provided value.

jreubens · 2022-11-22T13:01:16Z

@peterdesmet regarding the questions:

1. We use CC-BY as standard.
2. Not sure how to implement this: a project is always linked to a group (but it can be linked to different groups), so it is not one-to-one. Would it be better to link it to an organization through the PI?
3. Does this have added value?
4. If it doesn't have coordinates, it doesn't have added value. Should we make recapture lat and long compulsory? (The problem is that we don't always know them.)
5. Not certain what you mean here. I suppose standard GPS uncertainties.
6. I would go for serial number; this should be unique per device (but a device can indeed have 2 tag IDs, linked to 2 sensors).

peterdesmet · 2022-11-22T13:16:39Z

Thanks!

1. Ok, I will use CC-BY by default, but allow users to set it to CC0.
2. I will have the user provide a rightsHolder as part of the function (none by default).
3. I can remove it if not useful.
4. I think recapture coordinates would be good to have in the database. I think I'll keep the record for now (even without coordinates), so users are aware that the animal was recaptured (the date is useful too).
5. Ok, then I will use 30m (standard GPS), assuming all coordinates were captured with GPS.
6. Ok, I will investigate serial number.

Can archival tags be associated with an animal, or only acoustic tags?

peterdesmet · 2022-11-22T13:44:14Z

6. Serial number is likely the best approach. Note that currently 3,291 of the 22,780 rows in the tag_device table have duplicate serial_number (see Some tag_device have the same serial_number: is this by design? #173), but I notice comments such as "not ok, double serial_numbers with different attributes", so I'm guessing this will be resolved at some point?
7. Is the full animal project title stored somewhere in the database?
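For illustration, a check for repeated serial numbers could look like the sketch below. It is not the query used to obtain the counts above: the rows are invented stand-ins for the tag_device table, and the column name `tag_device_id` is an assumption.

```python
from collections import Counter

# Hypothetical rows standing in for the tag_device table. Only the columns
# needed for the check are shown; tag_device_id is an assumed column name
# and all values are illustrative.
tag_devices = [
    {"tag_device_id": 1, "serial_number": "1187449"},
    {"tag_device_id": 2, "serial_number": "1187449"},
    {"tag_device_id": 3, "serial_number": "1190000"},
    {"tag_device_id": 4, "serial_number": "A16031"},
]


def rows_with_duplicate_serial(rows):
    """Return the rows whose serial_number occurs more than once."""
    counts = Counter(row["serial_number"] for row in rows)
    return [row for row in rows if counts[row["serial_number"]] > 1]


duplicates = rows_with_duplicate_serial(tag_devices)
print(f"{len(duplicates)} of {len(tag_devices)} rows share a serial_number")
```

Rows flagged this way would need manual review, since a shared serial number can be legitimate (one device with two sensors) or a data entry problem.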

peterdesmet · 2022-11-22T16:50:59Z

@aubrivliz are acoustic.detections_limited records returned in chronological order (i.e. sorted on datetime within animal)?

jreubens · 2022-11-22T19:09:15Z

Yes, archival tags are also associated with an animal. Each has a serial number which is linked to the animal ID.
6. I'll check this in more detail with the data centre.
7. Yes, at project level we have 'code' and 'title': code is the short name, title the full name. @peterdesmet Can you access project metadata?

peterdesmet · 2022-11-23T08:38:36Z

In common.projects we have the information name (2014 Demer) and code (2014_demer). The information I'm looking for is the title (below) as provided in IMIS. Is that information also accessible through the ETN DB or do I need to query that from IMIS?

2014_DEMER - Acoustic telemetry data for four fish species in the Demer river (Belgium)

For the archival tags, I'll have to test with an example to see if I get the correct data. Do you know one by heart?

PieterjanVerhelst · 2022-11-23T09:07:19Z

@peterdesmet do you mean an example of a tag serial number? You can take sensor ID A16031 which is from an eel tagged in 2018.

peterdesmet · 2022-11-23T09:13:49Z

@PieterjanVerhelst thanks, I see that is a G5 pop-off tag with pressure and temperature sensors. In Darwin Core we express occurrences, i.e. observations/detections of an organism at a place and time. Am I correct in understanding that is not what archival tags record, unless they are a combined archival-acoustic tag?

PieterjanVerhelst · 2022-11-23T09:18:09Z

@peterdesmet the archival tags indeed register pressure and temperature. They do not log positions. Positions are obtained based on the logged pressure and temperature data through a modelling method called 'geolocation'. So if you want tracking data obtained from the archival tags into Darwin Core, this will be processed data. Note that geolocation modelling requires certain assumptions and links to specific databases, so when these are changed, a slightly different trajectory can be obtained. Or in other words: there is some error on the position. It is not as accurate as acoustic telemetry.

peterdesmet · 2022-11-23T09:19:17Z

Great, in that case I am keeping processed position data from archival data out of scope for Darwin Core.

jreubens · 2022-11-29T11:14:05Z

regarding 7. this information goes through IMIS.
In the ETN project metadata, there is a link to the IMIS record

peterdesmet · 2022-11-29T11:29:55Z

All fields are now mapped in #257. Remaining questions @jonpye @jreubens @jonasmortelmansvliz

samplingProtocol: acoustic detection or acoustic telemetry (to avoid confusion with sound recording)? @jonpye prefers acoustic telemetry, me too.
parentEventID / eventID: currently consists of animal_id + tag_serial_number. Should the manufacturer be included with the serial number to avoid collisions? So 304_1187449 -> 304_VEMCO_1187449? Answer: not necessary, see Export data as Darwin Core #256 (comment).
coordinateUncertaintyInMeters for human observations: these positions of capture, release, ... are likely taken by GPS. Answer: assume coordinate precision of 0.001 degree (157m) and recording by GPS (30m).
coordinateUncertaintyInMeters for acoustic detections: here we have the uncertainty associated with the GPS coordinates of the receiver and the distance of the fish to the receiver. This can depend on many conditions. What would be a good fixed (upper) value for this uncertainty? 300m, 500m, 1000m? Answer: assume coordinate precision of 0.001 degree (157m), recording by GPS (30m) and detection range of around 800m, which together come to ≈ 1000m, see Export data as Darwin Core #256 (comment).
type is currently set to Event (current DwC recommendation). Is there a preference for a more specific EventType? Answer: none of the values suggested at Agree on way to name events in Event Core (Type vocabulary) iobis/env-data#4 (comment) fit, so sticking with Event.
lifeStage currently contains verbatim values, such as FV, smolt. Should this be left as is or standardized to https://registry.gbif.org/vocabulary/LifeStage/concepts? Answer: controlled, see Map lifeStage to controlled vocab in DwC #262.
organismID: should this be 304 or etn:304? Answer: 304, see Export data as Darwin Core #256 (comment).
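The arithmetic behind the uncertainty values in this list can be sketched as follows. The thread gives the three components (157m, 30m, 800m) but not how 157m was derived; the diagonal-of-a-rounding-cell formula and the length of one degree used here are assumptions that happen to reproduce that number.

```python
import math

# Approximate length of one degree at the equator, in meters (assumption).
METERS_PER_DEGREE = 111_320


def precision_uncertainty_m(precision_deg):
    """Worst-case offset when both latitude and longitude are rounded
    to precision_deg: the diagonal of a precision_deg square."""
    return math.sqrt(2) * precision_deg * METERS_PER_DEGREE


precision_m = precision_uncertainty_m(0.001)  # roughly 157 m
gps_m = 30                                    # GPS error quoted in the thread
detection_range_m = 800                       # detection range quoted in the thread

human_observation_m = precision_m + gps_m
detection_m = precision_m + gps_m + detection_range_m

print(round(precision_m), round(human_observation_m), round(detection_m))
```

The detection total lands just under 1000m, which is consistent with rounding up to 1000m as a fixed upper value.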

peterdesmet · 2022-11-29T16:39:48Z

@jdpye comments:

if tag serial number is just numeric, there's a very good chance two manufacturers will (or already have) used the same ### for a tag. If characters are cheap, we could use a short string to identify them. vemco-12343465, thelm-213512, lotek-2435321 etc.

Here are all the possible manufacturers:

VEMCO
THELMA BIOTEL
CHELONIA
OCEANINSTRUMENTS
LOTEK
RTSYS
B&K
SEICHE
SONOTRONICS
CEFAS TECHNOLOGY LIMITED
INNOVASEA
DESERT STAR
AANDERAA
STAR-ODDI

Are there some we should shorten?

peterdesmet · 2022-11-30T08:14:39Z

@jdpye, here's how the identifiers would differ if we include the manufacturer:

parentEventID | eventID             | samplingProtocol
------------- | ------------------- | ------------------
304_1187449   | 304_1187449_capture | capture
304_1187449   | 304_1187449_release | release
304_1187449   | 21676626            | acoustic telemetry
304_1187449   | 20744955            | acoustic telemetry

parentEventID       | eventID                   | samplingProtocol
------------------- | ------------------------- | ------------------
304_vemco:1187449   | 304_vemco:1187449_capture | capture
304_vemco:1187449   | 304_vemco:1187449_release | release
304_vemco:1187449   | 21676626                  | acoustic telemetry
304_vemco:1187449   | 20744955                  | acoustic telemetry

Note that within a dataset the identifiers are unique in both cases. Adding the manufacturer would only solve the use case where we want to combine multiple datasets where the animal_id+serial_number combination looks identical, but is actually different and they originate from a different manufacturer. I would argue that such clashes in name spaces are likely rare and can be solved by taking the source (e.g. datasetID or collectionCode=ETN) into account.
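The two identifier schemes in the tables above can be sketched as below. The function names are hypothetical and this is not the write_dwc() implementation; it only reproduces the patterns shown for capture and release events. Detection events keep their own detection identifier as eventID and are not covered.

```python
def parent_event_id(animal_id, serial_number, manufacturer=None):
    """Build animal_id + tag serial number, optionally namespacing the
    serial number with a lowercased manufacturer prefix."""
    if manufacturer:
        tag = f"{manufacturer.lower()}:{serial_number}"
    else:
        tag = str(serial_number)
    return f"{animal_id}_{tag}"


def event_id(animal_id, serial_number, protocol, manufacturer=None):
    """Build the eventID for a human observation such as capture or release."""
    return f"{parent_event_id(animal_id, serial_number, manufacturer)}_{protocol}"


print(parent_event_id(304, 1187449))
print(event_id(304, 1187449, "capture", manufacturer="VEMCO"))
```

Either variant is unique within one dataset; the prefix only matters when records from several sources are pooled without a datasetID or collectionCode to tell them apart.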

jdpye · 2022-11-30T14:22:55Z

oops, I see these now, I'm @jdpye here and @jonpye on Google-side.

I would say 1000m (or more) from a center point is reasonable. If you are looking for a flat upper bound for coordinateUncertaintyInMeters, the accepted range of a high-powered tag in open water is around that. We place gates 800m apart to cover most cases of degraded signal transmission. This is a highly variable situation, with range testing potentially able to provide a better answer, so with the caveat that we're going to try and do better if the research programme can tell us a better number, we can stick with 1000m as a generic 'acoustic telemetry' upper bound. This paper that describes an open-water experiment has a nice figure and a more nuanced view of how reasonably likely detections at each distance would be before taking into account other factors. https://animalbiotelemetry.biomedcentral.com/articles/10.1186/s40317-017-0142-y

In river systems and turbid waters you get a lot less detectability and a lot of dependency on environmental conditions: https://link.springer.com/article/10.1007/s10750-021-04556-3

So, how much of this can we even capture with a single (high?) number? And where we can do better, a paragraph in the metadata about how the range estimation was done at each station could be included. There are examples in https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13322 of how to extrapolate conditions that drive variability once you can characterize them from a few exemplar stations, for example, and this could be one approach researchers use to quantify their predicted station range.

My take is, if they give us expected ranges, report those, and if they don't, shoot high on the detection variability, 1000m or even higher. And describe in the metadata what the source of the coordinateUncertainty was, a round high estimate based on the technologies in play, or a specific characterized estimate based on what the researchers were able to calculate.
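A minimal sketch of that rule, assuming a hypothetical helper that is not part of write_dwc(): use the researchers' range estimate when one exists, otherwise fall back to the generic upper bound, and return a note on the source so it can go into the metadata.

```python
# Generic upper bound discussed in this thread, in meters.
DEFAULT_DETECTION_UNCERTAINTY_M = 1000


def detection_uncertainty(station_range_m=None):
    """Return (uncertainty in meters, description of its source).

    station_range_m is a hypothetical researcher-provided range estimate
    for a station; None means no estimate is available.
    """
    if station_range_m is not None:
        return station_range_m, "range estimate provided by the researchers"
    return DEFAULT_DETECTION_UNCERTAINTY_M, "generic upper bound for acoustic telemetry"


print(detection_uncertainty())
print(detection_uncertainty(350))
```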

peterdesmet · 2022-11-30T14:34:15Z

Thanks @jdpye, I'll have the write_dwc() function use 1000m as the default upper limit and refer to your comment. Users can always alter the output of the write_dwc() to be more precise.

Any feedback on default coordinateUncertainty for human obs (capture, release, ...) and regarding those identifiers?


jdpye · 2022-11-30T14:38:05Z

I don't have any info that would overrule the standard commercial GPS precision being ~30m, I'd be happy to roll with that.
coordinatePrecision might be an interesting companion to this, if data creators are recording only a few decimal places of location for their release/deployment points, would we hash that out in coordinatePrecision instead?

diniangela · 2022-11-30T14:41:58Z

For the standard vocabularies for lifestage, we at OTN hold a link to the NERC vocabulary: http://vocab.nerc.ac.uk/collection/S11/current/

peterdesmet · 2022-11-30T14:47:36Z

coordinateUncertainty for human obs
@jdpye We could go for 30m, but apparently I opted for 1000m for Movebank data, since the origin of the coordinates is often not known. I think the same is true here. 1000m is likely to capture most uncertainty, but thoughts welcome (and then I will change both). I wouldn't add coordinatePrecision, since that is often unknown as well (and you can't always reliably tell from the coordinate decimals).

coordinateUncertainty for detections
@jdpye Do you think it is worth adding this as a parameter to the function, with 1000m as default?

lifestage
@diniangela ok, creating a separate issue for mapping: #262

jdpye · 2022-11-30T16:01:52Z

i like the idea of having a parameter for the function, but i can see the scope creeping. Would the user provide a different blanket value for the whole export, or a conditional blanket value, or be required to sub in their own per instrument or per event column of data?

The default is what nearly all existing data will be published with so we all definitely have to be happy with the 1000m being an acceptable signal that 'we don't know better'.

peterdesmet · 2022-12-01T09:42:21Z

@jdpye you're right, the user would just provide a blanket value, so I won't add a parameter. I'll set the value for the detections at 1000m (with the user always having the option to improve upon that before publishing).

peterdesmet · 2022-12-02T08:54:20Z

Another item @jdpye and I discussed on slack is whether organismID should be prepended with a name space to avoid name clashes. E.g. within a system like ETN, the identifier 304 is unique, but that is likely not the case when mixed with other datasets on e.g. OBIS/GBIF. In the absence of a central registry of animal identifiers (like WikiData) one idea is to express the organismID as etn:304 to make it globally unique.

I'm personally not a fan of this idea for two reasons:

Users (or scripts) are likely to search for the original identifier (304) to find records in OBIS/GBIF. If the namespace is included (etn:304) in the published data, they won't find any results.
Users are likely familiar with getting multiple datasets for a certain filter. They can then filter on datasetID=x or collectionCode=ETN to get the intended scope anyway.

I would therefore suggest to include the original identifier as is in the published data.
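The scoping argument can be illustrated with a small sketch. The records and the helper name are invented for the example; real OBIS/GBIF queries would go through their own APIs, which are not shown here.

```python
# Invented records from two sources that happen to reuse the identifier 304.
records = [
    {"collectionCode": "ETN", "organismID": "304"},
    {"collectionCode": "OTHER", "organismID": "304"},
    {"collectionCode": "ETN", "organismID": "305"},
]


def find_organism(rows, organism_id, collection_code=None):
    """Match on the original organismID, optionally narrowed by collectionCode."""
    return [
        row
        for row in rows
        if row["organismID"] == organism_id
        and (collection_code is None or row["collectionCode"] == collection_code)
    ]


print(len(find_organism(records, "304")))         # matches from every source
print(len(find_organism(records, "304", "ETN")))  # narrowed to the ETN scope
```

Searching on the unprefixed identifier still finds the animal, and the extra filter removes the clash without changing the published organismID.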

/cc @timrobertson100 @tucotuco @sarahcd

timrobertson100 · 2022-12-02T09:42:03Z

Thanks @peterdesmet

I think it's reasonable to say that GBIF would currently (always?) have to assume the identifiers are local and make use of approaches that use datasetID + organismID - simply because you know it can't be enforced across such a wide variety of publishers.

If you cared to join across ETN datasets, then having a prefix may help (i.e. your etn:304 example), but that may still have collisions with outsiders and most likely there would be alternative ways to make the join (e.g. something along the lines of organismID=304 AND network=ETN).

Aside: For identifier schemes like DOI, ARK, LSID etc. it may be more reasonable to assert relationships across publishers as it's more likely they do indeed mean the same thing, but that isn't what you're considering here.

Does that help?

peterdesmet · 2022-12-02T10:28:25Z

@timrobertson100 thanks, yes. It aligns with my thinking that using the original (non-prefixed) organismID is probably the best approach.

If you cared to join across ETN datasets, then having a prefix may help ...

We already make sure that the original (non-prefixed) organismID is unique within ETN. I.e. there is only one 304 across all datasets/studies in ETN, because it's the internal identifier assigned by the database (human assigned identifiers being unreliable). Adding etn: here doesn't change anything.

peterdesmet · 2022-12-02T14:50:44Z

All questions/issues are resolved (#256 (comment)), closing issue.

sarahcd · 2022-12-06T14:32:53Z

Belatedly, I agree with reasoning from @peterdesmet and @timrobertson100

peterdesmet added the enhancement label Nov 21, 2022

peterdesmet self-assigned this Nov 21, 2022

peterdesmet mentioned this issue Nov 23, 2022

Add write_dwc() function #257

Merged

peterdesmet added a commit that referenced this issue Nov 30, 2022

Set coordinateUncertainty for detections at 1000

80b89d6

See #256

peterdesmet mentioned this issue Nov 30, 2022

Map lifeStage to controlled vocab in DwC #262

Closed

peterdesmet closed this as completed Dec 2, 2022

peterdesmet added a commit that referenced this issue Dec 5, 2022

Set coordinateUncertainty

5841ab7

See #256

PietrH mentioned this issue Dec 7, 2022

with changes to helper functions, collapse_transformer() is no longer in use #264

Closed

MathewBiddle mentioned this issue Dec 7, 2022

adding R nc to DwC notebook ioos/ioos_code_lab#13

Merged

jdpye mentioned this issue Aug 21, 2023

[dataset]: Animal Satellite Telemetry data ioos/bio_data_guide#145

Open
