BDQ Core - VOCABULARY of terms #152
Much (or all) of the vocabulary will come out of the framework as a technical specification, probably with additional supporting vocabularies (such as values for the data quality dimensions). There is still a need to express the tests themselves as a formal specification (s.l.) and move this towards a TDWG standard. |
There may (probably) be terms associated with the Tests and Assertions beyond the Framework. The Framework's terms are broader than just Darwin Core - the TG2 tests need to define some terms that are outside the Framework (CORE is one that comes to mind). Whether it makes sense or not to expand the Framework terms to cover these is probably worth discussing. |
I propose that the framework should have a distinct vocabulary product consisting of the framework terms and their controlled vocabularies of values. I propose that a task group be spawned specifically to create this. Tests and assertions rely on these for rigorous definition, so to me it has the highest priority as a new vocabulary. |
@tucotuco sounds like a deliverable from TG1. |
Agree! |
Are there terms that we may be using in TG2 that are outside of the Framework, such that we need a separate vocabulary (but where most terms link to the Framework, or to Darwin Core)? I will go through the Tests and pull out a list of terms that I think may need definitions. |
We do have terms in TG2 that are beyond the Framework. I'll work with @ArthurChapman to generate the list. |
In the PSSR-CORE (Citizen Science) document from 2017 (https://www.wilsoncenter.org/sites/default/files/wilson_171204_meta_data_f2.pdf), one of the tasks mentioned relates to this work. Peter Brenton is going to send me the name of someone from that working group with whom we should liaise. |
@Tasilee @pzermoglio and others. What columns do we require in our draft DQ vocabulary? Some initial suggestions: Term | Definition | Source | Reference | Link | GUID |
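For concreteness, a minimal sketch (Python standard library only; the term, GUID, and URL values below are placeholders, not proposed entries) of what one row using these suggested columns might look like when written to a CSV file:

```python
# Sketch only: writes a single placeholder row using the columns suggested above.
import csv

COLUMNS = ["Term", "Definition", "Source", "Reference", "Link", "GUID"]

sample_row = {
    "Term": "bdq:Example",                          # placeholder term
    "Definition": "An illustrative definition.",    # placeholder text
    "Source": "TG2",
    "Reference": "Veiga et al. 2017",
    "Link": "https://example.org/bdq/Example",      # placeholder URL
    "GUID": "00000000-0000-0000-0000-000000000000", # placeholder GUID
}

with open("draft_dq_vocabulary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerow(sample_row)
```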
Let me suggest the list of columns found in the header row of this Audubon Core source document: https://github.com/tdwg/rs.tdwg.org/blob/master/audubon/audubon-column-mappings.csv In particular:
We should also check skos for appropriate skos terms for source, reference, and link. |
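As a rough illustration of that check, a sketch using rdflib. The mapping shown (skos:definition for Definition, dcterms:source for Source, rdfs:seeAlso for Link) is an assumption to be verified against SKOS, not a decided choice, and the namespace IRI and term are placeholders:

```python
# Sketch: expressing one vocabulary entry with SKOS / Dublin Core / RDFS properties.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDFS, SKOS

BDQ = Namespace("https://rs.tdwg.org/bdq/terms/")  # placeholder namespace IRI

g = Graph()
g.bind("skos", SKOS)
g.bind("dcterms", DCTERMS)

term = BDQ["ExampleTerm"]  # placeholder term
g.add((term, SKOS.prefLabel, Literal("Example Term", lang="en")))
g.add((term, SKOS.definition, Literal("An illustrative definition.", lang="en")))
g.add((term, DCTERMS.source, Literal("TG2")))                                   # candidate for "Source"
g.add((term, RDFS.seeAlso, URIRef("https://example.org/docs/ExampleTerm")))     # candidate for "Link"

print(g.serialize(format="turtle"))
```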
Thanks @chicoreus - I will look at that. We have a few different processes - the DQ Vocabulary - and that will depend a lot on what @pzermoglio comes up with. TG1 - Vocabulary will form a major part of the Vocabulary. I am looking at extracting the terms from the tests and just want to make sure we capture what we need at this stage so that we can then add the terms to the main Vocabulary we develop and not have to revisit things later. |
Those columns of @ArthurChapman are 95% the table from my keynote :) and ah yes, there is SKOS |
For Darwin Core, the full set of columns to manage the term definitions, usages, and examples is: iri,label,definition,comments,examples,organized_in,issued,status,replaces,rdf_type,term_iri,abcd_equivalence,flags |
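A small sketch (the file name is hypothetical) of reading and sanity-checking a term-metadata CSV that uses those Darwin Core management columns:

```python
# Sketch: verify the expected columns are present, then iterate over term rows.
import csv

EXPECTED = [
    "iri", "label", "definition", "comments", "examples", "organized_in",
    "issued", "status", "replaces", "rdf_type", "term_iri",
    "abcd_equivalence", "flags",
]

with open("term_versions.csv", newline="", encoding="utf-8") as f:  # hypothetical file
    reader = csv.DictReader(f)
    missing = [c for c in EXPECTED if c not in (reader.fieldnames or [])]
    if missing:
        print("Missing columns:", ", ".join(missing))
    else:
        for row in reader:
            print(row["iri"], row["status"])
```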
From @tucotuco: "To me it is clear that the Framework will result in a vocabulary that should be made into a standard. To me this is separate from a possible standard arising from the tests and assertions. These are two distinct products to me, with the latter relying on the former, thus increasing the priority of the former." "Steve Baskauf clarified that the TDWG Standards Documentation Standard (SDS; https://www.tdwg.org/standards/sds/, in its single document, the "TDWG Standards Documentation Specification") describes how to create Data standards (for Vocabularies) as well as Best Current Practices documents. The Vocabularies of Values Best Current Practices Document must conform with that document, just as any vocabularies of values must also conform to the specifications set out in the SDS. The DQIG believes that a Vocabularies of Values Best Current Practices document is needed to provide more specific and common guidance on vocabularies of values construction and maintenance - for example, guidance on the type of vocabulary to use (Thesaurus, Vocabulary, Dictionary, Ontology, etc.), and how to deal with synonymy, multiple languages, etc." I suggest then a rename of this issue to "TG2-" and tags. Alan, Miles...can create a separate issue. :) |
But the text for the CORE project here states that it is: "Links to Confirmed Tests and Assertions arising out of Task Group 2." If we expand CORE beyond this, we still have to define the new use cases, in detail, likely a several-year task. The best route is to keep CORE to the research uses of what organisms occurred where and when use case that came out of TG3, and specify an additional category of tests that aren't core. But if they are being put forward as part of the standard, they will need to have use cases to hang them off of. |
I would like to propose an alternative approach, as none of us can afford years more of working on this before proposing a standard. I want to repeat my plea for simplicity and a decoupling of tests from use cases. I see the set of tests as parallel to the Darwin Core bag of terms, and the use cases as parallel to the distinct Darwin Core Archive cores in terms of combining Darwin Core terms for a particular purpose. I think the tests should not depend on use cases. It should be the other way around. I think a use case should be a level of construct (a profile, we called it before) that brings together a set of tests on a declared set of Darwin Core terms and that can declare data quality measures based on their values. With this approach we could make the one occurrence use case based on the TG3 work a model to show how that is done, and let future work define new use cases as demand arises. These could be the stuff of task groups, and would be more tractable the less monolithic we make the standard. This is how I have thought about the BDQ work since the beginning, and ever more so now that the tribulations of coupled tests and use cases are creating a seemingly insurmountable obstacle. This would leave us free of the otherwise somewhat arbitrary and controversial distinctions of CORE, SUPPLEMENTARY and DO NOT IMPLEMENT. The ones we finalize for the use case would become part of the standard set of tests. The rest just remain documented in GitHub (with their labels, also documented), but nowhere in the standard documentation. Having all of the rest in the standard just seems like noise to me. Put in the standard that which is mature and useful for a given purpose and leave the rest as a solid basis for future work if demand arises. |
I 100% agree with @tucotuco. TG3 was a proof of concept and was never meant to be a comprehensive set of tests. Like @tucotuco: "This is how I have thought about the BDQ work since the beginning, and ever more so now that the tribulations of coupled tests and use cases are creating a seemingly insurmountable obstacle." I also have, until recently, seen the test types as "somewhat arbitrary and controversial distinctions of CORE, SUPPLEMENTARY and DO NOT IMPLEMENT. The ones we finalize for the use case would become part of the standard set of tests." Originally I treated these as basically: 1) a good test we can implement practically and that will be useful (CORE); 2) tests that, for the reasons stated in their definitions, are not ready to be CORE tests and that we only keep in GitHub so as not to waste the work we've already done - not for the Standard, but they may be mentioned (as a group, not individually) in the document for documentation purposes (SUPPLEMENTARY); 3) tests that are close to being CORE but need more work because something is missing - e.g., a suitable Vocabulary (Immature/Incomplete) - note that a couple may become CORE before we release the Standard if suitable Vocabularies become available; and 4) tests that for some reason we believe should not be implemented, as doing so could lead to ambiguous or misleading results (DO NOT IMPLEMENT). I know that I, and others, are getting to the limit of our capacity to continue with this work and want to see it finished - so let's keep it simple. I support the suggestions of @tucotuco and would vote to continue in that direction - not getting bogged down on things that will not go in the Standard, including both non-CORE tests and detailed Use Cases for each test. We all know Use Cases exist for the tests, but to fully document them all now would take another two years of work at least (103 tests @ conservatively 2 days work per test = 206 days work!). |
The problem is that we cannot describe tests within the framework without attaching them to UseCases. Fitness for purpose is a fundamental of the framework, and all tests within the framework descend from a UseCase. As long as CORE matched up with the broad use case that came out of TG3 (research uses of data describing what organisms occurred where and when), we could ignore this, as all of the tests we were working on hung off of that use case. The moment we expanded to consider tests outside of that scope, we were forced to define additional use cases. For supplementary tests, we can probably get away with something very skeletal that later users of those tests would replace with more clearly specified use cases. But we can't get away with this for the set of tests now in CORE.
To use the framework, we cannot decouple the tests from use cases. We must provide at least one very clear example of how a set of tests links to a use case. This will be central to the normative RDF representation of the tests.
We can think of tests as a grab bag, and the framework enables people to assemble tests from that grab bag to fit a use case. But central to the standard must be a demonstration of how to do that. We had that with CORE when it precisely overlapped the broad use case. We don't right now.
@tucotuco is probably pointing us in the right direction with a grab bag of tests, parallel to Darwin Core terms, that can be assembled as needed by users for their use cases. But for this to work, and for us to be able to use the framework, we have to clearly specify at least one use case and link a set of tests to that use case to show how this is done for both quality assurance and quality control; see the sketch below.
|
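As a sketch of the "grab bag plus profile" idea: tests stand alone (like Darwin Core terms), and a use case or profile is a named selection of them. The data structure, function, and use case identifier below are hypothetical illustrations, not the framework's RDF representation; the test labels are drawn from the TG2 test suite.

```python
# Sketch: a use case as a profile assembling tests from the documented grab bag.
from dataclasses import dataclass, field


@dataclass
class UseCaseProfile:
    name: str
    description: str
    test_labels: set[str] = field(default_factory=set)


ALL_TESTS = {
    "VALIDATION_COUNTRYCODE_STANDARD",
    "VALIDATION_COORDINATES_NOTZERO",
    "AMENDMENT_EVENTDATE_STANDARDIZED",
    # ... the full grab bag of documented tests
}

occurrence_use_case = UseCaseProfile(
    name="bdq:OccurrenceUseCase",  # placeholder identifier
    description="Research uses of data about what organisms occurred where and when.",
    test_labels={
        "VALIDATION_COUNTRYCODE_STANDARD",
        "VALIDATION_COORDINATES_NOTZERO",
    },
)

# Tests selected for a use case must come from the documented grab bag.
assert occurrence_use_case.test_labels <= ALL_TESTS
```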
It's interesting to see everyone's perspectives in this, I appreciate this discussion, thank you! I don't know enough of what TG3 did to comment on the use cases, but I would like to share my perspective when I was mapping the checks from OBIS with the tests here:
The use case that we did in the OBIS data quality project team:
Another thing that I think MAY be helpful is to clarify what CORE is NOT. I used to think that CORE tests are the minimal set of tests needed to evaluate the fitness for use of a record regardless of the use case (basically the minimal set of tests that overlap any biodiversity use case), but I believe that is not the case? (please correct me if I am wrong) For example, the newly added tests for pathway (#277, #278). I don't know if these thoughts are helpful, please ignore them if they aren't. Thank you all SO MUCH for your hard work, I know time does not come cheap - I am so thankful to have the opportunity to work with you all! |
Thanks @ymgan - great points and very helpful. One thing that springs to mind from your comments is that we can't document all use cases - if we followed the suggestions of @chicoreus, we would be making a random selection of a use case that certainly would not cover all cases. We currently have Examples that imply a use case, and we cite where each test originated (ALA, VertNet, etc.). @chicoreus - as said before, TG3 was never meant to be comprehensive, but an exemplar or proof of concept. I attended all the early meetings of TG3 in setting it up, and most of the meetings and discussions. It was a proof of concept looking at how Use Cases could be developed, and from that came the use of User Stories. Part way through, it was decided to link to the Framework, and several use cases were tested in conjunction with @allankv. TG3 was not comprehensive and was never intended to be, and the majority of the TG2 tests were never covered by TG3. TG2 from the start was looking at a good set of tests, based on Darwin Core, that would be "Fundamental tests of biodiversity data represented in Darwin Core terms that are widely applicable, informative, and straight forward to implement." We looked at what had been done by ALA, GBIF, iDigBio, CRIA, BISON, VertNet and others. There was never an idea of linking the tests directly to the Use Cases that came out of TG3; we had most of the TG2 tests prepared long before TG3 started to get any results. For now we should just accept CORE as: "The set of mature tests that TG2 is putting forward as part of the standard." This is what is meant by "Darwin Core terms that are widely applicable, informative, and straight forward to implement". Perhaps, in the Document, we can have a section on adding future tests that includes a workflow: document a Use Case, determine whether the test is "widely applicable, informative, and straight forward to implement", then follow the existing template, develop the test for implementation, then test the implementation, etc. |
Just back, briefly. I fully agree with @tucotuco and @ArthurChapman regarding the circumscription of TG2 by TG3: our tests are not bound by TG3 use cases. Our definition of CORE has been basically, as all have stated, "Tests that are widely applicable, informative, and straight forward to implement", with one exception: tests that we believe are 'aspirational' in encouraging a better best current practice (e.g., annotations). I (strongly) believe that it is also informative to define what is not CORE (out of scope of the standard), as it helps to clarify what is CORE and documents the environment to inform future uses. Thanks @ymgan for your comments. Our 'Supplementary', 'Immature/Incomplete' and 'Do not implement' categories are useful and are now adequately documented. Like Arthur (as he well knows), I am also close to burnout on this work. We need to 'cut to the chase': fill in gaps within the current CORE tests (e.g., test data, which I will do, and implementations) and get the standard document prepared. |
Altered definitions of bdqtag: terms CORE, Supplementary, Immature/Incomplete, and DO NOT IMPLEMENT following recent discussions via email. |
Added 5 new bdqffdq:UseCase terms for
|
Most of the bdq:Response contexts aren't correct; here is a set of corrections to be applied:
|
…f 2024-08-17. Edits to python markdown generation scripts to load vocabularies and display description of use cases before use case indexes in markdown documents. Regenerating markdown documents.
I've made an export of the vocabulary markdown table into https://github.com/tdwg/bdq/blob/master/tg2/vocabularies/combined_vocabulary.csv to give us something more easily sorted to look for inconsistencies and problems, and to start setting up to add vocabulary terms into various markdown documents. As a demonstration of linking in vocabulary terms, I've added the definitions of the use cases to the index by use case section of: https://github.com/tdwg/bdq/blob/master/tg2/core/generation/docs/core_tests.md For the time being, the markdown table in this issue remains the authoritative copy for editing, and we expect to overwrite the csv export. |
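For illustration, a rough sketch of that export step. This is not the project's actual generation script; it assumes a simple pipe-delimited markdown table with a header row and a `|---|---|` separator row.

```python
# Sketch: export a markdown table (as found in a GitHub issue) into a CSV file.
import csv


def markdown_table_to_csv(markdown_lines, csv_path):
    rows = []
    for line in markdown_lines:
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip prose between tables
        cells = [c.strip() for c in line.strip("|").split("|")]
        if all(c and set(c) <= {"-", ":", " "} for c in cells):
            continue  # skip the header/body separator row
        rows.append(cells)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


# Example with a tiny placeholder table:
markdown_table_to_csv(
    ["| Term | Definition |", "|---|---|", "| bdq:Example | An example. |"],
    "combined_vocabulary.csv",
)
```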
Following advice from @chicoreus, "Context" changed to bdqTestField:Term-Actions for the following terms bdq:ASSUMEDDEFAULT and Context for |
Added new term | bdq:AllValidationTestsRunOnSingleRecord | AllValidationTestsRunOnSingleRecord | A list of Core Validation Tests that have been run on a Single Record. | bdqffdq:InformationElements | Used in Measure of Single Record Tests | |
Added new term | bdq:AllAmendmentTestsRunOnSingleRecord | AllAmendmentTestsRunOnSingleRecord | A list of Amendments that have been run on a Single Record. | bdqffdq:InformationElements | Used in Measure of Single Record Tests | |
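A rough sketch (structure and names hypothetical, not the specification) of how a single-record Measure might consume bdq:AllValidationTestsRunOnSingleRecord, counting the COMPLIANT validations for one record:

```python
# Sketch: a measure over the validations run on a single record.
def measure_validations_compliant(all_validation_results):
    """all_validation_results: mapping of validation label -> Response.result
    (e.g. 'COMPLIANT' or 'NOT_COMPLIANT') for one record."""
    return sum(1 for r in all_validation_results.values() if r == "COMPLIANT")


print(measure_validations_compliant({
    "VALIDATION_COUNTRYCODE_STANDARD": "COMPLIANT",
    "VALIDATION_COORDINATES_NOTZERO": "NOT_COMPLIANT",
}))  # -> 1
```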
Added new term | bdq:assumptionOnUnknownHabitat | assumptionOnUnknownHabitat | Used when a bdq:taxonomyIsMarine source authority is unable to assert the marine or non-marine status of a taxon, the habitat (Marine/NonMarine) to assume instead or NoAssumption. | bdq:Parameter | See VALIDATION_COORDINATES_TERRESTRIALMARINE (b9c184ce-a859-410c-9d12-71a338200380). | |
Changed bdq:assumptionOnUnknownHabitat to bdq:assumptionOnUnknownBiome |
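An illustrative sketch (the function and value handling are assumptions, not the test specification) of how an implementation of VALIDATION_COORDINATES_TERRESTRIALMARINE might branch on the bdq:assumptionOnUnknownBiome parameter when the source authority cannot assert the marine or non-marine status of a taxon:

```python
# Sketch: resolving the habitat to use when the source authority is silent.
def resolve_habitat(authority_response, assumption_on_unknown_biome):
    """Return 'Marine', 'NonMarine', or None when no assumption is to be made.

    authority_response: 'Marine', 'NonMarine', or None if the source
    authority cannot assert the status of the taxon.
    assumption_on_unknown_biome: 'Marine', 'NonMarine', or 'NoAssumption'
    (the bdq:assumptionOnUnknownBiome parameter value).
    """
    if authority_response is not None:
        return authority_response
    if assumption_on_unknown_biome == "NoAssumption":
        return None  # the validation would then report a non-conclusive result
    return assumption_on_unknown_biome


# Example: the authority is silent and the parameter says to assume NonMarine.
print(resolve_habitat(None, "NonMarine"))  # -> NonMarine
```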
…r markdown tables in issues not quite aligned with the standard exported from #152 as an informal glossary file, to go as tables in a supplemental document on the rationale management process for test development.
I saw bdq/tg2/_review/BDQ_Core_Introduction.md, line 179 in 3f1e1d4:
Is NOT_REPORTED being used by any MEASURE please? If so, it is not in the vocab |
Thanks @ArthurChapman ! Then I guess we need a bdq:NOT_REPORTED, it is not in the table above |
We are just taking that term out of the test (#31), because it does not make sense. So that can be deleted from the document. |
On Thu, 22 Aug 2024 08:32:47 -0700, Yi-Ming Gan ***@***.***> wrote:
> Thanks @ArthurChapman ! Then I guess we need a bdq:NOT_REPORTED, it is not in the table above
Turns out we don't. It is a path that can't be reached in that test, and it isn't a framework response.status value.
|
The Vocabulary terms in this file have been split into other files - a bdqdim vocabulary, a bdqffdq vocabulary, a bdq:directory, and a glossary - and these files are being generated as csv files and markdown for the final BDQ Core Standard. They are currently generated and are in the _review folder. As such, this file is no longer maintained. |
Terms in the bdqffdq namespace are from the Fitness for Use Framework (Veiga et al. 2017). Use the reference to the Framework Definitions for more details and examples. The use of a vocabulary term in a test specification without a namespace prefix (sometimes represented in all UPPER CASE) implies that the bdq: or bdqffdq: namespace is applicable. Note that wherever "DQ" is used in a definition it implies "Data Quality", and wherever "FFU Framework" is used it refers to the "Fitness for Use Framework" (Veiga et al. 2017).
Note: There are two tables in this issue. The first is the vocabulary for the standard; the second is additional terms for supplement files that will go into tables in those documents rather than into controlled vocabularies.
Do not edit, moved to csv files
Pending further splits, this vocabulary moved to https://github.com/tdwg/bdq/blob/master/tg2/vocabularies/combined_vocabulary.csv
Supplement: GitHub Label Terms
These are terms that are outside the Standard but that have been used as either GitHub Labels or TestFields in the BDQ GitHub.
Do not edit, moved to csv files
**Pending further moves, this vocabulary moved to https://github.com/tdwg/bdq/blob/master/tg2/vocabularies/glossary_terms.csv**