Schematron bug, related to ISO 15511 regular expression pattern #549

Open
2 of 9 tasks
fordmadox opened this issue Jul 30, 2022 · 5 comments

@fordmadox
Member

fordmadox commented Jul 30, 2022

While testing the new schematron file for EAC 2.0, I noticed that the regex borrowed from the EAD3 schematron has a small bug. For example, the following value is valid according to the EAD3 schematron:

US-oclc-12345678901

However, that is a fake 19-character code, which should NOT be valid. That same 19-character code is, correctly, not valid in either EAD2002 or EAC 1.0.

I am going to recreate that pattern for EAC 2.0 by following, essentially, the EAD2002 model, which does validate the country code, when present. Since we are validating the country code elsewhere, it seems like we should do that here, as well, rather than just using a two-character match pattern for that. Anyhow, here's the current EAD3 regex:

(^([A-Z]{2})|([a-zA-Z]{1})|([a-zA-Z]{3,4}))(-[a-zA-Z0-9:/-]{1,11})$

Whereas that should probably be (though NOT tested):

^(([A-Z]{2})|([a-zA-Z]{1})|([a-zA-Z]{3,4}))(-[a-zA-Z0-9:/-]{1,11})$
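The only difference between the two patterns is where `^` sits: in the current one it sits inside the first alternative, so the one-letter and three-to-four-letter alternatives are not anchored to the start of the string. Here is a minimal sketch of the effect, using Python's `re` module as a stand-in for the Schematron regex engine. It assumes that `re.search` behaves like the unanchored matching used in the Schematron test; the helper name `is_match` is made up for illustration.

```python
import re

# Both patterns are copied verbatim from this issue.
CURRENT = r"(^([A-Z]{2})|([a-zA-Z]{1})|([a-zA-Z]{3,4}))(-[a-zA-Z0-9:/-]{1,11})$"
PROPOSED = r"^(([A-Z]{2})|([a-zA-Z]{1})|([a-zA-Z]{3,4}))(-[a-zA-Z0-9:/-]{1,11})$"


def is_match(pattern, value):
    """Return True if the pattern matches anywhere in value.

    Assumption: re.search mirrors the unanchored matching of the
    Schematron engine; this helper is hypothetical.
    """
    return re.search(pattern, value) is not None


for value in ("US-oclc-12345678901", "XX-1"):
    print(value, is_match(CURRENT, value), is_match(PROPOSED, value))
```

Under that assumption, the current pattern accepts `US-oclc-12345678901` because `oclc-12345678901` satisfies the unanchored third alternative, while the proposed pattern rejects it. Both patterns still accept `XX-1`, since neither one checks the country code itself.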

To decide:

Should we:

  1. update the regex as is, so that invalid codes up to 19 characters long will no longer validate (the maximum length is 16 characters)?
  2. update the regex to ensure that a country code, when present, is also valid (as was done with EAD2002, and will be done in the new approach)?
  3. ignore this bug altogether (outside of documenting it) since it likely does not impact anyone at all?
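To make option 2 concrete, here is a rough sketch of checking the overall shape first and then looking up a two-letter prefix in a country code list. It is written in Python rather than Schematron, and it is not the EAD2002 implementation: `KNOWN_COUNTRY_CODES` is a tiny illustrative subset, and `check_agency_code` and the value `US-1` are made up for the example.

```python
import re

# Illustrative subset only (assumption): a real check would use the full
# ISO 3166-1 alpha-2 list that the schematron already validates against.
KNOWN_COUNTRY_CODES = {"DE", "FR", "US"}

# A prefix of 1 to 4 letters, a hyphen, then 1 to 11 identifier characters
# (same character class as the EAD3 pattern), anchored at both ends.
ISIL_SHAPE = re.compile(r"^([A-Za-z]{1,4})-([A-Za-z0-9:/-]{1,11})$")


def check_agency_code(value):
    """Hypothetical helper: shape check plus country code lookup."""
    m = ISIL_SHAPE.match(value)
    if m is None:
        return False
    prefix = m.group(1)
    # A two-letter prefix is treated as a country code and must be known.
    if len(prefix) == 2:
        return prefix in KNOWN_COUNTRY_CODES
    return True


print(check_agency_code("US-1"))                 # hypothetical value
print(check_agency_code("XX-1"))
print(check_agency_code("US-oclc-12345678901"))
```

With this subset, `XX-1` and the 19-character example are both rejected, while a short code with a listed prefix passes. Whether a user-assigned prefix such as `XX` belongs in the list is a separate decision.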

Another example: right now, the following is also valid in EAD3:

XX-1

Whereas that same fake code is correctly not valid in EAD2002 (though it is in EAC-CPF 1.0, which switched to a pure regex validation).

Creator of issue

The issue relates to

  • EAC-CPF schema issue
  • EAC-CPF Tag Library issue
  • EAD schema issue
  • EAD Tag Library issue
  • Schema issue
  • Tag Library issue
  • Suggestions for all schemas
  • Suggestions for all Tag Libraries
  • Other

@fordmadox
Member Author

fordmadox commented Jul 30, 2022

Just to follow up, I tested swapping the first two characters of the current regex (so that it starts with ^( rather than (^), and that does indeed fix the issue.

@fordmadox
Member Author

See SAA-SDT/eas-schematrons@afe49d1 for the patch.

I'm still planning to update this in the new Schematron to use the country codes, however.

@kerstarno
Contributor

Hi @fordmadox,

I agree that we should restrict a repository/maintenance agency code that is declared to be ISO 15511 compliant to a maximum of 16 characters. However, XX actually is a valid country code, as it falls within the ranges that can be user-assigned. What it stands for might differ from one context to another, but against ISO 3166-1 it is valid.

"User-assigned codes - If users need code elements to represent country names not included in ISO 3166-1, the series of letters AA, QM to QZ, XA to XZ, and ZZ, and the series AAA to AAZ, QMA to QZZ, XAA to XZZ, and ZZA to ZZZ respectively, and the series of numbers 900 to 999 are available." (https://www.iso.org/glossary-for-iso-3166.html)

@fordmadox
Member Author

fordmadox commented Aug 1, 2022

@kerstarno Regarding "XX" and ISO 3166, or any codes reserved for private use (e.g. 'qab' in ISO 639-2), I wonder if we should still flag those as invalid.

Given that there is no agreement about what those codes can represent, shouldn't we expect a user to record their usage within the control section, and also set the code list to "otherCountryEncoding"?

That country code (not to mention the numeric equivalents, and the 3-character user-assigned options) was never valid in EAD2002 nor EAD3... though any 2-character A-Z code could be used in the agency code heading in EAD3, which would make this type of error especially odd:

        <maintenanceagency countrycode="XX">
            <agencycode>XX-1</agencycode>
        </maintenanceagency>

Where the maintenanceagency element is invalid in EAD3, but the agencycode element is valid!

Quite the mixed message, there 😄

Also, it looks like the regular expression test in EAC 1.0 for country codes was limited to any 2-character or 4-character A-Z code.

Given all that, I do prefer EAD3's approach to the country code validation (not the ISIL one, though, due to the discrepancy highlighted above).

@kerstarno
Contributor

@fordmadox - I see your point about it not being clear what "XX" (or any other of these user-assigned codes) stands for specifically, but they are part of ISO 3166, so "otherCountryEncoding" would not necessarily be correct, I'd say.

Also, with the officially assigned codes we only check whether they are part of the ISO standard, we don't necessarily relate them to the appropriate country names, right? I mean, for validation, we don't really care, whether "XX" stands for "Country A" or "Country B", do we?

Maybe there's a possibility to let these codes validate, but to flag them as user-assigned? Same as we discussed with regard to deprecated codes?
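As a possible starting point for such a flag, here is a small sketch that recognises the two-letter user-assigned ranges quoted earlier in this thread (AA, QM to QZ, XA to XZ, and ZZ). It is Python rather than Schematron, the function name `is_user_assigned_alpha2` is made up, and the three-letter and numeric ranges are left out.

```python
def is_user_assigned_alpha2(code):
    """Hypothetical helper: True for the user-assigned two-letter codes
    AA, QM to QZ, XA to XZ, and ZZ, as listed in the ISO 3166 glossary
    quoted in this thread."""
    # Only plain uppercase ASCII letter pairs can qualify.
    if len(code) != 2 or not code.isascii() or not code.isalpha() or not code.isupper():
        return False
    if code in ("AA", "ZZ"):
        return True
    first, second = code
    if first == "Q":
        return "M" <= second <= "Z"
    # XA to XZ covers every code that starts with X.
    return first == "X"


for code in ("XX", "QM", "QL", "US"):
    print(code, is_user_assigned_alpha2(code))
```

A validator could let these codes pass while attaching a warning instead of an error, much like the treatment discussed for deprecated codes.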

2 participants