RUPS gives no visual indication of duplicate dictionary keys #34

petervwyatt · 2022-11-15T18:36:37Z

Yes, PDFs with duplicate keys are not allowed by ISO 32000, but it would be useful to have some form of visualization so that users know duplicate keys may be the cause of some other error, for example when the keys have different values. At the moment this is invisible...

Example PDF: https://assets.devoted.com/plan-documents/2022/DH-DisenrollmentForm-2022-ENG.pdf
Object 19 has two /h.32tc8hbyo16k keys with different values. AFAICT RUPS chooses the last key in the dictionary.


MatthiasValvekens · 2022-11-15T18:52:42Z

Without looking at the code: I'm almost completely sure that RUPS simply lets iText Core do (most of) the heavy lifting when it comes to parsing, so RUPS only ever sees iText representations of PDF objects. I'm not sure how easy this feature would be to add without making sweeping changes in Core.

One way I could see this being implemented is by using a "secondary" parser specifically written to collect information about problems with object serialisation/representation. The code probably won't be pretty, though...
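To make the "secondary" parser idea concrete, here is a minimal sketch of a scanner that walks the raw text of a single PDF object and reports dictionary keys that occur more than once. It is not RUPS or iText code: the class `DuplicateKeyScanner` and its method names are hypothetical, and it assumes the object has already been extracted as a string with no stream data attached. It also compares names literally, so `#xx` escapes in names are not normalised.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: scans raw PDF object text for duplicate dictionary keys. */
public class DuplicateKeyScanner {
    private final String src;
    private int pos;
    private final List<String> duplicates = new ArrayList<>();

    private DuplicateKeyScanner(String src) { this.src = src; }

    /** Returns every key that repeats within one dictionary, in order of detection. */
    public static List<String> findDuplicateKeys(String rawObject) {
        DuplicateKeyScanner s = new DuplicateKeyScanner(rawObject);
        while (s.pos < s.src.length()) s.readObject();
        return s.duplicates;
    }

    /** Consumes one token or composite object; returns its text only if it is a name. */
    private String readObject() {
        skipWhitespace();
        if (pos >= src.length()) return null;
        char c = src.charAt(pos);
        if (src.startsWith("<<", pos)) { pos += 2; readDictionary(); return null; }
        if (c == '<') { // hex string: skip to the closing '>'
            int end = src.indexOf('>', pos);
            pos = end < 0 ? src.length() : end + 1;
            return null;
        }
        if (c == '(') { readLiteralString(); return null; }
        if (c == '[') { pos++; readArray(); return null; }
        int start = pos;
        if (c == '/') pos++;
        while (pos < src.length() && isRegular(src.charAt(pos))) pos++;
        if (pos == start) pos++; // stray delimiter: skip it so the scan always advances
        return c == '/' ? src.substring(start, pos) : null;
    }

    private void readDictionary() {
        Set<String> seen = new HashSet<>();
        boolean expectKey = true;
        while (true) {
            skipWhitespace();
            if (pos >= src.length()) return;
            if (src.startsWith(">>", pos)) { pos += 2; return; }
            String name = readObject();
            if (!expectKey) {
                expectKey = true;                 // a value was consumed
            } else if (name != null) {
                if (!seen.add(name)) duplicates.add(name);
                expectKey = false;
            }
            // Non-name tokens seen while expecting a key are ignored; this
            // absorbs the trailing "0 R" of an indirect reference like "12 0 R".
        }
    }

    private void readArray() {
        while (true) {
            skipWhitespace();
            if (pos >= src.length()) return;
            if (src.charAt(pos) == ']') { pos++; return; }
            readObject(); // dictionaries nested in arrays are still checked
        }
    }

    private void readLiteralString() {
        int depth = 0;
        while (pos < src.length()) {
            char c = src.charAt(pos++);
            if (c == '\\') pos++;                 // skip the escaped character
            else if (c == '(') depth++;
            else if (c == ')' && --depth == 0) return;
        }
    }

    private void skipWhitespace() {
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (c == '%') { // comment runs to end of line
                while (pos < src.length() && src.charAt(pos) != '\n' && src.charAt(pos) != '\r') pos++;
            } else if (c == 0 || Character.isWhitespace(c)) {
                pos++;
            } else {
                return;
            }
        }
    }

    private static boolean isRegular(char c) {
        return c != 0 && !Character.isWhitespace(c) && "()<>[]{}/%".indexOf(c) < 0;
    }

    public static void main(String[] args) {
        String raw = "<< /Type /Catalog /h.32tc8hbyo16k 12 0 R /h.32tc8hbyo16k (other) >>";
        System.out.println(findDuplicateKeys(raw));
    }
}
```

A scanner like this would run alongside the normal parse and only report findings, so nothing in the object model has to change. The sample values for `/h.32tc8hbyo16k` in `main` are placeholders, not the contents of the example PDF.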

Minothor · 2022-11-18T10:36:50Z

Matthias hit the nail on the head there: the dictionary-collapsing behaviour is a product of iText Core handling dictionaries in the standards-accepted manner and backing them with a HashMap accordingly.

Michaël and I have had a chat about how we could implement this in RUPS without having to change Core to be "looser" in its standards implementation.
Again, Matthias is right on the not-pretty part. Some solutions would be prettier than others, but more like a parade of breed-standard English bulldogs and pugs: in the eye of the beholder.
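The collapsing described above is ordinary map behaviour and can be reproduced with the Java standard library alone. The sketch below is not iText code: `DuplicateKeyCollapse` and `collapse` are hypothetical names, and the two values given for `/h.32tc8hbyo16k` are placeholders. It only shows why a map-backed dictionary keeps the last value and leaves no trace of the earlier one.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo of how a map-backed dictionary silently drops duplicate keys. */
public class DuplicateKeyCollapse {

    /** Inserts key/value pairs in file order, as a parser filling a map would. */
    static Map<String, String> collapse(String[][] entries) {
        Map<String, String> dict = new HashMap<>();
        for (String[] entry : entries) {
            dict.put(entry[0], entry[1]); // a later put() replaces any earlier value
        }
        return dict;
    }

    public static void main(String[] args) {
        String[][] raw = {
            {"/Type", "/Catalog"},
            {"/h.32tc8hbyo16k", "first value"},   // placeholder value
            {"/h.32tc8hbyo16k", "second value"},  // placeholder value
        };
        Map<String, String> dict = collapse(raw);
        System.out.println(dict.size());
        System.out.println(dict.get("/h.32tc8hbyo16k"));
    }
}
```

Three entries go in, two come out, and the surviving value is the second one. Once the map is built, a viewer that only sees the map has nothing left to display about the duplicate.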

petervwyatt · 2022-11-23T04:16:24Z

Is there any way that just the presence of such an issue might be flagged or indicated (messages in the log? pop-up dialog?) rather than the heavy load of having to support duplicate keys in the PDF DOM tree?

Even a non-specific message simply stating that the PDF contained one or more duplicate key names gives PDF forensic investigators something to start looking for, even if iText Core cannot report the object or key name (obviously more info is better but I understand the complexity issue).

MatthiasValvekens · 2022-11-23T08:28:25Z

The way I would handle that, if it were up to me, would be to implement a "recoverable error recorder" on PdfReader or PdfDocument in Core, and have iText write messages to that thing whenever it makes an explicit decision to ignore some situation that isn't allowed by the spec. RUPS could then query that. Still requires changes in Core, but at least it doesn't change current API behaviour...
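A rough sketch of what such a "recoverable error recorder" could look like follows. Everything in it is hypothetical: `Recorder`, `RecoverableError`, and `readDictionary` are illustrative names, not existing iText Core API, and the dictionary reader is a stand-in that takes key/value pairs directly. The point is only that the parser keeps its current last-value-wins result while leaving a note that a caller such as RUPS could query afterwards.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a recorder for spec violations the parser tolerates. */
public class RecoverableErrorRecorderSketch {

    /** One tolerated problem, tied to the object in which it was found. */
    static final class RecoverableError {
        final int objectNumber;
        final String message;

        RecoverableError(int objectNumber, String message) {
            this.objectNumber = objectNumber;
            this.message = message;
        }

        @Override
        public String toString() {
            return "object " + objectNumber + ": " + message;
        }
    }

    /** Collects errors during parsing; callers read them once parsing is done. */
    static final class Recorder {
        private final List<RecoverableError> errors = new ArrayList<>();

        void record(int objectNumber, String message) {
            errors.add(new RecoverableError(objectNumber, message));
        }

        List<RecoverableError> errors() {
            return Collections.unmodifiableList(errors);
        }
    }

    /** Stand-in for a dictionary parser: same result as before, plus a note per duplicate. */
    static Map<String, String> readDictionary(int objectNumber, String[][] entries, Recorder recorder) {
        Map<String, String> dict = new HashMap<>();
        for (String[] entry : entries) {
            String previous = dict.put(entry[0], entry[1]); // null if the key is new
            if (previous != null) {
                String kind = previous.equals(entry[1])
                        ? " (same value)"
                        : " (different values, last one kept)";
                recorder.record(objectNumber, "duplicate key " + entry[0] + kind);
            }
        }
        return dict;
    }

    public static void main(String[] args) {
        Recorder recorder = new Recorder();
        readDictionary(19, new String[][] {
            {"/h.32tc8hbyo16k", "first value"},   // placeholder value
            {"/h.32tc8hbyo16k", "second value"},  // placeholder value
        }, recorder);
        for (RecoverableError error : recorder.errors()) {
            System.out.println(error);
        }
    }
}
```

Because the recorder is passed in and only appended to, existing callers that ignore it would see no change in behaviour.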

Minothor · 2022-11-23T09:14:20Z

Michaël and I had a similar discussion to the same effect: we could expand the logging from Core and listen to the log events for situations in which Core has overridden invalid elements of the raw document.

Still requires changes to Core to make it more verbose in the logs but won't require functionality changes.
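As an illustration of the log-listening approach, the sketch below attaches a collecting handler to a logger while a parse runs and returns any warnings it saw. It uses `java.util.logging` purely to stay self-contained; that choice is an assumption for the example and is not a statement about which logging framework Core uses. The logger name `example.pdf.parser` and the warning text are likewise made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical sketch: capture parser warnings by listening to log events. */
public class ParserWarningListener {

    /** Handler that keeps the text of every record at WARNING level or above. */
    static final class CollectingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (record.getLevel().intValue() >= Level.WARNING.intValue()) {
                messages.add(record.getMessage());
            }
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    /** Runs the parse with a temporary handler attached and returns what it collected. */
    static List<String> captureWarnings(String loggerName, Runnable parse) {
        Logger logger = Logger.getLogger(loggerName);
        CollectingHandler handler = new CollectingHandler();
        logger.addHandler(handler);
        try {
            parse.run();
        } finally {
            logger.removeHandler(handler); // do not leak the handler after the parse
        }
        return handler.messages;
    }

    public static void main(String[] args) {
        String name = "example.pdf.parser"; // made-up logger name
        Logger parserLog = Logger.getLogger(name);
        List<String> warnings = captureWarnings(name, () ->
                // Stand-in for a parser reporting that it dropped a duplicate key.
                parserLog.warning("Duplicate key /h.32tc8hbyo16k in object 19"));
        System.out.println(warnings);
    }
}
```

A GUI could then show a non-specific notice whenever the returned list is not empty, which matches the "just flag the presence" request earlier in the thread.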

petervwyatt · 2022-11-30T09:06:43Z

Just FYI and shameless self-promotion: I have hacked my Arlington TestGrammar PoC (C++) to now detect and report duplicate keys (mostly reliably) when using the hacked copy of pdfium that is in that repo. It is unfortunately finding more PDFs than I expected - thankfully most (but not all!) have the same key value.

petervwyatt mentioned this issue Nov 24, 2022

Report duplicate keys in TestGrammar C++ PoC pdf-association/arlington-pdf-model#39

Closed
