RUPS gives no visual indication of duplicate dictionary keys #34
Comments
Without looking at the code: I'm almost completely sure that RUPS simply lets iText Core do (most of) the heavy lifting when it comes to parsing, so RUPS only ever sees iText representations of PDF objects. I'm not sure how easy this feature would be to add without making sweeping changes in Core. One way I could see this being implemented is by using a "secondary" parser specifically written to collect information about problems with object serialisation/representation. The code probably won't be pretty, though...
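For illustration only, here is a rough Java sketch of what such a "secondary" pass might look like. It assumes the raw text of a single object's dictionary is already available as a string; `DuplicateKeyScanner` and `findDuplicateKeys` are hypothetical names, not part of RUPS or iText Core. It does not decode `#xx` name escapes, does not treat NUL as whitespace, and does not handle stream data, so take it as a starting point rather than a real parser.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: scans raw dictionary text for repeated keys. */
final class DuplicateKeyScanner {

    /** One open container: a dictionary (keys != null) or an array (keys == null). */
    private static final class Frame {
        final Set<String> keys;
        boolean expectKey = true;
        Frame(boolean isDict) { keys = isDict ? new HashSet<>() : null; }
    }

    /** Returns names used more than once as a key within the same dictionary. */
    static Set<String> findDuplicateKeys(String src) {
        Set<String> duplicates = new LinkedHashSet<>();
        Deque<Frame> stack = new ArrayDeque<>();
        int i = 0, n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            Frame top = stack.peek();
            if (c == '%') {                            // comment: skip to end of line
                while (i < n && src.charAt(i) != '\n' && src.charAt(i) != '\r') i++;
            } else if (c == '(') {                     // literal string
                i = skipLiteralString(src, i);
                markValue(top);
            } else if (src.startsWith("<<", i)) {      // nested dictionary is a value
                markValue(top);
                stack.push(new Frame(true));
                i += 2;
            } else if (src.startsWith(">>", i)) {
                if (!stack.isEmpty()) stack.pop();
                i += 2;
            } else if (c == '<') {                     // hex string
                int end = src.indexOf('>', i);
                i = end < 0 ? n : end + 1;
                markValue(top);
            } else if (c == '[') {                     // array is a value
                markValue(top);
                stack.push(new Frame(false));
                i++;
            } else if (c == ']') {
                if (!stack.isEmpty()) stack.pop();
                i++;
            } else if (c == '/') {                     // name: a key or a value
                int end = i + 1;
                while (end < n && isRegular(src.charAt(end))) end++;
                if (top != null && top.keys != null) {
                    if (top.expectKey) {
                        String name = src.substring(i, end);
                        if (!top.keys.add(name)) duplicates.add(name);
                        top.expectKey = false;
                    } else {
                        top.expectKey = true;          // the name was the value
                    }
                }
                i = end;
            } else if (isRegular(c)) {                 // number, keyword, part of "n g R"
                while (i < n && isRegular(src.charAt(i))) i++;
                markValue(top);
            } else {
                i++;                                   // whitespace or stray delimiter
            }
        }
        return duplicates;
    }

    /** A non-name token starts (or continues) a value, so the next name is a key. */
    private static void markValue(Frame f) {
        if (f != null && f.keys != null) f.expectKey = true;
    }

    private static boolean isRegular(char c) {
        return !Character.isWhitespace(c) && "()<>[]{}/%".indexOf(c) < 0;
    }

    /** Skips a literal string starting at '(' and returns the index just past it. */
    private static int skipLiteralString(String s, int i) {
        int depth = 0;
        for (; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') i++;                        // skip the escaped character
            else if (c == '(') depth++;
            else if (c == ')' && --depth == 0) return i + 1;
        }
        return s.length();
    }
}
```

Called on an invented dictionary such as `<< /A 1 /B /A /A (x) >>`, this reports `/A` once: the second `/A` is the value of `/B`, and only the third is a repeated key.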
Matthias hit the nail on the head there. The dictionary collapsing is a product of iText Core representing dictionaries in a standards-compliant manner and backing them with a HashMap accordingly. Michaël and I have had a chat about how we could implement this in RUPS without having to change Core to be "looser" in its standards implementation.
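To make the collapsing concrete, here is a small Java sketch of how any map-backed dictionary behaves when the same key is inserted twice. It only uses `java.util.HashMap`; the `RecordingDictionary` class is a hypothetical illustration and says nothing about how Core's own dictionary class is actually written.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical map-backed dictionary that remembers which keys were overwritten. */
final class RecordingDictionary {
    final Map<String, String> entries = new HashMap<>();
    final List<String> overwritten = new ArrayList<>();

    void put(String key, String value) {
        // A plain Map.put would silently replace the earlier value;
        // checking first is the only way to notice the collision.
        if (entries.containsKey(key)) {
            overwritten.add(key);
        }
        entries.put(key, value);
    }
}
```

After `put("/K", "first")` and `put("/K", "second")`, the map holds a single entry with the value `second`, and `overwritten` contains `/K`. Without the extra check, nothing downstream can tell that a duplicate ever existed.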
Is there any way that just the presence of such an issue might be flagged or indicated (messages in the log? pop-up dialog?) rather than the heavy load of having to support duplicate keys in the PDF DOM tree? Even a non-specific message simply stating that the PDF contained one or more duplicate key names gives PDF forensic investigators something to start looking for, even if iText Core cannot report the object or key name (obviously more info is better but I understand the complexity issue).
The way I would handle that, if it were up to me, would be to implement a "recoverable error recorder" on
Michaël and I had a similar discussion to the same effect: we could expand the logging from Core and listen to the log events for situations in which Core has overridden invalid elements of the raw document. That still requires changes to Core to make its logs more verbose, but it won't require functional changes.
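As a rough sketch of the "listen to the log events" idea, the Java below collects warning-level messages from a logger so a UI could surface them later. It uses `java.util.logging` purely as a stand-in; which logging API Core exposes, the logger name, and the message text are all assumptions here, and `ParseWarningCollector` is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;

/** Hypothetical handler that keeps warning-level (and above) log messages. */
final class ParseWarningCollector extends Handler {
    private final List<String> warnings = new ArrayList<>();

    @Override
    public void publish(LogRecord record) {
        // Keep only WARNING and SEVERE; lower levels are ignored.
        if (record.getLevel().intValue() >= Level.WARNING.intValue()) {
            warnings.add(record.getMessage());
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered, so there is nothing to flush.
    }

    @Override
    public void close() {
        // No resources to release.
    }

    /** Messages collected so far, in arrival order. */
    List<String> warnings() {
        return warnings;
    }
}
```

A viewer could attach one collector to the parser's logger with `addHandler` before opening a document, then show a banner or log panel entry if `warnings()` is non-empty afterwards.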
Just FYI and shameless self-promotion: I have hacked my Arlington TestGrammar PoC (C++) to now detect and report duplicate keys (mostly reliably) when using the hacked copy of pdfium that is in that repo. It is unfortunately finding more PDFs than I expected - thankfully most (but not all!) have the same key value.
Yes, PDFs with duplicate keys are not allowed by ISO 32000, but it would be useful to have some form of visualization to know that this may be the cause of some other error, such as when the keys have different values. At the moment this is invisible...
Example PDF: https://assets.devoted.com/plan-documents/2022/DH-DisenrollmentForm-2022-ENG.pdf
Object 19 has 2 keys
/h.32tc8hbyo16k
but with different values - AFAICT RUPS chooses the last key in the dictionary.