Xml compliance against control characters #62
Merged
I checked all the relevant RFCs. XMPP is defined against XML 1.0 (Fifth Edition), and neither XML 1.0 nor XML 1.1 allows such a character in the body. (XML 1.1, which is barely used anyway, allows such characters only in very specific places, and cdata is not one of them.)
I can also confirm that our XML parser accepts the escape character inside the body. In the screenshot below, a body containing just a space is stripped, while the one with the escape character keeps the \e code in the XML cdata section:

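Not only that, but the XML specification says that only the following characters are allowed (the Char production of XML 1.0):
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
I can confirm that the XML parser also accepts 0x1c, 0x1d, etc., none of which are in that set.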
The fix is to check, for every character being processed, whether it is one of the disallowed ones. A few tests are added. Unfortunately this has an impact on performance: by my rough estimate, an average slowdown of around 10-15%.
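For reference, a minimal sketch of what such a per-character check can look like. This is not the code in this PR; the function names are hypothetical. It assumes UTF-8 input, where every byte below 0x80 is a complete ASCII character, so the disallowed C0 controls can be detected byte by byte. The byte-level check does not cover the other excluded code points (surrogates, U+FFFE, U+FFFF).

```cpp
#include <cstddef>
#include <cstdint>

// XML 1.0 Char production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
constexpr bool is_xml10_char(std::uint32_t cp) {
    return cp == 0x09 || cp == 0x0A || cp == 0x0D ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Byte-level shortcut for UTF-8 input: a C0 control other than
// tab, line feed or carriage return is never allowed.
constexpr bool is_disallowed_control_byte(unsigned char b) {
    return b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
}

// Returns the index of the first disallowed control byte,
// or len if the buffer contains none.
inline std::size_t find_disallowed_control(const char* data, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        if (is_disallowed_control_byte(static_cast<unsigned char>(data[i])))
            return i;
    }
    return len;
}
```

With this sketch, the escape character (0x1b) and 0x1c, 0x1d are rejected, while tab, line feed and carriage return pass. The extra comparison on every byte is the kind of work that accounts for the slowdown mentioned above.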
benchmarks
I have the XML payloads from this test generated on my computer; feel free to ask me to share them.
A possible solution would be to make it configurable, so that control character validation could be passed as a flag in the Erlang call. But at the C level that would probably make the code size grow considerably, since part of the optimisation in rapidxml is that code is generated separately for each branch, instead of having a single code path with many branching conditionals.
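To illustrate the trade-off, here is a rough sketch in the style of rapidxml, which takes its parse flags as a template parameter. The flag `validate_control_chars` and both functions are hypothetical and not part of rapidxml or of this PR.

```cpp
#include <cstddef>

// Hypothetical flag, not an existing rapidxml parse flag.
const int validate_control_chars = 0x1;

// Flags is a compile-time constant, so the flag test below is resolved
// by the compiler: an instantiation without the flag contains no check.
template <int Flags>
bool body_is_acceptable(const char* data, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(data[i]);
        if ((Flags & validate_control_chars) &&
            b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D)
            return false;
    }
    return true;
}

// A runtime switch has to select between instantiations, and each
// instantiation is a separate copy of the code in the binary.
inline bool body_is_acceptable_rt(const char* data, std::size_t len,
                                  bool validate) {
    return validate ? body_is_acceptable<validate_control_chars>(data, len)
                    : body_is_acceptable<0>(data, len);
}
```

The per-character cost disappears when the flag is off, but every flag combination exposed at runtime needs its own instantiation of the parsing functions, which is where the code size growth would come from.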