Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix, feat, docs: improve sax parser entity handling #3265

Merged
merged 14 commits into from
Jul 2, 2024

Conversation

flavorjones
Copy link
Member

@flavorjones flavorjones commented Jul 1, 2024

What problem is this PR intended to solve?

#1926 described an issue wherein the SAX parser was not correctly resolving and replacing internal entities, and was instead reporting an error for each entity reference. This PR includes a fix for that problem.

I've removed the unnecessary "SAX tuple" from the SAX implementation, replacing it with the _private struct member that libxml2 makes available. Then I set up the parser context structs so that we can use libxml2's standard SAX callbacks where they're useful (which is how I addressed the above issue).

This PR also introduces a new feature, a SAX handler callback Document#reference which allows callers to get entity-specific name and replacement text information (rather than relying on the Document#characters callback). This can be used to solve the original issue in #1926 with this code: searls/eiwa#11

The behavior of the SAX parser with respect to entities is complex enough that I wrote up a short doc in the XML::SAX::Document docstring with a table and explanation. I've also added warnings to remind users that #replace_entities is not safe to set when parsing untrusted documents.

In the Java implementation, I've fixed the #replace_entities option in the SAX parser context and set it to the proper default (false), fixing #614. I've also corrected the value of the URI argument to Document#start_element_namespace which was a blank string when it should have been nil.

I've added quite a bit of testing around the SAX parser's handling of entities.

I added and clarified quite a bit of documentation around SAX parsing generally. Exception messages have been clarified in a couple of places, and made consistent between the C and Java implementations. This should address questions asked in issues #1500 and #1284.

Finally, I cleaned up some of the C code that implements SAX parsing, naming functions more explicitly (and moving towards some kind of standard naming convention).

Closes #1926.
Closes #614.

Have you included adequate test coverage?

Yes!

Does this change affect the behavior of either the C or the Java implementations?

Yes, but the implementations are much more consistent with each other now.

also:
- improve exception messages
- and make the exception messages consistent between Java and C impls
- and introduce some assert_pattern testing to simplify complex tests
and some other small nonfunctional changes
which will allow us to use libxml2's default SAX handlers if we wish,
and simplifies some things.
On CRuby, this fixes the fact that the parser was registering errors
when encountering general (non-predefined) entities. Now these
entities are resolved properly and converted into `#characters`
callbacks. Fixes #1926.

On JRuby, the SAX parser now respects the `#replace_entities`
attribute, which was previously ignored AND defaulted incorrectly to
`true`. The default now matches CRuby -- `false` -- and the parser
behavior matches CRuby with respect to entities. Fixes #614.

This commit also includes some granular tests of how the sax parser
handles different entities under different circumstances, which should
be clarifying for user reports like #1284 and #1500 that expect
predefined entities and character references to be treated like parsed
entities (which they aren't).
The behavior here is relatively complex, being a function of entity
type and `#replace_entities` value, but there are sufficient tests and
both Java and C impls behave identically.

Related to the problem described at #1926.
Previously, the Java impl callback passed an empty string for URI.
and add test coverage for HTML4 SAX parser exception handling
I'm probably doing something wrong, but I'm timing out on trying to
fix it.
Update docs for:

- XML::SAX
- XML::SAX::Document
- XML::SAX::ParserContext
- XML::SAX::PushParser
- HTML4::SAX::PushParser
Before GNOME/libxml2@eddfbc38 the `#characters` event was duplicated
in this test case. I'm not going to work around it in Nokogiri.
@flavorjones flavorjones force-pushed the 1926-improve-sax-parser-entity-handling branch from 8ff5a10 to ca5480e Compare July 2, 2024 02:17
@flavorjones flavorjones merged commit 1486c81 into main Jul 2, 2024
131 of 132 checks passed
@flavorjones flavorjones deleted the 1926-improve-sax-parser-entity-handling branch July 2, 2024 02:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant