-
-
Notifications
You must be signed in to change notification settings - Fork 901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix, feat, docs: improve sax parser entity handling #3265
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
flavorjones
force-pushed
the
1926-improve-sax-parser-entity-handling
branch
from
July 1, 2024 21:04
828ca24
to
5a5279a
Compare
This was referenced Jul 1, 2024
flavorjones
force-pushed
the
1926-improve-sax-parser-entity-handling
branch
2 times, most recently
from
July 1, 2024 21:23
3a192e3
to
42a9a6c
Compare
flavorjones
force-pushed
the
1926-improve-sax-parser-entity-handling
branch
from
July 2, 2024 02:11
ad0064b
to
8ff5a10
Compare
also: - improve exception messages - and make the exception messages consistent between Java and C impls - and introduce some assert_pattern testing to simplify complex tests
and some other small nonfunctional changes
which will allow us to use libxml2's default SAX handlers if we wish, and simplifies some things.
On CRuby, this fixes the fact that the parser was registering errors when encountering general (non-predefined) entities. Now these entities are resolved properly and converted into `#characters` callbacks. Fixes #1926. On JRuby, the SAX parser now respects the `#replace_entities` attribute, which was previously ignored AND defaulted incorrectly to `true`. The default now matches CRuby -- `false` -- and the parser behavior matches CRuby with respect to entities. Fixes #614. This commit also includes some granular tests of how the sax parser handles different entities under different circumstances, which should be clarifying for user reports like #1284 and #1500 that expect predefined entities and character references to be treated like parsed entities (which they aren't).
The behavior here is relatively complex, being a function of entity type and `#replace_entities` value, but there are sufficient tests and both Java and C impls behave identically. Related to the problem described at #1926.
Previously, the Java impl callback passed an empty string for URI.
and add test coverage for HTML4 SAX parser exception handling
I'm probably doing something wrong, but I'm timing out on trying to fix it.
Should have been part of 46c1e10
Update docs for: - XML::SAX - XML::SAX::Document - XML::SAX::ParserContext - XML::SAX::PushParser - HTML4::SAX::PushParser
Before GNOME/libxml2@eddfbc38 the `#characters` event was duplicated in this test case. I'm not going to work around it in Nokogiri.
flavorjones
force-pushed
the
1926-improve-sax-parser-entity-handling
branch
from
July 2, 2024 02:17
8ff5a10
to
ca5480e
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What problem is this PR intended to solve?
#1926 described an issue wherein the SAX parser was not correctly resolving and replacing internal entities, and was instead reporting an error for each entity reference. This PR includes a fix for that problem.
I've removed the unnecessary "SAX tuple" from the SAX implementation, replacing it with the
_private
struct member that libxml2 makes available. Then I set up the parser context structs so that we can use libxml2's standard SAX callbacks where they're useful (which is how I addressed the above issue).This PR also introduces a new feature, a SAX handler callback
Document#reference
which allows callers to get entity-specific name and replacement text information (rather than relying on theDocument#characters
callback). This can be used to solve the original issue in #1926 with this code: searls/eiwa#11The behavior of the SAX parser with respect to entities is complex enough that I wrote up a short doc in the
XML::SAX::Document
docstring with a table and explanation. I've also added warnings to remind users that#replace_entities
is not safe to set when parsing untrusted documents.In the Java implementation, I've fixed the
#replace_entities
option in the SAX parser context and set it to the proper default (false
), fixing #614. I've also corrected the value of the URI argument toDocument#start_element_namespace
which was a blank string when it should have beennil
.I've added quite a bit of testing around the SAX parser's handling of entities.
I added and clarified quite a bit of documentation around SAX parsing generally. Exception messages have been clarified in a couple of places, and made consistent between the C and Java implementations. This should address questions asked in issues #1500 and #1284.
Finally, I cleaned up some of the C code that implements SAX parsing, naming functions more explicitly (and moving towards some kind of standard naming convention).
Closes #1926.
Closes #614.
Have you included adequate test coverage?
Yes!
Does this change affect the behavior of either the C or the Java implementations?
Yes, but the implementations are much more consistent with each other now.