-
-
Notifications
You must be signed in to change notification settings - Fork 7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correctly sanitize MathML out of post content #27107
Conversation
This pull request has merge conflicts that must be resolved before it can be merged. |
2fe347c
to
2abeddc
Compare
This pull request has resolved merge conflicts and is ready for review. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #27107 /- ##
==========================================
- Coverage 85.01% 82.70% -2.31%
==========================================
Files 1059 1031 -28
Lines 28277 28182 -95
Branches 4538 4560 22
==========================================
- Hits 24040 23309 -731
- Misses 3074 3786 712
Partials 1163 1087 -76 ☔ View full report in Codecov by Sentry. |
well, codecov used to be 100% 😓 |
This is a tiny PR that would be incredibly beneficial to have, with MathML having now become evergreen a little over a year ago. A few academia-focused instances have merged it on their own (mathstodon.xyz and types.pl, primarily) but it is not very useful without broad adoption as instances can't send HTML that isn't going to render in most clients. It's really the weight of Mastodon that will let forks leave MathJax & co behind. @ClearlyClaire or @renchap could I re-poke you to give it a look over? |
At first glance, the code seems fine, although I'd need to spend significantly more time reviewing it to make sure it doesn't introduce any issue. I agree that outright losing information from a remote post is problematic, however, I am not sure how useful displaying raw LaTeX is, as even the simple example in the PR's description is a pain to read. |
Some instances already "support" math by enabling mathjax. For these instances the desired way of showing math is for some text on the site to have latex (surrounded by delimiters) which gets rendered to math through JS. But it really isn't the greatest solution overall. If the mastodon project was OK with displaying math then I could easily change the PR to allow mathML instead of swapping it. Otherwise I would prefer this PR gets merged so that information is not lost. The plaintext description could also be preferred over LaTeX, but I don't really see this annotation being widespread. |
For a version that supports MathML, but tries to scrub all but the necessary information see my glitch-soc PR: glitch-soc#1933 |
Ah, apologies - I got this PR and the glitch-soc PR mixed up and thought this one was also "don't strip MathML tags". I think not stripping MathML tags (glitch-soc#1933) would be really really nice to merge and beneficial to have (and is tiny!). Supporting MathML means that instances like mathstodon.xyz can target MathML directly, which Firefox and Chrome will automatically render, so posts containing math can be viewable on every instance without client-side patches. If that PR looks good to you all, maybe @4e554c4c can pull it in here instead? |
Hi, I'm one of the admins of mathstodon.xyz, and the person who wrote the LaTeX-rendering code for it. I'm keen to have a better solution for posts containing math, and I think this is a good first step. I told @4e554c4c that we had money available to support developer time sorting this out, but never did anything about that! If there's anything we can do, let me know, either here or by DM to @[email protected]. |
Allowing MathML is yet another can of worms. One very significant issue is the sanitization library we use explicitly does not support it: https://github.com/rgrove/sanitize?tab=readme-ov-file#usage |
This was my assumption from the start. Could we maybe go ahead with this PR then, so that information isn't lost? Displaying mathematics in a nice way can be another issue (it can be as simple as bundling KaTeX). In the meantime, this would allow other fediverse implementations to move forward with the FEP. |
2abeddc
to
9eb30df
Compare
rebased on main. |
This is not specific to mathML, but I wonder about some general pattern of trying to somehow preserve "payloads" like this that are in a status that would allow clients to pull it out and render (ie, some math js in this scenario), while not introducing a security issue or weird visual issue in regular usage. Could we throw any extracted markup into an (escaped) |
9eb30df
to
224192d
Compare
so i thought about this, but it's kind of unclear where that 'data' would appear in the actual post. this could be done in a similar way to emojos but that really seems overly complicated in this (mastodon specific!!) case. |
rebased on main and ready for review. |
sorry messed up lints, so i need the workflows to be approved again 😅 |
Returning to this, am I right that not wanting to see raw LaTeX in posts is the only thing holding this back? Since allowing MathML through requires the sanitizer library to support it, would it be worthwhile spending some time looking at how much of MathML Core the sanitizer could support? I think the potential for sneaky business in MathML Core is significantly less than in SVG, for example. |
IMO this shouldn't be a concern, since it's the current experience for non-math-instances interacting with math instances. But I would also like to hear what's holding this PR back |
Bumping this one as well ... looks like there was some initial hesitation about full mathml support, vague agreement that a minimum of "can we make mathml look less bad?" was a good goal (which the current version of this PR seems to do), one round of (addressed) feedback, and then it kind of fizzled? Does this need more code review? Does it need product/UX sign off? Is it (for whatever reason) not wanted/desired? Happy to help with another code review round if that's useful. Separately -- while the code here does not NEEd a rebase (no conflicts), I suspect a rebase/push here will a) get current CI version matrix running, and b) remove some of the annotation noise in the diff ... (moot point if this isn't going to go forward, but maybe useful if so...) |
736d50c
to
5ac279f
Compare
For reference, this PR always has been about making math look "less bad". Actually displaying math is an entirely different issue. I understand that there is a bigger security concern that
However, sanitize says "If you add MathML or SVG elements to a custom element allowlist, you may create a security vulnerability in your application.". This PR specifically avoids adding any math tags to an allowlist, and after the transformer runs sanitize is free to remove all math elements. I believe this avoids the said security risk.
Done. Thanks for the push to update the PR :) |
Noted! I saw that was the conclusion, but wasn't totally sure on full commit/review/diff history.
Nice.
No prob, looks like all ran successfully on current matrix and the diff is now annotation-free for (maybe?) easier code review. |
This is outside of area of expertise, neither our focus for the foreseeable future. If anything, this should be done at https://github.com/rgrove/sanitize if they are willing to, but that itself would not guarantee inclusion of MathML Core in Mastodon, as we would still have to make a product design decision on this.
Full MathML support is out of question for the foreseeable future. This PR is about providing a fallback for MathML elements with annotations. It does so by replacing the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added two inline comments but seems ok to me otherwise.
There should be no security downside to this. There might be a very minor performance impact from the extra transformer. I am not sure how useful this is in practice, and maybe it would make more sense to use the alttext
attribute, which seems more fitting for our purpose. But also, I understand the reason is to favor the application/x-tex
annotation for client-side rendering using Javascript libraries.
The rendering from JS libraries would be an added bonus, but I'm not particularly concerned about this. I've looked far and wide, and haven't found very many resources which use I'm hesitant to consider I would be ok preferring the |
5ac279f
to
3c6fa2d
Compare
Summary: ------- This commit correctly sanitizes incoming MathML according to [FEP-dc88]. Instead of completely removing MathML nodes, it replaces them with their LaTeX or plain-text representation, so that the mathematics can be read in some form by mastodon users. Test Plan: ---------- ``` $ RAILS_ENV=test bundle exec rspec spec/lib/sanitize_config_spec.rb -f d Run options: exclude {:type=>#<Proc: ./spec/rails_helper.rb:79>} Randomized with seed 58854 Sanitize::Config ::MASTODON_STRICT sanitizes math blocks to LaTeX converts h1 to p strong removes "translate" attribute with invalid value removes a without href removes a without href and only keeps text content math sanitizer falls back to plaintext keeps ul prefers latex removes a with unparsable href keeps start and reversed attributes of ol removes a with unsupported scheme in href keeps a with translate="no" keeps a with href keeps a with supported scheme and no host does not re-interpret HTML when removing unsupported links sanitizes math to LaTeX Finished in 0.17323 seconds (files took 3.28 seconds to load) 16 examples, 0 failures Randomized with seed 58854 ``` observed 100% code coverage of `lib/sanitize_ext/sanitize_config.rb`. Ran mastodon locally, and fetched [reference post][nyancat] and observed that math was converted to plaintext form (and was not missing). [FEP-dc88]: https://codeberg.org/fediverse/fep/src/branch/main/fep/dc88/fep-dc88.md [tracking]: https://codeberg.org/fediverse/fep/issues/161 [socialhub]: https://socialhub.activitypub.rocks/t/fep-dc88-formatting-mathematics/3564 [nyancat]: https://nyan.network/notice/Aa4IvnBVHysWswRX1s Related Discussion: ------------------- Please see [FEP-dc88], the [FEP tracking issue][tracking] and [FEP forum discussion][socialhub] for more information. Fixes mastodon#26943
3c6fa2d
to
ad02502
Compare
Thank you very much, everyone, for getting this merged! |
Summary:
This commit correctly sanitizes incoming MathML according to FEP-dc88. Instead of completely removing MathML nodes, it replaces them with their LaTeX or plain-text representation, so that the mathematics can be read in some form by mastodon users.
Test Plan:
observed 100% code coverage of
lib/sanitize_ext/sanitize_config.rb
.Ran mastodon locally, and fetched reference post and observed that math was converted to plaintext form (and was not missing).
Related Discussion:
Please see FEP-dc88, the FEP tracking issue and FEP forum discussion for more information.
Fixes #26943