Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly sanitize MathML out of post content #27107

Merged
merged 3 commits into from
Nov 28, 2024

Conversation

4e554c4c
Copy link
Contributor

Summary:

This commit correctly sanitizes incoming MathML according to FEP-dc88. Instead of completely removing MathML nodes, it replaces them with their LaTeX or plain-text representation, so that the mathematics can be read in some form by mastodon users.

Test Plan:

$ RAILS_ENV=test bundle exec rspec spec/lib/sanitize_config_spec.rb -f d
Run options: exclude {:type=>#<Proc: ./spec/rails_helper.rb:79>}

Randomized with seed 58854

Sanitize::Config
  ::MASTODON_STRICT
    sanitizes math blocks to LaTeX
    converts h1 to p strong
    removes "translate" attribute with invalid value
    removes a without href
    removes a without href and only keeps text content
    math sanitizer falls back to plaintext
    keeps ul
    prefers latex
    removes a with unparsable href
    keeps start and reversed attributes of ol
    removes a with unsupported scheme in href
    keeps a with translate="no"
    keeps a with href
    keeps a with supported scheme and no host
    does not re-interpret HTML when removing unsupported links
    sanitizes math to LaTeX

Finished in 0.17323 seconds (files took 3.28 seconds to load)
16 examples, 0 failures

Randomized with seed 58854

observed 100% code coverage of lib/sanitize_ext/sanitize_config.rb.

Ran mastodon locally, and fetched reference post and observed that math was converted to plaintext form (and was not missing).
image

Related Discussion:

Please see FEP-dc88, the FEP tracking issue and FEP forum discussion for more information.

Fixes #26943

Copy link
Contributor

github-actions bot commented Nov 6, 2023

This pull request has merge conflicts that must be resolved before it can be merged.

Copy link
Contributor

This pull request has resolved merge conflicts and is ready for review.

Copy link

codecov bot commented Nov 29, 2023

Codecov Report

Attention: Patch coverage is 83.33333% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 82.70%. Comparing base (86500e3) to head (2abeddc).
Report is 609 commits behind head on main.

❗ Current head 2abeddc differs from pull request most recent head 9eb30df. Consider uploading reports for the commit 9eb30df to get more accurate results

Files Patch % Lines
lib/sanitize_ext/sanitize_config.rb 83.33% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #27107       /-   ##
==========================================
- Coverage   85.01%   82.70%   -2.31%     
==========================================
  Files        1059     1031      -28     
  Lines       28277    28182      -95     
  Branches     4538     4560       22     
==========================================
- Hits        24040    23309     -731     
- Misses       3074     3786      712     
  Partials     1163     1087      -76     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@4e554c4c
Copy link
Contributor Author

well, codecov used to be 100% 😓

@omentic
Copy link

omentic commented Apr 22, 2024

This is a tiny PR that would be incredibly beneficial to have, with MathML having now become evergreen a little over a year ago. A few academia-focused instances have merged it on their own (mathstodon.xyz and types.pl, primarily) but it is not very useful without broad adoption as instances can't send HTML that isn't going to render in most clients. It's really the weight of Mastodon that will let forks leave MathJax & co behind.

@ClearlyClaire or @renchap could I re-poke you to give it a look over?

@renchap renchap requested a review from a team April 23, 2024 06:20
@renchap renchap added this to the 4.3.0 milestone Apr 23, 2024
@ClearlyClaire
Copy link
Contributor

At first glance, the code seems fine, although I'd need to spend significantly more time reviewing it to make sure it doesn't introduce any issue.

I agree that outright losing information from a remote post is problematic, however, I am not sure how useful displaying raw LaTeX is, as even the simple example in the PR's description is a pain to read.

@4e554c4c
Copy link
Contributor Author

I agree that outright losing information from a remote post is problematic, however, I am not sure how useful displaying raw LaTeX is, as even the simple example in the PR's description is a pain to read.

Some instances already "support" math by enabling mathjax. For these instances the desired way of showing math is for some text on the site to have latex (surrounded by delimiters) which gets rendered to math through JS.
So this would integrate with that sort of math display automatically.

But it really isn't the greatest solution overall.

If the mastodon project was OK with displaying math then I could easily change the PR to allow mathML instead of swapping it. Otherwise I would prefer this PR gets merged so that information is not lost.

The plaintext description could also be preferred over LaTeX, but I don't really see this annotation being widespread.

@4e554c4c
Copy link
Contributor Author

For a version that supports MathML, but tries to scrub all but the necessary information see my glitch-soc PR: glitch-soc#1933
The goal of that PR was to get rid of any non-semantic information from the mathml (e.g. formatting info, that would clash with mastodon's style) which might be the right approach here too.

@omentic
Copy link

omentic commented Apr 24, 2024

I agree that outright losing information from a remote post is problematic, however, I am not sure how useful displaying raw LaTeX is, as even the simple example in the PR's description is a pain to read.

Ah, apologies - I got this PR and the glitch-soc PR mixed up and thought this one was also "don't strip MathML tags". I think not stripping MathML tags (glitch-soc#1933) would be really really nice to merge and beneficial to have (and is tiny!). Supporting MathML means that instances like mathstodon.xyz can target MathML directly, which Firefox and Chrome will automatically render, so posts containing math can be viewable on every instance without client-side patches.

If that PR looks good to you all, maybe @4e554c4c can pull it in here instead?

@christianp
Copy link

Hi, I'm one of the admins of mathstodon.xyz, and the person who wrote the LaTeX-rendering code for it. I'm keen to have a better solution for posts containing math, and I think this is a good first step.

I told @4e554c4c that we had money available to support developer time sorting this out, but never did anything about that! If there's anything we can do, let me know, either here or by DM to @[email protected].

@ClearlyClaire
Copy link
Contributor

Allowing MathML is yet another can of worms. One very significant issue is the sanitization library we use explicitly does not support it: https://github.com/rgrove/sanitize?tab=readme-ov-file#usage

@4e554c4c
Copy link
Contributor Author

Allowing MathML is yet another can of worms.

This was my assumption from the start.

Could we maybe go ahead with this PR then, so that information isn't lost? Displaying mathematics in a nice way can be another issue (it can be as simple as bundling KaTeX). In the meantime, this would allow other fediverse implementations to move forward with the FEP.

@4e554c4c
Copy link
Contributor Author

rebased on main.

@mjankowski
Copy link
Contributor

This is not specific to mathML, but I wonder about some general pattern of trying to somehow preserve "payloads" like this that are in a status that would allow clients to pull it out and render (ie, some math js in this scenario), while not introducing a security issue or weird visual issue in regular usage.

Could we throw any extracted markup into an (escaped) data attribute or something? This would allow native clients or JS hooks to look in a well-known place for that sort of payload.

@renchap renchap removed this from the 4.3.0 milestone Aug 14, 2024
@4e554c4c
Copy link
Contributor Author

This is not specific to mathML, but I wonder about some general pattern of trying to somehow preserve "payloads" like this that are in a status that would allow clients to pull it out and render (ie, some math js in this scenario), while not introducing a security issue or weird visual issue in regular usage.

Could we throw any extracted markup into an (escaped) data attribute or something? This would allow native clients or JS hooks to look in a well-known place for that sort of payload.

so i thought about this, but it's kind of unclear where that 'data' would appear in the actual post. this could be done in a similar way to emojos but that really seems overly complicated in this (mastodon specific!!) case.

@4e554c4c 4e554c4c requested a review from ClearlyClaire August 15, 2024 13:46
@4e554c4c
Copy link
Contributor Author

rebased on main and ready for review.

@4e554c4c
Copy link
Contributor Author

sorry messed up lints, so i need the workflows to be approved again 😅

@christianp
Copy link

Returning to this, am I right that not wanting to see raw LaTeX in posts is the only thing holding this back?

Since allowing MathML through requires the sanitizer library to support it, would it be worthwhile spending some time looking at how much of MathML Core the sanitizer could support? I think the potential for sneaky business in MathML Core is significantly less than in SVG, for example.

@4e554c4c
Copy link
Contributor Author

Returning to this, am I right that not wanting to see raw LaTeX in posts is the only thing holding this back?

IMO this shouldn't be a concern, since it's the current experience for non-math-instances interacting with math instances.

But I would also like to hear what's holding this PR back

@mjankowski
Copy link
Contributor

But I would also like to hear what's holding this PR back

Bumping this one as well ... looks like there was some initial hesitation about full mathml support, vague agreement that a minimum of "can we make mathml look less bad?" was a good goal (which the current version of this PR seems to do), one round of (addressed) feedback, and then it kind of fizzled?

Does this need more code review? Does it need product/UX sign off? Is it (for whatever reason) not wanted/desired? Happy to help with another code review round if that's useful.

Separately -- while the code here does not NEEd a rebase (no conflicts), I suspect a rebase/push here will a) get current CI version matrix running, and b) remove some of the annotation noise in the diff ... (moot point if this isn't going to go forward, but maybe useful if so...)

@4e554c4c
Copy link
Contributor Author

Bumping this one as well ... looks like there [...] vague agreement that a minimum of "can we make mathml look less bad?" was a good goal (which the current version of this PR seems to do), one round of (addressed) feedback, and then it kind of fizzled?

For reference, this PR always has been about making math look "less bad". Actually displaying math is an entirely different issue.

I understand that there is a bigger security concern that

... the sanitization library we use explicitly does not support it: https://github.com/rgrove/sanitize?tab=readme-ov-file#usage

However, sanitize says "If you add MathML or SVG elements to a custom element allowlist, you may create a security vulnerability in your application.".

This PR specifically avoids adding any math tags to an allowlist, and after the transformer runs sanitize is free to remove all math elements. I believe this avoids the said security risk.

Separately -- while the code here does not NEEd a rebase (no conflicts), I suspect a rebase/push here will a) get current CI version matrix running, and b) remove some of the annotation noise in the diff ... (moot point if this isn't going to go forward, but maybe useful if so...)

Done. Thanks for the push to update the PR :)

@mjankowski
Copy link
Contributor

For reference, this PR always has been about making math look "less bad". Actually displaying math is an entirely different issue.

Noted! I saw that was the conclusion, but wasn't totally sure on full commit/review/diff history.

This PR specifically avoids adding any math tags to an allowlist, and after the transformer runs sanitize is free to remove all math elements. I believe this avoids the said security risk.

Nice.

Done. Thanks for the push to update the PR :)

No prob, looks like all ran successfully on current matrix and the diff is now annotation-free for (maybe?) easier code review.

@4e554c4c 4e554c4c changed the title Sanitize MathML in post content Correctly sanitize MathML out of post content Nov 25, 2024
@ClearlyClaire
Copy link
Contributor

Since allowing MathML through requires the sanitizer library to support it, would it be worthwhile spending some time looking at how much of MathML Core the sanitizer could support? I think the potential for sneaky business in MathML Core is significantly less than in SVG, for example.

This is outside of area of expertise, neither our focus for the foreseeable future. If anything, this should be done at https://github.com/rgrove/sanitize if they are willing to, but that itself would not guarantee inclusion of MathML Core in Mastodon, as we would still have to make a product design decision on this.

looks like there was some initial hesitation about full mathml support

Full MathML support is out of question for the foreseeable future. This PR is about providing a fallback for MathML elements with annotations. It does so by replacing the math element with a text representation of a application/x-tex or text/plain annotation, in that order. This seems fairly sensible, although of limited interest. It may make sense to consider the alttext attribute of the math element.

ClearlyClaire
ClearlyClaire previously approved these changes Nov 26, 2024
Copy link
Contributor

@ClearlyClaire ClearlyClaire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added two inline comments but seems ok to me otherwise.

There should be no security downside to this. There might be a very minor performance impact from the extra transformer. I am not sure how useful this is in practice, and maybe it would make more sense to use the alttext attribute, which seems more fitting for our purpose. But also, I understand the reason is to favor the application/x-tex annotation for client-side rendering using Javascript libraries.

lib/sanitize_ext/sanitize_config.rb Outdated Show resolved Hide resolved
lib/sanitize_ext/sanitize_config.rb Outdated Show resolved Hide resolved
@4e554c4c
Copy link
Contributor Author

4e554c4c commented Nov 26, 2024

maybe it would make more sense to use the alttext attribute, which seems more fitting for our purpose. But also, I understand the reason is to favor the application/x-tex annotation for client-side rendering using Javascript libraries.

The rendering from JS libraries would be an added bonus, but I'm not particularly concerned about this. I've looked far and wide, and haven't found very many resources which use alttext for ordinary equations (display or inline math). I've read through a few accessibility guidelines for math on the web, and most of them tend to suggest mathml as being the accessible option, and don't mention alt text.

I'm hesitant to consider alttext, since if tools don't use it it'll likely stay a dead code path here.

I would be ok preferring the text/plain annotation over the application/x-tex one though.

4e554c4c and others added 3 commits November 26, 2024 22:04
Summary:
-------
This commit correctly sanitizes incoming MathML according to [FEP-dc88].
Instead of completely removing MathML nodes, it replaces them with their
LaTeX or plain-text representation, so that the mathematics can be read
in some form by mastodon users.

Test Plan:
----------
```
$ RAILS_ENV=test bundle exec rspec spec/lib/sanitize_config_spec.rb -f d
Run options: exclude {:type=>#<Proc: ./spec/rails_helper.rb:79>}

Randomized with seed 58854

Sanitize::Config
  ::MASTODON_STRICT
    sanitizes math blocks to LaTeX
    converts h1 to p strong
    removes "translate" attribute with invalid value
    removes a without href
    removes a without href and only keeps text content
    math sanitizer falls back to plaintext
    keeps ul
    prefers latex
    removes a with unparsable href
    keeps start and reversed attributes of ol
    removes a with unsupported scheme in href
    keeps a with translate="no"
    keeps a with href
    keeps a with supported scheme and no host
    does not re-interpret HTML when removing unsupported links
    sanitizes math to LaTeX

Finished in 0.17323 seconds (files took 3.28 seconds to load)
16 examples, 0 failures

Randomized with seed 58854

```

observed 100% code coverage of `lib/sanitize_ext/sanitize_config.rb`.

Ran mastodon locally, and fetched [reference post][nyancat] and observed
that math was converted to plaintext form (and was not missing).

[FEP-dc88]: https://codeberg.org/fediverse/fep/src/branch/main/fep/dc88/fep-dc88.md
[tracking]: https://codeberg.org/fediverse/fep/issues/161
[socialhub]: https://socialhub.activitypub.rocks/t/fep-dc88-formatting-mathematics/3564
[nyancat]: https://nyan.network/notice/Aa4IvnBVHysWswRX1s

Related Discussion:
-------------------

Please see [FEP-dc88], the [FEP tracking issue][tracking] and
[FEP forum discussion][socialhub] for more information.

Fixes mastodon#26943
@ClearlyClaire ClearlyClaire added this pull request to the merge queue Nov 28, 2024
Merged via the queue into mastodon:main with commit 7f4858b Nov 28, 2024
27 checks passed
@christianp
Copy link

Thank you very much, everyone, for getting this merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement FEP-dc88: Formatting Mathematics
6 participants