Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prompt Regex Escaping Confusion #86

Closed
s-weigand opened this issue Jul 12, 2020 · 4 comments · Fixed by #88
Closed

Prompt Regex Escaping Confusion #86

s-weigand opened this issue Jul 12, 2020 · 4 comments · Fixed by #88

Comments

@s-weigand
Copy link
Contributor

s-weigand commented Jul 12, 2020

First of all thanks for the new Prompt RegEx feature ❤️
I did just try it out and after some fiddeling around with the escaping, it works like a charm.

Expected behaviour

First of all I think in some edge cases it might make a difference to mention that the RegEx you type in your config.py is a JS RegEx and not a python one.
So that said IMHO it should work similar to this:

copybutton_prompt_text = r"<valid_js_reg_ex> "
copybutton_prompt_is_regexp = True

🌈 🦄

What I did

In my case the prompt I wanted to get rid of was of the form

r"In \[\d*\]: |\.\.\.: "

which is the jupyter console style and pretty much the same as the iPython style from the docs

"\\[\\d*\\]: |\\.\\.\\.: "

So it should be straight forward

"In \\[\\d*\\]: |\\.\\.\\.: "

as in the docs and I would be done?

But it didn"t work, so I did some digging and ended up with:

"In \\[\\d*\\]: |\\.\\.\\.: |\\$ "

What internally happens

In my config.py, I set copybutton_prompt_text

copybutton_prompt_text = r"In \[\d*\]: |\.\.\.: "

which is the same as:

copybutton_prompt_text = "In \\[\\d*\\]: |\\.\\.\\.: "
>>> r"\\" == "\\\\"
True

Sphinx now replaces the context variable copybutton_prompt_text in copybutton.js_t and generates copybutton.js.
But when we look at that string which should have just been passed as is, it is missing backslashes.

"In \[\d*\]: |\.\.\.: |\$ "

Well this looks like a totaly valid JS RegEx, so this is fine?
And it would be if written like this (JS just wants keep people on their toes, so why not a different character for RegEx strings?):

/In \[\d*\]: |\.\.\.: |\$ /

Now the string gets passed to the RegExp constuctor, which expects strings to be escaped like python, if you don"t use raw strings.
But it doesn"t match.

> "In [1]: bar".match(new RegExp("^(In \[\d*\]: |\.\.\.: )(.*)"))
null

Why is this going wrong and why didn"t the tests catch this?

TLDR

IMHO this all comes down ro JS RegExp "helping" you in some cases when it can interprete what you might have ment, which gives you false security (false positiv match in the tests).

Details

Disclaimer I"m in no way a JS expert with insight into the internals, I just found this due to WTF moments + trail and error.

> let proper_regex = new RegExp("^(>>> |\\$ |In \\[\\d*\\]: |\\[\\d*\\]: |\\.\\.\\.: )(.*)")
undefined
> let false_positive_regex = new RegExp("^(>>> |\\$ |In \[\d*\]: |\[\d*\]: |\.\.\.: )(.*)")
undefined
>"$ bar".match(proper_regex)
(3) ["$ bar", "$ ", "bar", index: 0, input: "$ bar", groups: undefined]
> "[1]: bar".match(proper_regex)
(3) ["[1]: bar", "[1]: ", "bar", index: 0, input: "[1]: bar", groups: undefined]
> "[1]: bar".match(false_positive_regex)
(3) ["[1]: bar", "[1]: ", "bar", index: 0, input: "[1]: bar", groups: undefined]
> "...: bar".match(proper_regex)
(3) ["...: bar", "...: ", "bar", index: 0, input: "...: bar", groups: undefined]
> "...: bar".match(false_positive_regex)
(3) ["...: bar", "...: ", "bar", index: 0, input: "...: bar", groups: undefined]

So far so good, all testcases work fine, as we would expect since they pass.

> "In [1]: bar".match(proper_regex)
(3) ["In [1]: bar", "In [1]: ", "bar", index: 0, input: "In [1]: bar", groups: undefined]
> "In [1]: bar".match(false_positive_regex)
null

The only explaination I got for this behaviour, is that JS wants to "help" the user by guessing it"s intention, like the famous adding thingy:

> "5" + 1
"51"

but at some points it is like "I don"t get it", and the pattern that worked before fails.

Possible fix

Use repr in add_to_context

If this line was changed to:

      {"copybutton_prompt_text": r"{}".format(repr(config.copybutton_prompt_text))[1: -1]}

I know it looks ugly, but would add extra backslashes for escaping (maybe someone else has a more elegant solution).
I"m not sure how much escapeRegExp would need to be adjusted to those changes.

I didn"t have luck with backslash replacing in JS:

> "In \[\d*\]: |\.\.\.: ".replace(/\\/, "\\\\");
"In [d*]: |...: "
> "In \[\d*\]: |\.\.\.: "
"In [d*]: |...: "

So in conclusion:
Python ❤️
JS 😠

P.S.: Wish we had the sphinx-copybutton on github, for the JS blobs I posted 😢

@choldgraf
Copy link
Member

hmmmm - a couple quick thoughts:

  • Maybe @chrisjsewell can advise on whether the regexp behavior you"re describing looks like a JS thing, or maybe it"s a Jinja thing? If it"s a Jinja thing then I think we can try to design around this in other ways.
  • @s-weigand would you be interested in making a PR to the docs with some "sample" regexes that will match very common patterns? E.g., the In[*] and Out[*] pattern? That could be helpful in saving people time in the future!

@s-weigand
Copy link
Contributor Author

I can make a PR at the weekend, maybe we even get the escaping thingy resolved by then. 😄

What definitely is a JS thing is that two different regexs match the same string.

Ha I just found tha jinja2 has the format filter builtin.

so changing:

return formatCopyText(target.innerText, "{{ copybutton_prompt_text }}", {{ copybutton_prompt_is_regexp | lower }}, {{ copybutton_only_copy_prompt_lines | lower }}, {{ copybutton_remove_prompts | lower }})

to:

  return formatCopyText(target.innerText, {{ "{!r}".format(copybutton_prompt_text) }}, {{ copybutton_prompt_is_regexp | lower }}, {{ copybutton_only_copy_prompt_lines | lower }},  {{ copybutton_remove_prompts | lower }})

Also does the trick, looks nicer and should be more reliable than my repr solution.
If this looks fine to you I can make the PR for docs and the template.

@chrisjsewell
Copy link
Member

chrisjsewell commented Jul 13, 2020

Ha I just found tha jinja2 has the format filter builtin.

Ah cool didn"t notice that, yep that looks like it should improve the issue thanks @s-weigand, all them backslashes are a right pain

Hopefully, this will fix my issue here: #84 (comment)

@choldgraf
Copy link
Member

nice catch @s-weigand !

chrisjsewell added a commit that referenced this issue Jul 24, 2020
This fixes the way that the `copybutton_prompt_text` value is injected into `copybutton.js_t` (as discussed in #86).
The raw string formatting means that backslashes are now propagated correctly and removes the need for "double escaping".
Users will need to update their `copybutton_prompt_text` string accordingly.

Co-authored-by: Chris Sewell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants