Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Version 0.13.0 throws an Error::BacktrackLimitExceeded #138

Open
jcold opened this issue Jun 7, 2024 · 7 comments
Open

Version 0.13.0 throws an Error::BacktrackLimitExceeded #138

jcold opened this issue Jun 7, 2024 · 7 comments

Comments

@jcold
Copy link

jcold commented Jun 7, 2024

The following code snippet works fine in v0.12.0, but in v0.13.0 it throws an Error::BacktrackLimitExceeded. The regex crates have been checked and they are fine. Any assistance would be appreciated.

error message: Error executing regex: Max limit for backtracking count exceeded

#[test]
fn fancy() -> anyhow::Result<()> {
    let re = fancy_regex::Regex::new(r"\b\$?t\('(. ?)'\)")?;
    for m in re.captures_iter("content...") {
        if let Some(v) = m?.get(1) {
            println!("{}", v.as_str().to_string());
        }
    }
    Ok(())
}
@keith-hall
Copy link
Contributor

hmm, I tried to reproduce this with the toy example and wasn't able to:

cargo run --example toy -- run '\b\$?t\('\''(. ?)'\''\)' 'content...'
    Finished dev [unoptimized   debuginfo] target(s) in 0.03s
     Running `target/debug/examples/toy run '\b\$?t\('\''(. ?)'\''\)' content...`
no match

@Guillermogsjc
Copy link

Guillermogsjc commented Sep 9, 2024

Hi folks,

I'm also encountering this issue with complex, user-defined regex patterns (bad practices, horrible pattern), such as:

(a\w?).*?(b\w?).*?(c\w?).*?(d\w?).*?(e\w?).*?(f\w?).*?(g\w?).*?(h\w?).*?(i\w?).*?(j\w*).*?(k\w?).*?(l\w?)

The error RuntimeError(BacktrackLimitExceeded) frequently occurs when using fancy_regex==0.13.0, compared to 0.12.0.

I’m aware that the backtracking limit can be increased here: https://docs.rs/fancy-regex/latest/fancy_regex/struct.RegexBuilder.html#method.backtrack_limit, but the question is: why does this error happen much more often in this version than in the previous one? It seems like the internals are consuming significantly more resources or states.

I suspect this is related to the switch from the regex crate to the lower-level regex-automata. Perhaps something is missing in how greedy particles like .* are being handled.

https://github.com/fancy-regex/fancy-regex/blob/main/CHANGELOG.md#0130---2023-12-22

Switch from the regex crate to regex-automata and regex-syntax (lower-level APIs) to simplify internals (#121)

@keith-hall
Copy link
Contributor

Thanks for reporting with an additional example. I unfortunately have very little free time these days to investigate. I wonder if the different toy examples give different output in the version of fancy-regex before the regex-automata change compared to after, and whether that would give any clues as to the cause...

@robinst
Copy link
Contributor

robinst commented Oct 12, 2024

I had a look at the first example by @jcold. The difference seems to be caused by the \b for that one:

$ git switch --detach 0.12.0
$ cargo run --example toy -- trace '\b\$?t\('\''(. ?)'\''\)' 'content...'
    Finished dev [unoptimized   debuginfo] target(s) in 0.85s
    Running `target/debug/examples/toy trace '\b\$?t\('\''(. ?)'\''\)' content...`
pos	instruction
0	0 Delegate { inner: Regex("^\\b\\$?t\\('(. ?)'\\)"), inner1: Some(Regex("^(?s:.)\\b\\$?t\\('(. ?)'\\)")), start_group: 0, end_group: 1 }
fail

So 0.12.0 delegates the entire expression. Whereas 0.13.0 doesn't delegate the WordBoundary:

$ git switch --detach 0.13.0
$ cargo run --example toy -- trace '\b\$?t\('\''(. ?)'\''\)' 'content...'
    Finished dev [unoptimized   debuginfo] target(s) in 0.05s
     Running `target/debug/examples/toy trace '\b\$?t\('\''(. ?)'\''\)' content...`
pos	instruction
0	0 Assertion(WordBoundary)
0	1 Delegate ...
...

@ruihe774 Thoughts about this? I would probably go back to delegating word boundary matches again.

@ruihe774
Copy link
Contributor

I had a look at the first example by @jcold. The difference seems to be caused by the \b for that one:

$ git switch --detach 0.12.0
$ cargo run --example toy -- trace '\b\$?t\('\''(. ?)'\''\)' 'content...'
    Finished dev [unoptimized   debuginfo] target(s) in 0.85s
    Running `target/debug/examples/toy trace '\b\$?t\('\''(. ?)'\''\)' content...`
pos	instruction
0	0 Delegate { inner: Regex("^\\b\\$?t\\('(. ?)'\\)"), inner1: Some(Regex("^(?s:.)\\b\\$?t\\('(. ?)'\\)")), start_group: 0, end_group: 1 }
fail

So 0.12.0 delegates the entire expression. Whereas 0.13.0 doesn't delegate the WordBoundary:

$ git switch --detach 0.13.0
$ cargo run --example toy -- trace '\b\$?t\('\''(. ?)'\''\)' 'content...'
    Finished dev [unoptimized   debuginfo] target(s) in 0.05s
     Running `target/debug/examples/toy trace '\b\$?t\('\''(. ?)'\''\)' content...`
pos	instruction
0	0 Assertion(WordBoundary)
0	1 Delegate ...
...

@ruihe774 Thoughts about this? I would probably go back to delegating word boundary matches again.

Let me look into this.

@ruihe774
Copy link
Contributor

I'd suggest to increase the backtrack limit.

If we delegate the word boundary assertions to Regex, it will switch to use PikeVM, which is also a backtrack-based regex virtual machine and is slower than the default DFA/NFA implementation.

The total times of backtracking does not change. It is just that in previous implementation we delegated some backtracking to the underlying PikeVM of Regex, and we did not measure the times of backtracking in PikeVM. In current implementation, the backtracking is moved from PikeVM to the VM of fancy-regex, hence the measured times of backtracking increase.

There is no performance regression or memory usage regression.

@Guillermogsjc
Copy link

Guillermogsjc commented Oct 21, 2024

The total times of backtracking does not change. It is just that in previous implementation we delegated some backtracking to the underlying PikeVM of Regex, and we did not measure the times of backtracking in PikeVM. In current implementation, the backtracking is moved from PikeVM to the VM of fancy-regex, hence the measured times of backtracking increase.

Thats it, perfect, thanks for the explanation . Maybe warn on changelog to expect a need on increasing this limit that can be hitten more easily (as it counts more backtracks that were done under the hoods before) after upgrading to 0.13.0 from <0.13.0. It would reduce future noise on issues about this subject.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants