-
Notifications
You must be signed in to change notification settings - Fork 32
Feature: Suggest patterns
As of v0.0.20, there was code to suggest dictionaries, it runs as a pass after all unrecognized words have been collected and then sees which words match which lists.
It made sense to be able to suggest patterns to a similar effect as check-spelling suggesting dictionaries.
There are a couple of ways to consider patterns:
This could be run after applying current patterns but before checking words against the dictionary.
This might scale poorly (each line of each file cross each pattern).
I'm experimenting with this in the v0.0.21 release.
After finding a word, in order to report the word's position in the unmasked line, the original line is rescanned. It would be possible to apply a mask at this point.
If a pattern is bad, it could introduce unrecognized words that aren't naturally present. - Suggestions should be curated, so this shouldn't be a real problem.
This turned out to be a bit of a problem which is partially addressed in v0.0.22 with the introduction of token-is-substring
.
This is an example of a pattern that is great once the corpus has been heavily filtered, but not great when applied too eagerly:
# version suffix <word>v#
[Vv]\d (?:\b|(?=[a-zA-Z_]))
This isn't a bad pattern, but it isn't useful as is:
# set arguments
\bset\s -[abefimouxE] \b
It would match:
set -e
, but set
is in the dictionary, and e
is too short to be checked. To be a useful pattern, it would need to be something more like:
# set arguments
\bset(?:\s -[abefimouxE]{1,2})*\s -[abefimouxE]{3,}(?:\s -[abefimouxE] )*
Prior to v0.0.22, applied pattern content was internally masked using =
. This unfortunately meant that a suggested pattern that looked for a =
might match on an active pattern even though the underlying content didn't have a =
and thus the pattern shouldn't apply.
patterns.txt
:
# Ignore any text between inline back-ticks
`(.{2,}?)`
candidate.patterns
:
>[-a-zA-Z=;:/0-9 ] =</
sdd.md
:
<a name="bufferSendIn">`bufferSendIn`</a>
The above would trigger this erroneous suggestion:
`Line` matches candidate pattern `>[-a-zA-Z=;:/0-9 ] =</` (candidate-pattern)
The way pattern suggestions are calculated has been improved such that these erroneous suggestions should not happen.