Search: customize behavior with hooks #4980

Open
4 tasks done
flynneva opened this issue Feb 2, 2023 · 8 comments
Labels
change request Issue requests a new feature or improvement

Comments

@flynneva

flynneva commented Feb 2, 2023

Context

Since v9.0.0, customizing search tokenization has been limited to modifying the separator setting in the config (support for custom search transform functions was removed).

Also since v9.0.0, we can choose from three lunr.js pipeline functions (stemmer, stopWordFilter, and trimmer) to modify the search pipeline.

Description

Instead of limiting us to those three options (stemmer, stopWordFilter, and trimmer), we should be able to specify our own pipeline function (ideally the same way we do for emojis).

So something like this is what I had in mind:

search:
  separator: '\s'
  pipeline:
    - stemmer
    - trimmer
    - stopWordFilter
    - !!python/name:my_cool_package.search.my_cool_pipeline_function

Not sure if that is allowed or if we would be required to write it in ts or js, but you get the idea.

Related links

Use Cases

  • customizing search pipelines to fit different use cases
  • customizing tokenization for different languages
  • etc.


@squidfunk
Owner

squidfunk commented Feb 3, 2023

Thanks for suggesting. Could you please explain your desired outcome? "Customizing to fit different use cases" is too broad to be actionable. Also, customizing the search tokenizer for different languages can be implemented by using different separators for two different sites. In the discussion you linked, you mentioned:

For example on the mkdocs-material site if you search for test::search I'd expect it to return results about the two tokens (test and search) assuming the regex separator was applied to the query.

This was reported and fixed in #4884 and released in 9.0.7. The functionality for search transformation was removed because the current approach was not general enough and did not scale well. I'm happy to add a new way, an extension or transformation hook, that allows intercepting the query (and maybe the results) and altering them before they are returned, but we need to collect some use cases before tackling that.

squidfunk added the "change request" and "needs input" labels Feb 3, 2023
@flynneva
Author

flynneva commented Feb 6, 2023

Could you please explain your desired outcome?

@squidfunk sure: be able to modify the lunr.js pipeline to fit other use cases not covered by the default search pipeline - like this one: Two stage tokenization to add full strings to the index.

customizing the search tokenizer for different languages can be implemented by using different separators for two different sites

@squidfunk I don't think the current separator approach lets you index both the tokens separated by spaces and the tokens produced by a separator regex (two-stage tokenization), right?
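
A rough sketch of the two-stage tokenization idea, written as a standalone TypeScript function rather than against the Lunr.js API. The name `twoStageTokenize` and the example separator are assumptions for illustration only. The first stage keeps each whitespace-delimited string, the second stage splits each of those strings again with the separator regex.

```typescript
// Hypothetical helper, not part of Material for MkDocs or Lunr.js.
// Stage 1: split on whitespace and keep each full string.
// Stage 2: split each full string again with the configured separator.
function twoStageTokenize(text: string, separator: RegExp): string[] {
  const tokens: string[] = [];

  // Lowercase, skip empty strings, and avoid duplicate entries.
  const add = (raw: string): void => {
    const token = raw.toLowerCase();
    if (token.length > 0 && tokens.indexOf(token) === -1) {
      tokens.push(token);
    }
  };

  for (const full of text.split(/\s+/)) {
    add(full);
    for (const part of full.split(separator)) {
      add(part);
    }
  }
  return tokens;
}

// Example separator (an assumption): split on "::", ".", "-" and "_".
console.log(twoStageTokenize("see test::search docs", /::|[.\-_]/));
// ["see", "test::search", "test", "search", "docs"]
```

With this shape, a query for either the full string or one of its parts would find a match in the token list.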

I'm happy to add a new way, an extension or transformation hook, that allows intercepting the query (and maybe the results)

@squidfunk I'm confused by this statement. This is the "old way" of doing it. The "new way" would be to take advantage of the features that lunr.js provides (pipelines)...right? Not sure of the effort involved to enable us to customize our own pipelines, but that is what this issue is asking for.

@squidfunk
Owner

I'm confused by this statement. This is the "old way" of doing it. The "new way" would be to take advantage of the features that lunr.js provides (pipelines)...right? Not sure of the effort involved to enable us to customize our own pipelines, but that is what this issue is asking for.

I may have phrased that badly. Yes, it's the old way, but I was talking about rethinking that process, as it only allowed for query transformation and nothing else. Transformation is now done as part of the worker, so maybe we could provide hooks into different parts of the search index, possibly exposing one hook to alter Lunr.js before it starts indexing documents. This would effectively allow you to implement your own pipeline functions, with which you should be able to achieve what you were aiming for when we talked about the two-stage tokenization approach.
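
To make the hook idea more concrete, here is a purely hypothetical sketch of what such an interface could look like. None of these names (`SearchHooks`, `beforeIndex`, `transformQuery`, `IndexBuilder`) exist in Material for MkDocs or Lunr.js; they are assumptions used only to illustrate one hook that runs before indexing and one that rewrites the query.

```typescript
// Entirely hypothetical hook shape. `IndexBuilder` is a stand-in for
// whatever object the worker might expose before indexing starts.
interface IndexBuilder {
  pipeline: Array<(token: string) => string | string[] | undefined>;
}

interface SearchHooks {
  beforeIndex?: (builder: IndexBuilder) => void;
  transformQuery?: (query: string) => string;
}

// Called once by the (hypothetical) worker before documents are indexed.
function setupIndex(hooks: SearchHooks): IndexBuilder {
  const builder: IndexBuilder = { pipeline: [] };
  if (hooks.beforeIndex) hooks.beforeIndex(builder); // user code runs here
  return builder;
}

// Called by the (hypothetical) worker for every incoming query.
function prepareQuery(hooks: SearchHooks, query: string): string {
  return hooks.transformQuery ? hooks.transformQuery(query) : query;
}

// Example user configuration.
const exampleHooks: SearchHooks = {
  beforeIndex: (b) => {
    b.pipeline.push((token) => token.toLowerCase());
  },
  transformQuery: (q) => q.trim(),
};

console.log(setupIndex(exampleHooks).pipeline.length); // 1
console.log(prepareQuery(exampleHooks, "  foo ")); // "foo"
```

Keeping both hooks inside the worker would match the goal of a self-contained worker, since the application never needs to know they exist.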

@squidfunk
Owner

squidfunk commented Feb 7, 2023

To expand on that: we moved transformation into the worker, so the worker is completely self-contained, i.e., defines all behavior. This makes integration of third-party search solutions simpler, as the application itself will apply no processing to the query before sending it to the worker. Before, query transformation was done in the application, then sent to the worker.

All of this made the current approach unfeasible, since it involves defining a function in the global scope that is called by the application if defined. We need a new approach for search transformation / extension, but before I started working on that I wanted to verify that this is still something that is needed 😊 We'll add it back shortly. If you have other ideas that we should consider and requirements we need to fulfill, please share them here. So far we collected:

  • Transform the query before searching (e.g. fooBar -> foobar foo bar)
  • Register custom pipeline functions for expanding or filtering tokens before indexing
  • Add a "Hello World" guide on how to write a custom pipeline function
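
As an illustration of the first item, a minimal sketch of a query transform that turns `fooBar` into `foobar foo bar`. The function name `expandCamelCase` is made up for this example, and the code works on plain strings only; it deliberately ignores Lunr.js query operators such as field references or boosts.

```typescript
// Hypothetical query transform, independent of any search library.
// Each whitespace-separated term is kept in lowercase and, if it contains
// camelCase boundaries, its parts are appended as extra terms.
function expandCamelCase(query: string): string {
  const out: string[] = [];
  for (const term of query.split(/\s+/)) {
    if (term.length === 0) continue;
    out.push(term.toLowerCase());
    // Insert a space at each lower-to-upper boundary, then split.
    const parts = term.replace(/([a-z0-9])([A-Z])/g, "$1 $2").split(" ");
    if (parts.length > 1) {
      for (const part of parts) out.push(part.toLowerCase());
    }
  }
  return out.join(" ");
}

console.log(expandCamelCase("fooBar")); // "foobar foo bar"
console.log(expandCamelCase("plain")); // "plain"
```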

squidfunk changed the title from "Modifiable search pipelines with custom pipeline function" to "Add customization hooks for search to alter behavior" Feb 7, 2023
squidfunk removed the "needs input" label Feb 7, 2023
@flynneva
Author

flynneva commented Feb 8, 2023

Transform the query before searching (e.g. fooBar -> foobar foo bar)

@squidfunk so according to the lunr.js docs I think pipelines do this as well, no?

From the lunr.js docs:

lunr.Pipelines maintain an ordered list of functions to be applied to all tokens in documents entering the search index and queries being ran against the index.
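
A simplified model of what "an ordered list of functions applied to all tokens" means in practice. Plain strings stand in for `lunr.Token` objects here, and `runPipeline`, `lowercase` and `dropShort` are names invented for this sketch, so it should not be read as the Lunr.js implementation. Each function may keep or replace a token, expand it into several tokens, or drop it.

```typescript
// Simplified model of an ordered token pipeline; not the Lunr.js code.
type PipelineFn = (token: string) => string | string[] | undefined;

function runPipeline(fns: PipelineFn[], input: string[]): string[] {
  let tokens = input;
  for (const fn of fns) {
    const next: string[] = [];
    for (const token of tokens) {
      const result = fn(token);
      if (result === undefined || result === "") continue; // token dropped
      if (Array.isArray(result)) {
        for (const r of result) next.push(r); // token expanded
      } else {
        next.push(result); // token kept or replaced
      }
    }
    tokens = next; // the next function sees the output of this one
  }
  return tokens;
}

// Two illustrative stages.
const lowercase: PipelineFn = (t) => t.toLowerCase();
const dropShort: PipelineFn = (t) => (t.length < 2 ? undefined : t);

console.log(runPipeline([lowercase, dropShort], ["A", "Search", "Index"]));
// ["search", "index"]
```

If the same chain is applied to document tokens and to query terms, a token expansion added for indexing is applied to queries as well.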

So I think only point 2 and 3 would need to be implemented:

  • Register custom pipeline functions for expanding or filtering tokens before indexing
  • Add a "Hello World" guide on how to write a custom pipeline function

And for point 3, you might just be able to link to the Lunr.js docs, like you do for the pymdownx stuff.

@squidfunk
Owner

squidfunk commented Feb 8, 2023

I'm not sure whether pipelines allow changing the entirety of the syntax, that is, Lunr.js field references and operators for boosting, as well as inclusion and exclusion. I think pipelines only allow removing, replacing, expanding or adding tokens. Thus, I believe that in the following query, only the terms in brackets are moved through the pipeline:

 title:[fooBar]* [fooBar]^2

This would not allow splitting or replacing meta characters, or introducing additional prefix or suffix wildcards. However, more research is needed. If you wish to dig into this, it would be great to get some intel. Otherwise, I'll do that later.
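
The concern can be illustrated with a small sketch that separates a query clause into its operators and its term, and applies a transform to the term only. The clause grammar below is a deliberate simplification and an assumption of this example; Lunr.js has its own query parser, which this does not reproduce, and `parseClause` and `transformTerm` are made-up names.

```typescript
// Simplified clause model: [+|-][field:]term[*][^boost]
interface Clause {
  presence: string; // "+", "-" or ""
  field: string;    // e.g. "title", or "" when absent
  term: string;     // the only part handed to the transform
  wildcard: boolean;
  boost: string;    // e.g. "2", or "" when absent
}

function parseClause(raw: string): Clause | undefined {
  const m = /^([+-]?)(?:(\w+):)?([^*^\s]+)(\*?)(?:\^(\d+))?$/.exec(raw);
  if (m === null) return undefined;
  return {
    presence: m[1] || "",
    field: m[2] || "",
    term: m[3] || "",
    wildcard: m[4] === "*",
    boost: m[5] || "",
  };
}

// Rebuild the clause with only the term changed; operators stay untouched.
function transformTerm(raw: string, fn: (term: string) => string): string {
  const c = parseClause(raw);
  if (c === undefined) return raw; // leave anything unrecognized as-is
  return (
    c.presence +
    (c.field ? c.field + ":" : "") +
    fn(c.term) +
    (c.wildcard ? "*" : "") +
    (c.boost ? "^" + c.boost : "")
  );
}

const lower = (t: string): string => t.toLowerCase();
console.log(transformTerm("title:fooBar*", lower)); // "title:foobar*"
console.log(transformTerm("fooBar^2", lower)); // "foobar^2"
```

A transform at this level cannot add a wildcard or change a field reference, which is the limitation described above.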

squidfunk changed the title from "Add customization hooks for search to alter behavior" to "Search: customize behavior with hooks" Aug 10, 2023
@squidfunk
Owner

Please see the announcement in #6307.

squidfunk reopened this Nov 7, 2023
@squidfunk
Owner

I've reopened #6632 which specifically requests to make PascalCase searchable as PascalCase, pascalcase and case – a shortcoming of the current implementation that was reported several times. I'm confident that this will make it into the next iteration of search, as I was able to quickly throw together a prototype. I'm leaving this issue open, since we also want to allow users to easily change the behavior of search with custom hooks.
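
For illustration only, one way the PascalCase behavior could be approximated at indexing time. This is not the prototype mentioned above; `expandPascalCase` is a hypothetical name, and the approach (indexing the lowercased term plus every lowercased suffix that starts at an uppercase boundary) is an assumption of this sketch.

```typescript
// Hypothetical token expansion for indexing.
function expandPascalCase(token: string): string[] {
  const result: string[] = [token.toLowerCase()];
  for (let i = 1; i < token.length; i++) {
    const prev = token.charAt(i - 1);
    const curr = token.charAt(i);
    // A boundary is a lowercase letter or digit followed by an uppercase letter.
    const boundary = /[a-z0-9]/.test(prev) && /[A-Z]/.test(curr);
    if (boundary) {
      const suffix = token.slice(i).toLowerCase();
      if (result.indexOf(suffix) === -1) result.push(suffix);
    }
  }
  return result;
}

console.log(expandPascalCase("PascalCase")); // ["pascalcase", "case"]
```

Assuming queries are lowercased before matching, `PascalCase`, `pascalcase` and `case` would then all find the indexed term.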
