
feat: smart context length management #1268

Closed
tjbck opened this issue Mar 22, 2024 · 4 comments
Labels
core (core feature), enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@tjbck
Contributor

tjbck commented Mar 22, 2024

e.g. messages.length > 10, slice
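A minimal sketch of that idea in Python (the messages list and the threshold of 10 are just the example above, not actual frontend code):

    # Keep only the most recent messages once the history grows past 10.
    if len(messages) > 10:
        messages = messages[-10:]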

@tjbck tjbck added the enhancement, good first issue, help wanted, and core labels Mar 24, 2024
@lainedfles
Contributor

lainedfles commented Mar 26, 2024

Great idea. I think it would be beneficial to cache this litellm file anyway, as it contains useful information including max_tokens:
https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json

While it may not be useful for local Ollama models, the Ollama Modelfile syntax supports the num_ctx parameter, which can be queried via the API. A good strategy may be to use the litellm JSON data for external models like OpenAI, and to assume that every Ollama model's context length is the Ollama default of 2048 unless the Modelfile parameter for a given model says otherwise.

It's still beneficial to retain configurability for cases where you don't need the maximum context or where the information is absent.
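A rough sketch of that lookup, assuming the litellm JSON is fetched and cached once (the function name and the 2048 default for Ollama models follow the reasoning above, not existing code):

    import requests

    LITELLM_URL = (
        "https://raw.githubusercontent.com/BerriAI/litellm/main/"
        "model_prices_and_context_window.json"
    )
    OLLAMA_DEFAULT_CTX = 2048  # Ollama default unless a Modelfile sets num_ctx

    def get_context_length(model_name: str, litellm_data: dict) -> int:
        # Return max_tokens for a known external model, else the assumed Ollama default.
        entry = litellm_data.get(model_name)
        if entry and "max_tokens" in entry:
            return int(entry["max_tokens"])
        return OLLAMA_DEFAULT_CTX

    # Cache the file once (e.g. at startup), then reuse it for lookups.
    litellm_data = requests.get(LITELLM_URL, timeout=10).json()
    print(get_context_length("gpt-3.5-turbo", litellm_data))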

@VertexMachine

Let me add some more information to this. This is very much needed if you use APIs, and not only the OpenAI API but also others like OpenRouter or Infermatic. Some model endpoints simply fail when you exceed the context length (returning error 400), while others incur massive cost for the user (cost grows with context size, so truncation might be a good option in those cases). Unfortunately, the problem is that there is no standardized tokenization endpoint defined in the OpenAI-compatible API. OpenAI recommends using https://github.com/openai/tiktoken on the client side.
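For OpenAI models specifically, a minimal tiktoken count looks roughly like this (the model name is only an example):

    import tiktoken

    # Pick the encoding that matches the target OpenAI model.
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    prompt = "How many tokens is this prompt?"
    print(len(enc.encode(prompt)))  # token count of the prompt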

As a workaround I use AutoTokenizer from Transformers (from transformers import AutoTokenizer) to calculate token counts in my apps. This is the function I've written (feel free to incorporate it in your code):

    def get_token_count(self, prompt: str, raw: bool = False) -> int:
        """
        Get the token count of the given prompt.

        If raw is True, we don't count BOS and EOS tokens.
        """
        if self.tokenizer is None:
            raise ValueError("Tokenizer is not selected")

        if raw:
            # Skip special tokens (BOS/EOS are not counted).
            return len(self.tokenizer.encode(prompt, add_special_tokens=False))
        return len(self.tokenizer.encode(prompt))

I made it generic because sometimes I don't want BOS/EOS tokens counted (HF tokenizers add BOS by default, but not EOS).

There are a few issues here:

  • You have to create the appropriate tokenizer, and a lot of models out there use different ones (even fine-tunes can add or remove some tokens). The good news is that if you know the model name on HF, a tokenizer created from it, e.g. tokenizer = AutoTokenizer.from_pretrained("model_name", legacy=False), will download the appropriate files.
  • The other bad news is that, for example, to use Meta's official Llama 3 tokenizer from their repo, you have to agree to their terms on HF and wait a bit, so you should handle fallbacks or direct the user there as well.
  • You also have to know the model name, which is often different from the name used by the API endpoint. As a quick workaround I created this mapping for my apps:
TOKENIZER_MAP = [
    # TotalGPT/Infermatic.ai models
    ("Midnight-Miqu-70B-v1.5", "sophosympatheia/Midnight-Miqu-70B-v1.5"),
    ("CodeLlama-13b-Instruct-hf", "codellama/CodeLlama-13b-Instruct-hf"),
    ("MiquMaid-v3-70B", "NeverSleep/MiquMaid-v3-70B"),
    ("UNA-SimpleSmaug-34b-v1beta", "fblgit/UNA-SimpleSmaug-34b-v1beta"),
    ("L3-MS-Astoria-70b", "Steelskull/L3-MS-Astoria-70b"),
    ("Mixtral-8x7B-Instruct-v0.1", "mistralai/Mixtral-8x7B-Instruct-v0.1"),
    ("miquliz-120b-v2.0", "wolfram/miquliz-120b-v2.0"),
    ("Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss", "NeverSleep/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss"),
    ("Smaug-Llama-3-70B-Instruct", "abacusai/Smaug-Llama-3-70B-Instruct"),
    # OpenRouter models
    ("mistralai/mistral-7b-instruct:free", "mistralai/Mistral-7B-Instruct-v0.1"),
    ("alpindale/goliath-120b", "alpindale/goliath-120b"),
    ("sao10k/fimbulvetr-11b-v2", "Sao10K/Fimbulvetr-11B-v2"),
    ("cognitivecomputations/dolphin-mixtral-8x7b", "cognitivecomputations/dolphin-2.6-mixtral-8x7b"),
    ("cohere/command-r", "CohereForAI/c4ai-command-r-v01"),
    ("cohere/command-r-plus", "CohereForAI/c4ai-command-r-plus"),
    ("meta-llama/llama-3-70b-instruct", "Undi95/Meta-Llama-3-8B-hf"),
    ("neversleep/llama-3-lumimaid-70b", "NeverSleep/Llama-3-Lumimaid-70B-v0.1"),
    # Generic fallback guesses
    ("8x22B", "mistralai/Mixtral-8x22B-v0.1"),
    ("llama-3", "Undi95/Meta-Llama-3-8B-hf"),
    ("l3", "Undi95/Meta-Llama-3-8B-hf")
]

Also, feel free to use the above mapping as a starting point. As a last-resort fallback I simply use the gpt2 tokenizer: AutoTokenizer.from_pretrained("gpt2").
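A rough sketch of how the mapping and the fallback could be wired together (the substring matching and the function name are illustrative, not code from my apps):

    from transformers import AutoTokenizer

    def resolve_tokenizer(api_model_name: str):
        # Find the first map entry whose fragment appears in the API model name.
        for fragment, hf_repo in TOKENIZER_MAP:
            if fragment.lower() in api_model_name.lower():
                return AutoTokenizer.from_pretrained(hf_repo, legacy=False)
        # Last-resort fallback: rough counts from the gpt2 tokenizer.
        return AutoTokenizer.from_pretrained("gpt2")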

@tjbck
Contributor Author

tjbck commented Jun 19, 2024

The filter function from #3247 will resolve this. You can essentially write your own custom middleware and install it with Functions.
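A very rough sketch of what such a filter could look like (the inlet hook, Valves, and the signature here are assumptions about the Functions interface rather than the exact API, and the message threshold is illustrative):

    from pydantic import BaseModel

    class Filter:
        class Valves(BaseModel):
            # Illustrative setting: how many most-recent messages to keep.
            max_messages: int = 10

        def __init__(self):
            self.valves = self.Valves()

        def inlet(self, body: dict, __user__: dict = None) -> dict:
            # Clip the conversation history before it is sent to the model.
            messages = body.get("messages", [])
            if len(messages) > self.valves.max_messages:
                body["messages"] = messages[-self.valves.max_messages:]
            return body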

@tjbck
Contributor Author

tjbck commented Jun 30, 2024

https://openwebui.com/f/hub/context_clip_filter

Feedback wanted here!

@open-webui open-webui locked and limited conversation to collaborators Jun 30, 2024
@tjbck tjbck converted this issue into discussion #3537 Jun 30, 2024
@tjbck tjbck unpinned this issue Jul 18, 2024

This issue was moved to a discussion.

You can continue the conversation there.
