feat: smart context length management #1268
Comments
Great idea. I think it would be beneficial to cache this litellm file anyway, since it contains useful information including max_tokens. While it may not be useful for local Ollama models, the Ollama Modelfile syntax supports the num_ctx parameter, which can be queried via the API. A good strategy may be to leverage the litellm JSON data for external models like OpenAI, and presume that every Ollama model uses the Ollama default context length. It is still beneficial to retain configurability for cases where you don't require the maximum context, or where the information is absent.
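A minimal sketch of that lookup strategy, assuming the litellm metadata is the model_prices_and_context_window.json file from the BerriAI/litellm repository and that entries expose max_tokens / max_input_tokens fields (treat the URL, field names, and fallback value as assumptions):

```python
import json
import urllib.request

# Assumed location of litellm's model metadata; in practice this should be
# cached locally rather than fetched on every request.
LITELLM_URL = "https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json"

# Hypothetical fallback for models not listed (e.g. local Ollama models).
DEFAULT_CONTEXT_LENGTH = 2048


def load_litellm_metadata() -> dict:
    """Download and parse the litellm model metadata JSON."""
    with urllib.request.urlopen(LITELLM_URL) as resp:
        return json.loads(resp.read())


def get_context_length(model: str, metadata: dict, override: int | None = None) -> int:
    """Resolve a model's context length: user override > litellm data > default."""
    if override is not None:  # keep configurability, as suggested above
        return override
    entry = metadata.get(model, {})
    # litellm entries commonly carry "max_input_tokens" / "max_tokens" (assumed keys).
    return entry.get("max_input_tokens") or entry.get("max_tokens") or DEFAULT_CONTEXT_LENGTH
```

An Ollama-specific path could instead read num_ctx from the model's parameters via the API, with the same user override taking precedence.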
Let me add some more information to this. It is very much needed if you use APIs, and not only the OpenAI API but also others like OpenRouter or Infermatic. Some model endpoints simply fail when you exceed the context length (returning error 400), while others incur massive cost for the user (cost grows with context size, so truncation might be a good option in those cases). Unfortunately, there is no standardized tokenization endpoint defined in the OpenAI-compatible API. OpenAI recommends using https://github.com/openai/tiktoken on the client side. As a workaround I use AutoTokenizer from Transformers:

```python
def get_token_count(self, prompt: str, raw: bool = False) -> int:
    """
    Get the token count of the given prompt.
    If raw is True then we don't count BOS and EOS tokens.
    """
    if self.tokenizer is None:
        raise ValueError("Tokenizer is not selected")
    return len(self.tokenizer.encode(prompt, add_special_tokens=False)) if raw else len(self.tokenizer.encode(prompt))
```

I made it generic because sometimes I don't want the BOS/EOS tokens counted (HF tokenizers add BOS by default, but not EOS). There are a few issues here; for one, the model names reported by the API providers often don't match the Hugging Face repositories of their tokenizers, so a mapping is needed:
```python
TOKENIZER_MAP = [
# TotalGPT/Infermatic.ai models
("Midnight-Miqu-70B-v1.5", "sophosympatheia/Midnight-Miqu-70B-v1.5"),
("CodeLlama-13b-Instruct-hf", "codellama/CodeLlama-13b-Instruct-hf"),
("MiquMaid-v3-70B", "NeverSleep/MiquMaid-v3-70B"),
("UNA-SimpleSmaug-34b-v1beta", "fblgit/UNA-SimpleSmaug-34b-v1beta"),
("L3-MS-Astoria-70b", "Steelskull/L3-MS-Astoria-70b"),
("Mixtral-8x7B-Instruct-v0.1", "mistralai/Mixtral-8x7B-Instruct-v0.1"),
("miquliz-120b-v2.0", "wolfram/miquliz-120b-v2.0"),
("Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss", "NeverSleep/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss"),
("Smaug-Llama-3-70B-Instruct", "abacusai/Smaug-Llama-3-70B-Instruct"),
# OpenRouter models
("mistralai/mistral-7b-instruct:free", "mistralai/Mistral-7B-Instruct-v0.1"),
("alpindale/goliath-120b", "alpindale/goliath-120b"),
("sao10k/fimbulvetr-11b-v2", "Sao10K/Fimbulvetr-11B-v2"),
("cognitivecomputations/dolphin-mixtral-8x7b", "cognitivecomputations/dolphin-2.6-mixtral-8x7b"),
("cohere/command-r", "CohereForAI/c4ai-command-r-v01"),
("cohere/command-r-plus", "CohereForAI/c4ai-command-r-plus"),
("meta-llama/llama-3-70b-instruct", "Undi95/Meta-Llama-3-8B-hf"),
("neversleep/llama-3-lumimaid-70b", "NeverSleep/Llama-3-Lumimaid-70B-v0.1"),
# Generic fallback guesses
("8x22B", "mistralai/Mixtral-8x22B-v0.1"),
("llama-3", "Undi95/Meta-Llama-3-8B-hf"),
("l3", "Undi95/Meta-Llama-3-8B-hf")
]
```

Also, feel free to use the above mapping as a starting point. As a last-effort fallback I simply use the gpt2 tokenizer.
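A minimal sketch of how the above mapping could be wired up with transformers.AutoTokenizer, falling back to gpt2 when nothing matches; the resolve_tokenizer helper and its case-insensitive substring matching are illustrative assumptions, not the exact implementation:

```python
from transformers import AutoTokenizer


def resolve_tokenizer(model_name: str):
    """Pick a Hugging Face tokenizer for an API model name using TOKENIZER_MAP."""
    for pattern, hf_repo in TOKENIZER_MAP:
        # Earlier entries take precedence; the generic guesses sit at the end.
        if pattern.lower() in model_name.lower():
            return AutoTokenizer.from_pretrained(hf_repo)
    # Last-effort fallback: gpt2 gives a rough (not exact) token count.
    return AutoTokenizer.from_pretrained("gpt2")


# Example usage with a model name as reported by the API provider:
tokenizer = resolve_tokenizer("sao10k/fimbulvetr-11b-v2")
print(len(tokenizer.encode("Hello, world!")))
```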
The Filter function from #3247 will resolve this. You can essentially write your own custom middleware and install it with Functions.
See https://openwebui.com/f/hub/context_clip_filter. Feedback wanted here!
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
e.g. if messages.length > 10, slice the messages array down to the most recent ones.
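A rough sketch of that heuristic, assuming the cutoff of 10 from the comment above and that the system message should be preserved (both assumptions; in Open WebUI this logic would live inside a filter):

```python
MAX_MESSAGES = 10  # illustrative cutoff taken from the comment above


def clip_messages(messages: list[dict]) -> list[dict]:
    """Keep the system prompt (if any) plus the most recent chat messages."""
    if len(messages) <= MAX_MESSAGES:
        return messages
    system = [m for m in messages if m.get("role") == "system"]
    chat = [m for m in messages if m.get("role") != "system"]
    keep = max(MAX_MESSAGES - len(system), 1)
    return system + chat[-keep:]
```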