
Display rate limits information from OpenAI headers when returning 429 from their API #3743

Open
oskapt opened this issue Jul 9, 2024 · 4 comments

Comments

@oskapt

oskapt commented Jul 9, 2024

Is your feature request related to a problem? Please describe.
I'm getting a 429 error when using gpt-4o, but when I query the same endpoint with the same API key, the headers show that I'm not above my rate limits at all.

Describe the solution you'd like
If open-webui receives a 429 from OpenAI, include the header information that shows the current limits and when they reset:

Rate Limit Headers:
x-ratelimit-limit-requests: 500
x-ratelimit-limit-tokens: 30000
x-ratelimit-remaining-requests: 499
x-ratelimit-remaining-tokens: 29996
x-ratelimit-reset-requests: 120ms
x-ratelimit-reset-tokens: 8ms
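
For illustration, a rough sketch (not existing open-webui code; the values below are placeholders) of surfacing those headers when the upstream call comes back with a 429:

import requests

# Rough sketch: surface OpenAI's rate-limit headers instead of discarding them
# when the upstream request comes back with a 429. Placeholder values below.
api_key = "your-key-here"
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 1},
)

if resp.status_code == 429:
    limits = {k: v for k, v in resp.headers.items() if "ratelimit" in k.lower()}
    print(f"429 from OpenAI, rate limit state: {limits}")
    print(f"retry-after: {resp.headers.get('retry-after', 'not provided')}")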

Describe alternatives you've considered
Short of running my own program to extract this information, I don't have any alternative. I can't see what OWUI is sending or why it would trigger a rate limit on the OpenAI side of things. It's been returning a 429 for over an hour, so my only recourse is to go back to using the ChatGPT webui to finish my project.

Additional context
This almost feels like a bug of some kind because I should get the same error when I make a query directly with the same API key to the same endpoint. I don't. Since OWUI is returning the response code from OpenAI, it would be helpful if it logged any additional information about what it's sending and what the response headers contained. Here's the script I wrote to query the current limits:

import requests

# Set your OpenAI API key
api_key = 'your-key-here'
api_url = 'https://api.openai.com/v1/chat/completions'

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

# Minimal request (1 token of output) just to get the rate-limit headers back
data = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 1
}

response = requests.post(api_url, headers=headers, json=data)

if response.ok:
    print("Rate Limit Headers:")
    for key, value in response.headers.items():
        if 'ratelimit' in key.lower():
            print(f"{key}: {value}")
else:
    print(f"Failed to retrieve rate limits: {response.status_code} - {response.text}")
@cheahjs
Contributor

cheahjs commented Jul 9, 2024

You're likely hitting the 30,000 tokens-per-minute (TPM) rate limit, which returns a 429 if your input tokens plus max_tokens exceed the TPM limit.
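
As a rough way to see whether a single request is likely to trip that limit, you can estimate prompt tokens plus max_tokens (a sketch assuming the tiktoken package; the per-message overhead is only approximate):

import tiktoken

# Rough estimate of what a chat completion request counts against the TPM limit:
# prompt tokens plus the requested max_tokens. o200k_base is the encoding gpt-4o
# uses; the +4 per message approximates chat formatting overhead.
enc = tiktoken.get_encoding("o200k_base")

def estimate_request_tokens(messages, max_tokens):
    prompt = sum(len(enc.encode(m["content"])) + 4 for m in messages)
    return prompt + max_tokens

chat = [{"role": "user", "content": "Generate the script for section 6"}]
print(estimate_request_tokens(chat, max_tokens=4096))  # compare against 30000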

@oskapt
Author

oskapt commented Jul 9, 2024

I did some more testing, and I agree with this, but I'm not sure why it's happening. If I start a new chat in OWUI, I don't get the 429. Why should a long chat trigger it? Is it sending the entire context/conversation back to OpenAI with every request?

The exercise has been to create a script for a video. I had it do research with webpages I provided and then come up with an outline. We iterated on the outline for a bit, and then I had it write a script for each section of the outline. It got through section 5 before everything failed.

The only way I can think this would happen is if each request sent back everything from the requests up to that point, making each subsequent request longer until I pop over the 30k TPM limit on a single request. At the same time, that seems illogical because it would make API requests grow steadily larger over time within the same chat window.

The request that fails simply says, "Generate the script for section 6," although it also fails after that if I say "hello." It's like the whole chat session is busted.

What am I missing?

@cheahjs
Contributor

cheahjs commented Jul 9, 2024

The only way I can think this would happen is if each request sent back everything from the requests up to that point, making each subsequent request longer until I pop over the 30k TPM limit on a single request.

That's how LLMs work: you have to send the entire conversation every time you want the model to produce output.

Open WebUI currently does not truncate or summarise any previous messages. The context clip filter can be used to retain only the last n messages; another potential strategy that hasn't been implemented is to have an LLM summarise the past messages.
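
The clipping idea boils down to something like this (a sketch of the concept only, not Open WebUI's actual filter code):

# Keep any system prompt, drop everything but the last n chat messages
# before sending the request upstream. Purely illustrative.
def clip_context(messages, n=8):
    system = [m for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    return system + chat[-n:]

messages = [{"role": "system", "content": "You are a helpful assistant."}] + [
    {"role": "user", "content": f"message {i}"} for i in range(40)
]
print(len(clip_context(messages, n=8)))  # 9: system prompt + last 8 messages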

@oskapt
Author

oskapt commented Jul 9, 2024

Hrm...ok. I was not aware of that. You can close this request if you don't think it's useful to implement. Thanks for taking the time to explain things.
