
Display rate limits information from OpenAI headers when returning 429 from their API #3743

Open
oskapt opened this issue Jul 9, 2024 · 4 comments

Comments

@oskapt

oskapt commented Jul 9, 2024

Is your feature request related to a problem? Please describe.
I'm getting a 429 error when using gpt-4o, but when I query the same endpoint with the same API key, the headers show that I'm not above my rate limits at all.

Describe the solution you'd like
If open-webui receives a 429 from OpenAI, include the header information that shows the current limits and when they reset:

Rate Limit Headers:
x-ratelimit-limit-requests: 500
x-ratelimit-limit-tokens: 30000
x-ratelimit-remaining-requests: 499
x-ratelimit-remaining-tokens: 29996
x-ratelimit-reset-requests: 120ms
x-ratelimit-reset-tokens: 8ms
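
For illustration, a rough sketch (not existing open-webui code; the values below are placeholders) of surfacing those headers when the upstream call comes back with a 429:

import requests

# Rough sketch: surface OpenAI's rate-limit headers instead of discarding them
# when the upstream request comes back with a 429. Placeholder values below.
api_key = "your-key-here"
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 1},
)

if resp.status_code == 429:
    limits = {k: v for k, v in resp.headers.items() if "ratelimit" in k.lower()}
    print(f"429 from OpenAI, rate limit state: {limits}")
    print(f"retry-after: {resp.headers.get('retry-after', 'not provided')}")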

Describe alternatives you've considered
Short of running my own program to extract this information, I don't have any alternative. I can't see what OWUI is sending or why it would trigger a rate limit on the OpenAI side of things. It's been returning a 429 for over an hour, so my only recourse is to go back to using the ChatGPT webui to finish my project.

Additional context
This almost feels like a bug of some kind because I should get the same error when I make a query directly with the same API key to the same endpoint. I don't. Since OWUI is returning the response code from OpenAI, it would be helpful if it logged any additional information about what it's sending and what the response headers contained. Here's the script I wrote to query the current limits:

import requests

# Set your OpenAI API key
api_key = 'your-key-here'
api_url = 'https://api.openai.com/v1/chat/completions'

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

# Minimal request (1 token of output) just to get the rate-limit headers back
data = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 1
}

response = requests.post(api_url, headers=headers, json=data)

if response.ok:
    print("Rate Limit Headers:")
    for key, value in response.headers.items():
        if 'ratelimit' in key.lower():
            print(f"{key}: {value}")
else:
    print(f"Failed to retrieve rate limits: {response.status_code} - {response.text}")
@cheahjs
Contributor

cheahjs commented Jul 9, 2024

You're likely hitting the 30,000 tokens-per-minute (TPM) rate limit, which returns a 429 if your input tokens plus max_tokens exceed the TPM limit.
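
As a rough way to see whether a single request is likely to trip that limit, you can estimate prompt tokens plus max_tokens (a sketch assuming the tiktoken package; the per-message overhead is only approximate):

import tiktoken

# Rough estimate of what a chat completion request counts against the TPM limit:
# prompt tokens plus the requested max_tokens. o200k_base is the encoding gpt-4o
# uses; the +4 per message approximates chat formatting overhead.
enc = tiktoken.get_encoding("o200k_base")

def estimate_request_tokens(messages, max_tokens):
    prompt = sum(len(enc.encode(m["content"])) + 4 for m in messages)
    return prompt + max_tokens

chat = [{"role": "user", "content": "Generate the script for section 6"}]
print(estimate_request_tokens(chat, max_tokens=4096))  # compare against 30000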

@oskapt
Author

oskapt commented Jul 9, 2024

I did some more testing, and I agree with this, but I'm not sure why it's happening. If I start a new chat in OWUI, I don't get the 429. Why should a long chat trigger it? Is it sending the entire context/conversation back to OpenAI with every request?

The exercise has been to create a script for a video. I had it do research with webpages I provided and then come up with an outline. We iterated on the outline for a bit, and then I had it write a script for each section of the outline. It got through section 5 before everything failed.

The only way I can think this would happen is if each request sent back everything from the requests up to that point, making each subsequent request longer until I pop over the 30k TPM limit on a single request. At the same time, that seems illogical because it would make API requests grow steadily larger over time within the same chat window.

The request that fails simply says, "Generate the script for section 6," although it also fails after that if I say "hello." It's like the whole chat session is busted.

What am I missing?

@cheahjs
Contributor

cheahjs commented Jul 9, 2024

The only way I can think this would happen is if each request sent back everything from the requests up to that point, making each subsequent request longer until I pop over the 30k TPM limit on a single request.

That's how LLMs work: you have to send the entire conversation every time you want the model to produce output.

Open WebUI currently does not truncate or summarise any previous messages. The context clip filter can be used to retain only the last n messages; another potential strategy that hasn't been implemented is to have an LLM summarise the past messages.
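
The clipping idea boils down to something like this (a sketch of the concept only, not Open WebUI's actual filter code):

# Keep any system prompt, drop everything but the last n chat messages
# before sending the request upstream. Purely illustrative.
def clip_context(messages, n=8):
    system = [m for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    return system + chat[-n:]

messages = [{"role": "system", "content": "You are a helpful assistant."}] + [
    {"role": "user", "content": f"message {i}"} for i in range(40)
]
print(len(clip_context(messages, n=8)))  # 9: system prompt + last 8 messages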

@oskapt
Author

oskapt commented Jul 9, 2024

Hrm...ok. I was not aware of that. You can close this request if you don't think it's useful to implement. Thanks for taking the time to explain things.
