Elevated 429 responses from Thumbor on codfw starting 2024-08-14 00:00 UTC
Open, Needs Triage, Public

Description

I am encountering an increased number of user-facing 429 responses while trying to contribute on both the French and the Esperanto Wikisource.

The issue seems similar to the one encountered in T337649, with similarly increased amounts of 429 errors emitted by codfw for Thumbor jobs.

See this grafana chart: https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&refresh=1m&from=1723467291972&to=1723640091972&viewPanel=62 (warning: the scale is logarithmic).

Event Timeline

Aklapper renamed this task from Elevated 429 responses from Thumbor on codfw starting August 14th 00:00 UTC to Elevated 429 responses from Thumbor on codfw starting 2024-08-14 00:00 UTC. Aug 14 2024, 1:07 PM

I'm investigating this issue, but at first glance it does not appear to be the same issue as T337649. Restarting and releasing locks does not immediately drop the throttle rate, and a great deal of the elevated 429 errors are legitimate responses to aggressive scraping (of course, that doesn't mean _all_ are legitimate errors).

Is this issue currently still affecting wikisource users?

Yup. I'm still encountering 429s when contributing, and other contributors are reporting the same issue to me.

To better trace this issue, could I get a sample of failing URLs please?

Change #1063217 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] poolcounter: introduce allowlist to skip rate limit

https://gerrit.wikimedia.org/r/1063217

Change #1063228 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: add allowlist to thumbor to address internal rate limiting

https://gerrit.wikimedia.org/r/1063228

hnowlan changed the task status from Open to Stalled. Aug 23 2024, 11:32 AM

This appears to have dropped. Leaving open to get patches resolved at a later point.

Almost no new PDF file thumbnails can be generated from codfw: https://commons.wikimedia.org/wiki/Special:NewFiles?user=&mediatype[]=OFFICE&start=&end=&wpFormIdentifier=specialnewimages&limit=50&offset=
The situation is a lot better when switching to eqiad. But the thumbnails rendered in eqiad somehow aren't shared with codfw, so when switching back to codfw, the thumbnails are still not loading.

I've disabled a throttling feature that might be impacting this behaviour. I see that format-based throttling has stopped, so this should no longer be an issue.

found T376509 while investigating 429 for https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Feedback_form_Odia_Wikipedia_outreach.pdf/page1-180px-Feedback_form_Odia_Wikipedia_outreach.pdf.jpg

https://commons.wikimedia.org/wiki/File:Feedback_form_Odia_Wikipedia_outreach.pdf

  • 463px (embedded by default in description page) seems fine but I guess that's some kind of cache hit?
  • 180px (embedded in Special:ListFiles) is 429
  • other sizes linked from description page are ok.
  • other sizes I pulled out of thin air also don't work.

ahhhhh, now I found T372470#10113572.
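The per-size checks listed above could be repeated with a small script. This is only a minimal sketch, not anything used in this task: it assumes the `<N>px-` width token sits in the last path segment of the thumbnail URL, as in the URL quoted above, and the helper names (`with_width`, `probe`, `report`) and the User-Agent string are hypothetical. Note that a 200 may just be a cache hit, as suspected for the 463px size.

```python
import re
import urllib.error
import urllib.request
from typing import Dict, Iterable

# Thumbnail URL quoted in the comment above.
SAMPLE_URL = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/"
    "Feedback_form_Odia_Wikipedia_outreach.pdf/"
    "page1-180px-Feedback_form_Odia_Wikipedia_outreach.pdf.jpg"
)


def with_width(thumb_url: str, width: int) -> str:
    """Return thumb_url with the '<N>px-' token in its last segment replaced."""
    head, sep, last = thumb_url.rpartition("/")
    # The token is either at the start ("180px-Foo.jpg") or after a page
    # prefix ("page1-180px-Foo.pdf.jpg"); only the first match is replaced.
    new_last, count = re.subn(r"(^|-)\d+px-", rf"\g<1>{width}px-", last, count=1)
    if count != 1:
        raise ValueError(f"no '<N>px-' width token found in {last!r}")
    return head + sep + new_last


def probe(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request and return the HTTP status code (e.g. 200 or 429)."""
    request = urllib.request.Request(
        url,
        method="HEAD",
        # Hypothetical User-Agent; replace with one that identifies you.
        headers={"User-Agent": "thumb-429-probe/0.1 (example script)"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the status code is what we want here.
        return err.code


def report(thumb_url: str, widths: Iterable[int]) -> Dict[int, int]:
    """Map each requested width to the status code returned for it."""
    return {width: probe(with_width(thumb_url, width)) for width in widths}
```

Calling `report(SAMPLE_URL, [180, 463, 500])` would return one status code per width. Keep the list of widths short, since every uncached size triggers a new render.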

Change #1078043 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: disable expensive counter

https://gerrit.wikimedia.org/r/1078043

hnowlan claimed this task.

It appears that there was an underlying issue with a rate limiting service used by thumbor. I've applied a fix to work around this for the short term and that PDF now renders.

No longer a problem for me at Wikisource, so I think whatever hnowlan did fixed it.

I'm seeing recoveries on most of the linked images, but reopening this until we're sure this is resolved.

A related item for me to follow up on: these 429 errors are incorrect and should be 503s. Related: T175512, T353950

Just to explain the issue: a while ago, a rate-limiting feature that was known to be problematic was re-enabled in an emergency due to a harmful surge in traffic. It was left enabled, which caused this issue to recur. I've since disabled the feature, and we'll be removing it to prevent it from being erroneously triggered again. However, the fact that this required manual reporting and wasn't noticed on the SRE side isn't really acceptable. Next week I'll be working on adding per-format alerting, so that an increase in errors for a single format is caught before it can have a wide impact; that work will be tracked in T376538.
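To illustrate why per-format alerting helps, here is a minimal sketch of the underlying check. It is not the alerting that will be built in T376538, which would presumably live in the monitoring stack rather than in a script. The `Sample` record, the function name, and both thresholds are assumptions made up for this example.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Sample(NamedTuple):
    """One thumbnail request: file format and HTTP status (hypothetical record)."""
    fmt: str
    status: int


def formats_over_threshold(
    samples: Iterable[Sample],
    max_error_ratio: float = 0.05,  # assumed threshold, for illustration only
    min_requests: int = 20,         # ignore formats with too little traffic
) -> Dict[str, float]:
    """Return {format: error ratio} for formats whose error ratio is too high."""
    total: Counter = Counter()
    errors: Counter = Counter()
    for sample in samples:
        total[sample.fmt] += 1
        # Count throttling (429) and server-side errors (5xx) as failures.
        if sample.status == 429 or sample.status >= 500:
            errors[sample.fmt] += 1

    flagged: Dict[str, float] = {}
    for fmt, count in total.items():
        if count < min_requests:
            continue
        ratio = errors[fmt] / count
        if ratio > max_error_ratio:
            flagged[fmt] = ratio
    return flagged
```

With a few hundred healthy JPEG requests and a few dozen PDF requests that mostly fail, the overall error ratio can stay small while the PDF ratio alone is very high. That is the situation a single aggregate alert can miss.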

Can confirm recovery occurring at English Wikisource on files mentioned previously, elsewhere.

Change #1078043 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: disable expensive counter

https://gerrit.wikimedia.org/r/1078043