
Download to PDF: HTTP 500 error on some wikis for some users
Closed, Resolved · Public · Bug Report

Description

"Download to PDF" on en.wv (English Wikiversity) returns this error: {"name":"HTTPError","message":"500","status":500,"detail":"Internal Server Error"}

--- Client-side workaround ---

As many users have been referred here, a potential workaround some clients may use is:

  1. Install a virtual printer that creates PDF documents on your client
  2. Use your browser Print function
  3. Select the virtual printer from (1) above
  4. Use the virtual printer save function

Event Timeline

Koavf's link works for me.

Even when you click on "Download"? The *page* renders properly, but the *PDF* gives the HTTP error.

Yes. Here's the PDF it gave me.

@Jtneill: Thanks for reporting this. For future reference, please use the bug report form (linked from the top of the task creation page) to create a bug report, and fill in all the sections in the template instead of deleting them, to avoid followup questions for more information and examples. Thanks.

If this is about "Download as PDF" on https://en.wikiversity.org/wiki/Motivation_and_emotion/Book/2024/Dopamine_and_social_behaviour , then it works for me...

Aklapper renamed this task from "Download to PDF - Error on en.wv" to "Download to PDF: HTTP 500 error on en.wikiversity for some folks". Oct 4 2024, 6:31 AM

@Aklapper Thanks for testing and clarifying about bug reporting.

To confirm, I tested 3 browsers on a Windows 10 device and each gave an HTTP 500 error when clicking "Download" from https://en.wikiversity.org/w/index.php?title=Special:DownloadAsPdf&page=Motivation_and_emotion/Book/2024/Dopamine_and_social_behaviour&action=show-download-screen:

  • Google Chrome
  • Mozilla Firefox
  • Microsoft Edge

However, the PDF download works using Chrome on an Android device.

A reader reported a similar problem on en.wikipedia in ticket 2024100510006742

"
{"name": "HTTPError", "message":"500", "status":500, "detail":"Internal Server Error"}

I'm using macOS Sequoia 15.0 on my 2019 MacBook Pro. I get the same results using either Google Chrome browser, Safari browser or Arc browser.
"

I was unable to duplicate.

I tried to download the same article PDF on zhwiki through several VPN IP addresses. I found that I could download it from some regions (e.g. Germany) but not from others (e.g. Hong Kong). Even after I switched to using another browser, I still couldn't download the PDF.

I can share which IPs could or couldn't download the PDF privately if anyone needs that information. Perhaps data centers in some regions have the issue?

SCP-2000 renamed this task from "Download to PDF: HTTP 500 error on en.wikiversity for some folks" to "Download to PDF: HTTP 500 error on some wikis for some folks". Oct 8 2024, 8:38 AM

Users are continuing to report this problem:

Screenshot_2024-10-11_104037_Wiki_Unable2Download.png (341×1 px, 16 KB)

Xaosflux renamed this task from "Download to PDF: HTTP 500 error on some wikis for some folks" to "Download to PDF: HTTP 500 error on some wikis for some users". Oct 11 2024, 6:15 PM
TheDJ subscribed.

I also experience it for https://en.wikiversity.org/api/rest_v1/page/pdf/Motivation_and_emotion/Book/2024/Dopamine_and_social_behaviour

I don't see anything obvious in https://grafana.wikimedia.org/d/U4TuF-lMk/proton?orgId=1

The response headers list x-cache: cp3069 miss, cp3069 miss
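For anyone who wants to repeat this check, here is a minimal Python sketch that requests a PDF URL and reports the HTTP status together with the parsed x-cache header. The function names and the User-Agent string are my own and purely illustrative; it assumes the header has the "host result, host result" shape quoted above.

```python
import urllib.error
import urllib.request


def parse_x_cache(value):
    """Split an x-cache value such as 'cp3069 miss, cp3069 miss'
    into a list of (cache_host, result) pairs."""
    entries = []
    for part in value.split(","):
        fields = part.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1]))
    return entries


def probe(url, timeout=30):
    """Return (status, x_cache_entries) for a single GET request.

    HTTP error responses (such as the 500 in this task) are reported
    as a status code instead of being raised.
    """
    req = urllib.request.Request(
        url, headers={"User-Agent": "pdf-probe-sketch/0.1"}  # hypothetical UA
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            status, headers = resp.status, resp.headers
    except urllib.error.HTTPError as err:
        status, headers = err.code, err.headers
    return status, parse_x_cache(headers.get("x-cache", ""))
```

Calling `probe()` with the REST URL above from different networks would show whether the failing requests pass through the same cache host.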

Logstash has lots of spawn process errors:

Error: Failed to launch the browser process!
/usr/bin/chromium: 5: /etc/chromium.d/extensions: Cannot fork
/usr/bin/chromium: 121: Cannot fork
/usr/bin/chromium: 123: Cannot fork
[826540:826540:1015/085235.180776:FATAL:spawn_subprocess.cc(237)] posix_spawn /usr/lib/chromium/chrome_crashpad_handler: Resource temporarily unavailable (11)

https://logstash.wikimedia.org/goto/ff20ff6a1109dc85f8a0fcd7cc2d9c0b
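On Linux, error 11 is EAGAIN, which posix_spawn and fork return when a process or resource limit has been reached, so these lines would be consistent with the host running out of process slots. A rough Python sketch for counting such lines in an exported log follows; the marker strings are taken from the excerpt above, and the function names are illustrative only.

```python
# Substrings taken from the Logstash excerpt in this task.
SPAWN_FAILURE_MARKERS = (
    "Failed to launch the browser process",
    "Cannot fork",
    "Resource temporarily unavailable (11)",
)


def is_spawn_failure(line):
    """True if the log line carries one of the spawn-failure markers."""
    return any(marker in line for marker in SPAWN_FAILURE_MARKERS)


def count_spawn_failures(lines):
    """Count how many log lines look like browser spawn failures."""
    return sum(1 for line in lines if is_spawn_failure(line))
```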

Screenshot 2024-10-15 at 10.53.45.png (152×1 px, 21 KB)

This appears to be a rerun of T375521 - temporary fix last time was a roll restart, but there's clearly a deeper issue.

I suspect that we either don't close browser instances or somehow end up in a situation with stale ones. Given the level of traffic and support for the PDF service, would it be enough to just restart the service?

Users continue to report this problem, such as in otrs ticket 2024101810000205. I was also able to reproduce this issue myself.

> I suspect that we either don't close browser instances or somehow end up in a situation with stale ones. Given the level of traffic and support for the PDF service, would it be enough to just restart the service?

Works for me - the service can be restarted using the standard roll_restart pattern. We might see this come back soon, as this happened last month also.

Chromium is leaking processes, most likely leaving chromium_crashpads lying around after a failure:

root@wikikube-worker2070:/home/hnowlan# ps uax| grep chrome_crashpad | wc -l
115357
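The same count could be taken from a script, for example to feed a periodic check. This is only a sketch: the function names are made up for illustration, and it assumes a `ps` that accepts `-eo args=`, as procps on Linux does. Unlike the `grep` pipeline above, it does not count the grep process itself.

```python
import subprocess


def count_processes(ps_output, needle="chrome_crashpad"):
    """Count lines of ps output whose command line contains `needle`."""
    return sum(1 for line in ps_output.splitlines() if needle in line)


def count_live_processes(needle="chrome_crashpad"):
    """Run ps on the local host and count matching command lines."""
    result = subprocess.run(
        ["ps", "-eo", "args="],  # command lines only, no header row
        capture_output=True,
        text=True,
        check=True,
    )
    return count_processes(result.stdout, needle)
```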

Done, I will keep an eye on the logs.

For what it's worth, I also updated the dashboard at https://grafana-rw.wikimedia.org/d/U4TuF-lMk/proton?orgId=1 to allow querying both DCs at once as well as selectively, and fixed the per-container saturation panels that were broken. I also removed the nodejs garbage collection panels, as those metrics aren't being emitted anymore (support has been dropped at the service runner level IIRC).

Here are some people with similar experiences. A suggestion was to run with --single-process --disable-crashpad-for-testing (because crashpad disabling does not propagate to subprocesses)

Maybe this is related? https://github.com/GoogleChromeLabs/chrome-for-testing/issues/114

Errors are clearly spiking again

https://grafana.wikimedia.org/d/U4TuF-lMk/proton?orgId=1&from=1730305791822&to=1730910591822

IMG_2006.png (2×1 px, 515 KB)

(I’d open a separate ticket, but I’m on my phone)

TheDJ triaged this task as Unbreak Now! priority. Wed, Nov 6, 4:31 PM

Looks like the same crashpad flood issue again. The service needs a restart, and I think we should implement the flags @TheDJ has mentioned.

I ran a rolling restart in k8s. Regarding the chromium parameters, we already pass --single-process. I am taking a look at what the other parameter does in detail.

jijiki moved this task from Doing 😎 to serviceops-radar on the serviceops board.
jijiki edited projects, added serviceops-radar; removed serviceops.

Change #1088271 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] chromium-render: Add cli flag to avoid flooding with crashpad processes

https://gerrit.wikimedia.org/r/1088271

Hmm, there are still some failures occurring above the average.

Screenshot 2024-11-08 at 10.14.42.png (536×1 px, 208 KB)

The error rate is quickly increasing again:

Screenshot 2024-11-19 at 14.37.52.png (554×1 px, 207 KB)

Change #1088271 merged by jenkins-bot:

[operations/deployment-charts@master] chromium-render: Add cli flag to avoid flooding with crashpad processes

https://gerrit.wikimedia.org/r/1088271

I got this on enwiki w/ Chromium v126 (Falklands War)

image.png (768×1 px, 21 KB)

Reloading https://en.wikipedia.org/api/rest_v1/page/pdf/Falklands_War seemed to work.
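Since a reload sometimes succeeds, a client script could retry on 5xx responses. Below is a minimal Python sketch of that idea. The `fetch` callable, the attempt count and the delays are all assumptions for illustration, and retrying only works around the error on the client side without addressing the server-side cause.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it returns a non-5xx status or attempts run out.

    `fetch` is any zero-argument callable returning (status, body).
    The wait doubles after each failed attempt (delay, 2*delay, ...).
    """
    status, body = 0, b""
    for attempt in range(attempts):
        status, body = fetch()
        if status < 500:
            return status, body
        if attempt < attempts - 1:
            sleep(delay * (2 ** attempt))
    return status, body
```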

@ihurbain just deployed the crashpad flag flip patch and (at least for now) Proton looks happier.

There have not been significant 5xx errors for 7 days now. Calling this fixed until proven otherwise. Thanks, everyone!