
Request increased quota for collection-alt-renderer Cloud VPS project
Closed, Declined · Public

Description

Project Name: collection-alt-renderer
Type of quota increase requested: 1 TByte of shared disk storage, mountable into the project VMs
Reason: File cache to avoid CPU intensive recompilation of PDFs

Detailed Reason:

I am running the mediawiki2latex web service for converting wiki articles to downloadable formats.

https://mediawiki2latex.wmflabs.org/

The generated output files are intended for high-quality printing and thus contain high-resolution images (300 dpi).
This makes creating them computationally expensive. For single articles the rendering time is usually acceptable, but for collections of Wikipedia articles, such as those found in the Wikipedia "Book:" namespace, rendering a single book often takes a few hours.
It has been suggested that I cache the created files and make them available for instant download. I have started generating PDFs of all books in the Wikipedia "Book:" namespace on an old dual-core laptop (two at a time).
So far I have generated more than 1000 PDFs, occupying 100 GByte of disk space, after one month of compute time. Since there are about 6000 such books, at roughly 100 MByte per book the complete set (about 600 GByte) will safely fit into 1 TByte.
I would like to ask for 1 TByte of space that can be mounted into the mediawiki2latex VM, so that I can make the files available from the web service for instant download.
Alternatively I could host the files at home or at an external provider, but I doubt that this is actually an option.

It is also relevant to this request that the decision to disable the book creator on the English Wikipedia was taken less than one hour after this request was filed, and more than three months after the discussion on it had ended.

https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_176#Suppress_rendering_of_Template:Wikipedia_books

Event Timeline

Peachey88 updated the task description.

@dhunniger In our current OpenStack deployment, 1TB of durable instance storage is not an easy thing to hand out. We are part way through a major storage project that should make it more possible for us to fulfill requests like this in the future, but that project is about 6 months from being usable in that way.

Today we do have NFS based storage that is used in the tools project (Toolforge), a separate but similar NFS system that is used by the maps project as a tile cache, and a third NFS cluster which can be used by other Cloud VPS projects on an opt-in basis for durable storage. By 'durable' I mean life beyond a Cloud VPS instance and/or shared access from multiple Cloud VPS instances. Would a slice of this NFS storage space fulfill your project's needs?

I read through https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_176#Suppress_rendering_of_Template:Wikipedia_books and https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)/Archive_159#running_a_bot_to_upload_pdf_versions_of_Wikipedia_books_to_Wikipedia and I have to say that I ended up being more confused at the end than I thought I was at the start. Is your hope to create the 6000 PDF files and host them within your Cloud VPS project or is the ultimate goal to upload them to the Book namespace on enwiki? If the goal is the latter I am not completely sure I understand the need for 1TB of long term storage. Is it so that you do not have to fetch all of the images for each PDF again when making a second rendering with updated content? Is the main bottleneck in your PDF creation processes waiting for media to download?

Hi,

NFS storage is OK for me. If that is too hard on your side, I can easily host the storage myself, or with an external provider, which would only cost me about $10 per month. I am just afraid that some "no third parties" requirement for Horizon VMs might disallow that (e.g., why are requests sent by users to a WMF Horizon VM being routed to an external machine that is under my personal control alone? And how is the privacy of the users affected?). But if that is no problem for you, and it is hard for you to provide the storage, I am perfectly fine with doing it that way.

I want to make the files available for download for users of the English Wikipedia.

I have no preference on how this happens. Currently there is a link from each book on Wikipedia to the mediawiki2latex VM, and it works by re-rendering anew each time, which overtaxes the users' patience. I want to speed this up with some kind of cache. I asked for space in the file namespace on Wikipedia, but my request was archived without a decision. So I dropped that plan and am now trying to install a cache in the mediawiki2latex VM.
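
To make the idea concrete, here is a rough sketch of the kind of disk cache I have in mind, in Python for brevity (mediawiki2latex itself is written in Haskell). The `render_book` helper is a hypothetical stand-in for the actual renderer, and the cache directory is just an example path:

```
import hashlib
import os

CACHE_DIR = "/srv/pdf-cache"  # would live on the requested storage volume

def render_book(title):
    # Hypothetical stand-in for the real renderer, which takes hours per book.
    raise NotImplementedError("invoke mediawiki2latex here")

def cache_path(title, revision_id):
    """Key the cache on title + revision ID so an updated book is re-rendered."""
    key = hashlib.sha256(f"{title}@{revision_id}".encode()).hexdigest()
    return os.path.join(CACHE_DIR, key + ".pdf")

def get_pdf(title, revision_id):
    """Serve a cached PDF if present, otherwise render and store it."""
    path = cache_path(title, revision_id)
    if not os.path.exists(path):
        os.makedirs(CACHE_DIR, exist_ok=True)
        pdf_bytes = render_book(title)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(pdf_bytes)
        os.replace(tmp, path)  # atomic rename: readers never see a partial file
    return path
```

Keying on the revision ID means a stale copy is simply never hit again after a book changes, so old entries can be cleaned up lazily.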

As for the bottlenecks in my current system, there are many. Currently 50% of the wall-clock time is needed to determine the authors and contributors of texts and images, which is done by parsing all page-history web pages (this could be sped up by querying the database replica instead). The download time of the media is less than 10%. But the processing of images in the LaTeX typesetting engine is quite slow: each pixel has to be touched twice per run, and I need four LaTeX runs to get a proper table of contents and indices, so every pixel is touched eight times in total, with PNG re-compression each time.
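
For illustration, the author/contributor lookup through the Wiki Replicas could look roughly like the sketch below. It assumes pymysql, the standard replica.my.cnf credentials file, and the usual replica host naming, all of which would need to be double-checked; again Python, only to show the query shape:

```
import os
import pymysql

# Connection details are assumptions based on the standard Wiki Replicas setup.
conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    db="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)

def contributors(page_title, namespace=0):
    """Distinct authors of a page via page -> revision -> actor,
    instead of parsing every page-history HTML page."""
    sql = """
        SELECT DISTINCT actor_name
        FROM page
        JOIN revision ON rev_page = page_id
        JOIN actor ON actor_id = rev_actor
        WHERE page_namespace = %s AND page_title = %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (namespace, page_title.replace(" ", "_")))
        return [row[0].decode("utf-8") for row in cur.fetchall()]
```

One query per page would replace dozens of history-page fetches, which is where the 50% of wall-clock time currently goes.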

The storage does not need to be on RAID, and of course it does not need to be archived; I can re-upload everything from my backup at any time. And of course it is OK if you can provide the storage starting from August 2020, since I will need until then to prepare the PDFs anyway. It also doesn't matter if you need a few more months.

And about re-rendering: yes, each time I re-render I have to fetch and re-compress all images again. There is currently no cache in the system at all.

I asked for space in the file namespace on Wikipedia, but my request was archived without a decision.

Is there a link to that request?

I want to make the files available for download for users of the English Wikipedia.

Another storage thought I just had: would these PDFs be a reasonable thing to host on Commons? Broadly speaking, Commons is a platform optimized and maintained to manage storage and retrieval of media files, including PDFs. The only thing I can think of that might run afoul of the Commons community's rules is if the PDFs embed "fair use" images.

The link where I asked for space on Wikipedia is here:

https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)/Archive_159#running_a_bot_to_upload_pdf_versions_of_Wikipedia_books_to_Wikipedia

Commons cannot be used due to fair use images. You are absolutely right about this point.

NFS storage is OK for me.
[...snip...]
And of course it is OK if you can provide the storage starting from August 2020, since I will need until then to prepare the PDFs anyway. It also doesn't matter if you need a few more months.

Right now I think that NFS is the best bet, but if the storage is not actually needed for another 7 months (good job thinking ahead!) I feel like the best thing to do today is close this task as 'Declined' and then revisit the available options closer to the date where you will actually have the need for the storage. At the very least NFS storage will be possible, but by that time we may have other storage options for Cloud VPS projects.

I am fine with closing the request now. I will refile it in seven months.

Closing this round, but we should revisit this need and grant either NFS access or "better" storage to the project when it is ready to use it.