
Back up of Commons files
Closed, Resolved · Public

Description

Proposed in Community-Wishlist-Survey-2016. Received 28 support votes, and ranked #54 out of 265 proposals. View full proposal with discussion and votes here.

Problem

Because of various software bugs, misconfigurations, or software interactions, files are sometimes lost from Wikimedia Commons. Sometimes they are restored later, but generally after a long, unpredictable period of time; in many cases they are never restored, either because the files seem to be permanently lost or because nobody knows how they can be restored. It is often not easy to re-upload them from other sources, as the files were modified or created specifically for use in Wikimedia wikis and are not stored elsewhere.

Who would benefit

Users of Wikimedia Commons (and of other wikis) who rely on the files. They would find Wikimedia Commons to be more reliable file storage.

Proposed solution

Create a continuous backup of all uploaded files that would allow devs to restore files, on community request, within a predictable period of time (a few days? a week?).

Technical details

Time, expertise and skills required

  • e.g. 2-3 weeks, advanced contributor, javascript, css, etc

Suitable for

  • e.g. Hackathon, GSOC, Outreachy, etc.

Proposer

Ankry

Related links

Event Timeline

Poyekhali renamed this task from Back up of common files to Back up of Commons files. Mar 11 2017, 2:37 AM
Poyekhali subscribed.

I'd be willing to help with backing it up

I have no idea what this request wants from ArchiveTeam/WikiTeam that we aren't already doing (http://archiveteam.org/index.php?title=Wikimedia_Commons), so I'm removing our tag.

For the WMF side of things, I guess it would be useful to expand/update https://wikitech.wikimedia.org/wiki/Bacula to clarify what is actually being backed up, how easy it is to recover data from the backups and how the backups can be expanded to cover more things (if uploads aren't covered yet).

Seeing bacula mentioned (I should indeed try to update the wikitech page, although not many things have changed), I just want to point out that bacula is not designed for this kind of thing, for multiple reasons. A few are:

  • Bacula's design model, which focuses on backing up files on a predetermined, time-based schedule.
  • The point-in-time nature of these backups.
  • The fact that the current infrastructure does not have enough space to handle this.

So the bacula infrastructure is not suitable for this.

I am a bit interested in this statement: "Because of various software bugs, misconfigurations, or software interactions". Are these issues identified on the "server" side of things, or are we talking about client-side bugs? For the former (the 4 linked ones in the description of this task fall into this category), we should probably fix those issues; for the latter, aside from the accidental-deletion case, I am failing to see how server-side backups would help much.

Are these issues identified on the "server" side of things

Yes, mostly issues with Swift.

I am wondering how this task is related to "skills required: javascript, css, etc" at all?

I am wondering how this task is related to "skills required: javascript, css, etc" at all?

Apparently this task was filed according to a standard format, under the assumption that some volunteer or intern could potentially do something about it. If it's in the WMF operations realm, then the task description should be adapted.

I agree.

Creating a backup of Commons is a huge infrastructural task that would require significant commitment in terms of design and hardware. It would likely require not just operations work but significant development time as well; it is surely not a hackathon-like project at all.

Also, before anyone embarks on such a project, I think we should seek hard data about file losses, their causes, and all other failures. No, this doesn't mean searching for past tickets, but creating more refined monitoring of our file storage and retrieval system, which is in itself not a small task.

And finally: we should concentrate on fixing the related bugs instead of relying on a complex, expensive backup system (that will have its own bugs, failures and inconsistencies as well) to overcome those.

(In terms of pure data redundancy, we do have a synchronized copy of all of our Swift data in the non-active datacenter as well; but while that means we can expect files to be ~1:1 between the copies, it also means file losses are propagated if they are due to some MediaWiki bug.)
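
To make that concrete, this is roughly the kind of consistency check that could detect divergence between two Swift clusters. It is only a minimal sketch using python-swiftclient; the endpoints, credentials and container name are placeholders, not our real configuration:

```python
# Minimal consistency check between two Swift clusters (illustrative only:
# endpoints, credentials and the container name are placeholders).
import swiftclient


def list_objects(authurl, user, key, container):
    """Return {object name: etag} for every object in the container."""
    conn = swiftclient.Connection(authurl=authurl, user=user, key=key)
    _, objects = conn.get_container(container, full_listing=True)
    return {obj["name"]: obj["hash"] for obj in objects}


def diff_clusters(primary, secondary, container):
    a = list_objects(*primary, container)
    b = list_objects(*secondary, container)
    missing = a.keys() - b.keys()                     # only on the primary
    mismatched = {n for n in a.keys() & b.keys() if a[n] != b[n]}
    return missing, mismatched


missing, mismatched = diff_clusters(
    ("https://swift.dc1.example/auth/v1.0", "account:user", "secret"),
    ("https://swift.dc2.example/auth/v1.0", "account:user", "secret"),
    "wikipedia-commons-local-public",
)
print(f"{len(missing)} missing object(s), {len(mismatched)} etag mismatch(es)")
```

A check like this would catch replication drift, but, as noted above, it cannot help when MediaWiki itself deletes or corrupts a file and the change is faithfully replicated to both copies.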

fgiunchedi triaged this task as Medium priority. Apr 12 2017, 8:00 AM

I'd like to know the current backup policy for Commons, the total storage size, and the architecture of the servers/storage.

Can anyone share the info here or point to the relevant links?

Backup of Commons files is part of the more ambitious "Backup all wiki media files" project, currently being worked on at T262668 and its subtasks.

jcrespo changed the task status from Open to In Progress. Sep 16 2021, 2:40 PM

This is technically done, although we are waiting for a redundant copy at a geographically remote site before closing it.

jcrespo added a project: Goal.

Commonswiki is now backed up in two geographically redundant locations within the WMF infrastructure.

The details of the current architecture are not yet published, as they may evolve while we add more concrete recovery requirements, change the underlying technology, or increase the frequency of backup runs. As of now, however, it is possible to easily recover arbitrary single files from cold storage, out of the 90 million different file blobs from Commons.

This was not easy to implement, mostly because of the inflexibility of the original MediaWiki metadata.

Two copies, from September and December, were created and stored on S3-compatible object storage for fast individual retrieval, and each file's original status was catalogued in a MySQL database.
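
For illustration, recovering a single file in a design like this could look roughly as follows. This is a hypothetical sketch: the table, column, bucket, endpoint and credential names are invented, as the actual schema is not published:

```python
# Hypothetical single-file recovery path (schema, bucket, endpoint and
# credentials below are invented for illustration; the real design is
# not published).
import boto3
import pymysql


def recover_file(sha1: str, dest_path: str) -> None:
    # 1. Look up the catalogued backup entry for this blob by its SHA-1.
    db = pymysql.connect(host="backup-db.example", user="reader",
                         password="secret", database="media_backups")
    try:
        with db.cursor() as cur:
            cur.execute(
                "SELECT storage_key, wiki, title FROM files WHERE sha1 = %s",
                (sha1,),
            )
            row = cur.fetchone()
    finally:
        db.close()
    if row is None:
        raise KeyError(f"no backup catalogued for sha1 {sha1}")
    storage_key, wiki, title = row

    # 2. Fetch the blob itself from the S3-compatible cold storage.
    s3 = boto3.client("s3", endpoint_url="https://backup-store.example")
    s3.download_file("commonswiki-media", storage_key, dest_path)
    print(f"recovered {wiki}:{title} -> {dest_path}")


recover_file("0123456789abcdef0123456789abcdef01234567", "/tmp/restored.jpg")
```

The role of the MySQL catalogue in such a design is to map a file's identity (e.g. its SHA-1) to its storage key and original wiki context, so that a single blob among the 90 million can be located without scanning the object store.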

https://wikitech.wikimedia.org/wiki/Media_storage/Backups will contain more technical details, to be filled in soon. Recovery, especially for non-trivial cases, will require more automation work to optimize for flexibility and speed, but as far as the scope of this ticket is concerned, this is resolved (old files will no longer be able to be lost).

If someone is interested in contributing to this project, I would like to hear from you, as the next step is figuring out recovery use cases and the requirements for those (how much time would be too much between metadata "snapshots"? For how long should we retain a file's lifecycle history, e.g. the history of all its past renames? What is the expected time to recovery, especially for large recoveries? How should recovery work: should recovered files be treated as new uploads, or be silently inserted in the place they used to have, even if that changes their history?).