Commons:Dumps and backups

This page is intended to be the central place for public information about Wikimedia Commons data dumps and backups.

Dumps

Media files

As of 2024, no dumps of media files are publicly available for download; the last ones date from about 2013. There is a request, and an open Phabricator ticket, to resume them.

Partial scraping

The following tools can be used for web scraping of files in categories or from search results, which is an alternative to downloadable (or otherwise acquirable) complete dumps. Scraping presumably does not work for large categories, especially not Category:CommonsRoot, and a tutorial is missing. Useful resources for scraping Wikimedia Commons files are mw:API:Allimages, Commons:Commons API, meta:API Policy Update 2024, mw:Wikidata Query Service/Categories and mw:API:Categorymembers.
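As a rough illustration of the mw:API:Categorymembers approach, the following Python sketch lists the file titles in one category and follows the API's continuation tokens. It is not an official tool: the User-Agent string, the function names and the injectable `fetch` parameter are assumptions made for this example, and it only lists titles rather than downloading anything.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"
# Hypothetical User-Agent; replace it with one that identifies your own tool.
USER_AGENT = "ExampleCommonsLister/0.1 (contact: you@example.org)"


def build_params(category, cont=None):
    """Build the query parameters for one list=categorymembers request."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,   # e.g. "Category:Example"
        "cmtype": "file",      # files only, no subcategories or pages
        "cmlimit": "500",
    }
    if cont:
        # Pass back the continuation values returned by the previous request.
        params.update(cont)
    return params


def fetch_json(params):
    """Send one GET request to the Commons API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def list_category_files(category, fetch=fetch_json):
    """Yield the titles of all files directly in `category`."""
    cont = None
    while True:
        data = fetch(build_params(category, cont))
        for member in data.get("query", {}).get("categorymembers", []):
            yield member["title"]
        cont = data.get("continue")
        if not cont:
            break
```

Calling `list(list_category_files("Category:Example"))` would return the file titles of that (placeholder) category. The `fetch` argument exists only so that the paging logic can be tried without network access.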

If you use a scraping tool, check whether it downloads all file types, whether it includes files from subcategories, and whether it also retrieves the files' metadata, such as their categories. Often, several subcategory layers down and/or in special subcategories such as "…in art", the files are only loosely related to the original category. If search results are scraped, see Help:Searching (especially the -deepcategory:"catname" search operator, which excludes files in a certain category) as well as Commons:PetScan.
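Because relevance tends to drop with subcategory depth, a scraper may want to cap how deep it descends and record where each file was found. The sketch below shows one possible breadth-first walk with a depth limit, again based on list=categorymembers; the function names, the User-Agent string and the `get_members` parameter are assumptions for this example, not part of any existing tool.

```python
import json
import urllib.parse
import urllib.request
from collections import deque

API_URL = "https://commons.wikimedia.org/w/api.php"
# Hypothetical User-Agent; replace it with one that identifies your own tool.
USER_AGENT = "ExampleCommonsWalker/0.1 (contact: you@example.org)"


def api_members(category):
    """Return (title, type) pairs for one category; type is 'file' or 'subcat'."""
    members, cont = [], {}
    while True:
        params = {
            "action": "query", "format": "json", "list": "categorymembers",
            "cmtitle": category, "cmtype": "file|subcat",
            "cmprop": "title|type", "cmlimit": "500", **cont,
        }
        url = API_URL + "?" + urllib.parse.urlencode(params)
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req, timeout=30) as resp:
            data = json.load(resp)
        members += [(m["title"], m["type"]) for m in data["query"]["categorymembers"]]
        cont = data.get("continue", {})
        if not cont:
            return members


def walk_category(root, max_depth, get_members=api_members):
    """Breadth-first walk from `root`, descending at most `max_depth` levels.

    Returns {file title: (depth, category where it was first found)}.
    """
    found = {}
    seen = {root}                 # guards against category loops
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        for title, kind in get_members(category):
            if kind == "file":
                # Keep the shallowest location if a file appears more than once.
                found.setdefault(title, (depth, category))
            elif kind == "subcat" and depth < max_depth and title not in seen:
                seen.add(title)
                queue.append((title, depth + 1))
    return found
```

With `max_depth=0` only the files directly in the root category are returned. The recorded depth makes it possible to review or discard files from deep subcategories afterwards.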

Wikimedia Commons wiki content

All Wikimedia Commons wiki pages, including all past revisions in their history (excluding deleted ones), are included in XML dumps, which are generated on a regular basis and are publicly available for download at https://dumps.wikimedia.org/commonswiki/ (info).
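Once a dump file has been downloaded, it can be read as a stream instead of being loaded into memory at once. The following Python sketch assumes a bzip2-compressed MediaWiki XML export with `page`, `title` and `text` elements. The helper names are made up for this example, and the XML namespace is stripped rather than hard-coded, on the assumption that the export schema version differs between dumps.

```python
import bz2
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element in a MediaWiki XML export.

    If a page contains several revisions, the text of the last one listed
    in the file is returned.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, None
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the page's contents once it has been processed


def iter_dump(path):
    """Open a .xml.bz2 dump at `path` and yield (title, text) pairs."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

A loop such as `for title, text in iter_dump(path): ...`, where `path` points to a downloaded dump file, would then visit one page at a time.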

Backups

Media files

All media files in Wikimedia Commons (including their past versions) are backed up on dedicated servers in both Wikimedia Foundation application data centers: Eqiad (Ashburn, Virginia, USA) and Codfw (Carrollton, Texas, USA). These backups are not accessible to the public (including registered Wikimedia users), only to Wikimedia Foundation staff, since they include deleted files and other data that cannot be made public. For redundancy, the backups at each data center are fully independent of each other.

Wikimedia Commons wiki content

Wiki content of all Wikimedia wikis (including Commons), which is stored in MariaDB databases, is also backed up in both Wikimedia Foundation application data centers. Those backups also include full version history for all pages.

History

Before 2014, when a second facility came online for redundancy, all Wikimedia sites operated from a single application data center (there were, as there are now, additional data centers for caching and optimal content distribution, but without any permanent data storage). Although geographical redundancy and XML text dumps were in place, no true text-only database backups were implemented until 2020-2021, after several years of work. Backups for media files were not in place until 2021-2022. Offline backups (for example, on tape) were featured as "coming next" in a Wikimedia Foundation presentation (slide 48), but there is no explicit mention of them on Wikitech or Phabricator.

Sources

See also
