Commons:Dumps and backups
This page is intended to be the central place for public information about Wikimedia Commons data dumps and backups.
Dumps
Media files
As of 2024, no publicly available dumps of media files have been produced for download since about 2013. There is a request and an open Phabricator ticket to resume them.
Partial scraping
The following tools can be used for web scraping of files in categories or from search results, as an alternative to downloadable complete dumps (a minimal API sketch follows the list below). This approach presumably does not work for very large categories, especially not Category:CommonsRoot, and a tutorial is still missing. Useful resources for scraping Wikimedia Commons files are mw:API:Allimages, Commons:Commons API, meta:API Policy Update 2024, mw:Wikidata Query Service/Categories and mw:API:Categorymembers.
- https://wikilovesdownloads.toolforge.org/ (code & documentation)
- Scripts without Web GUI (untested): WikimediaCommons-scraper, scrape Commons search results images script, scrape JPG images via Commons search, commons-category-downloader PHP script, commons-category-downloader Ruby script
- Dysfunctional: https://magnustools.toolforge.org/can_i_haz_files.html (no documentation, code only embedded in page)
- …
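For illustration, here is a minimal Python sketch of the API-based approach that the tools above automate: it lists the files directly in one category via mw:API:Categorymembers (used as a generator together with imageinfo) and downloads them. It is not one of the listed tools; the category name and user agent are placeholders, it does not recurse into subcategories, and it assumes the third-party requests library.

```python
"""Minimal sketch: download the files directly in one Commons category via the
MediaWiki API. Placeholder category and contact address; no subcategory recursion."""
import os
import requests

API = "https://commons.wikimedia.org/w/api.php"
HEADERS = {"User-Agent": "commons-category-download-example/0.1 (contact: example@example.org)"}
CATEGORY = "Category:Example"  # placeholder category name


def category_files(category):
    """Yield API page records (with image URLs) for every file directly in the category."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "categorymembers",
        "gcmtitle": category,
        "gcmtype": "file",
        "gcmlimit": "max",
        "prop": "imageinfo",
        "iiprop": "url",
    }
    while True:
        data = requests.get(API, params=params, headers=HEADERS, timeout=30).json()
        for page in data.get("query", {}).get("pages", {}).values():
            if "imageinfo" in page:
                yield page
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow API continuation


def download(page, target_dir="downloads"):
    """Stream one original file to disk."""
    os.makedirs(target_dir, exist_ok=True)
    url = page["imageinfo"][0]["url"]
    path = os.path.join(target_dir, url.rsplit("/", 1)[-1])
    with requests.get(url, headers=HEADERS, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 16):
                f.write(chunk)


if __name__ == "__main__":
    for page in category_files(CATEGORY):
        download(page)
```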
If you use a scraping tool, check whether it downloads all file types, files from subcategories, and the files' metadata such as their categories. Note that after several subcategory layers, and especially under special subcategories such as "…in art", the files are often only loosely related to the original category. If search results are scraped, see Help:Searching – especially the -deepcategory:"catname"
search operator to exclude files in a certain category tree – as well as Commons:PetScan.
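As a small illustration of the search-based approach, the sketch below lists File:-namespace search results through the API while excluding a whole category tree with the -deepcategory: operator. The search terms and category name are placeholders, and the requests library is assumed; note that deepcategory has depth and size limits (see Help:Searching).

```python
"""Minimal sketch: search Commons files while excluding a category tree."""
import requests

API = "https://commons.wikimedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "list": "search",
    "srnamespace": 6,  # File: namespace
    "srsearch": 'sunflower -deepcategory:"Paintings"',  # placeholder query
    "srlimit": "max",
}
data = requests.get(API, params=params, timeout=30).json()
for hit in data["query"]["search"]:
    print(hit["title"])
```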
Wikimedia Commons wiki content
All Wikimedia Commons wiki pages, including all of their past revisions (excluding deleted ones), are included in XML dumps, which are generated on a regular basis and publicly available for download at https://dumps.wikimedia.org/commonswiki/ (info).
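As a rough illustration of how such a dump can be processed, the standard-library Python sketch below streams a bz2-compressed XML dump and prints page titles without loading the whole file into memory. The filename is an example of the naming pattern on the dumps server; pick an actual file from https://dumps.wikimedia.org/commonswiki/ for the date you need.

```python
"""Minimal sketch: stream a bz2-compressed Commons XML dump and print page titles."""
import bz2
import xml.etree.ElementTree as ET

DUMP = "commonswiki-latest-pages-articles.xml.bz2"  # example filename from the dumps server

with bz2.open(DUMP, "rb") as f:
    for _event, elem in ET.iterparse(f):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML schema namespace prefix
        if tag == "title":
            print(elem.text)
        elif tag == "page":
            elem.clear()  # free each finished <page> to keep memory bounded
```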
Backups
Media files
All media files (including their past versions) in Wikimedia Commons are backed up on dedicated servers in both Wikimedia Foundation application data centers: Eqiad (Ashburn, Virginia, USA) and Codfw (Carrollton, Texas, USA). These backups are not accessible to the public (including registered Wikimedia users), only to Wikimedia Foundation staff, since they include deleted files and other data that cannot be made publicly accessible. The backups at each data center are fully independent of each other for redundancy reasons.
Wikimedia Commons wiki content
The wiki content of all Wikimedia wikis (including Commons), which is stored in MariaDB databases, is also backed up in both Wikimedia Foundation application data centers. These backups also include the full version history of all pages.
History
Before 2014, when a second facility for redundancy came online, all Wikimedia sites operated from a single application data center (there were, as there are now, additional data centers for caching and optimal content distribution, but without any permanent data storage). While geographical redundancy and XML text dumps were in place, no true text database backups were implemented until 2020–2021, after several years of work. Backups for media files were not in place until 2021–2022. Offline backups (for example, on tape) were featured as "coming next" in a Wikimedia Foundation presentation (slide 48), but there is no explicit mention of them on Wikitech or in Phabricator.
Sources
edit- Phabricator ticket: Produce regular public dumps of Commons media files
- Phabricator ticket: WMF media storage must be adequately backed up
- Phabricator ticket: Set up backup strategy for es clusters
- Data centers (Wikitech)
- Media storage/Backups (Wikitech)
- MariaDB/Backups (Wikitech)
- Wikimedia Foundation selects CyrusOne in Dallas as new data center