After the analysis, design, implementation and first run of full backups at both WMF primary datacenters (eqiad and codfw), a workflow for backup and limited recovery is already in place.
If one or a few files were to be lost, corrupted or otherwise unavailable from Commons, or from any other wiki hosting media files, there exists a method/application that can easily recover a single object, or a few related ones (e.g. all those uploaded with a given title, or with a given hash), from backups and upload them back into the production Swift cluster.
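As an illustration of that workflow, the following is a minimal sketch of such a single-object recovery. It assumes a hypothetical backup metadata table (backup_files, with illustrative column names) on a MariaDB host and an S3-compatible backup storage endpoint; the real schema, hosts and credentials will differ:

```python
#!/usr/bin/env python3
"""Sketch: look up a file in the backup metadata database by title or hash,
download it from backup storage and upload it back to production Swift.
All table, column, host and credential names are placeholders."""

import boto3
import pymysql
from swiftclient.client import Connection


def find_backups(db, wiki, title=None, sha256=None):
    """Return backup metadata rows matching a title and/or a hash."""
    query = ("SELECT backup_container, backup_path, swift_container, swift_path "
             "FROM backup_files WHERE wiki = %s")
    params = [wiki]
    if title is not None:
        query += " AND title = %s"
        params.append(title)
    if sha256 is not None:
        query += " AND sha256 = %s"
        params.append(sha256)
    with db.cursor() as cursor:
        cursor.execute(query, params)
        return cursor.fetchall()


def recover(row, backup_s3, swift):
    """Copy one object from backup storage back into production Swift."""
    backup_container, backup_path, swift_container, swift_path = row
    obj = backup_s3.get_object(Bucket=backup_container, Key=backup_path)
    swift.put_object(swift_container, swift_path, contents=obj["Body"].read())


if __name__ == "__main__":
    db = pymysql.connect(host="backup-metadata-db", user="recovery",
                         password="...", database="mediabackups")
    backup_s3 = boto3.client("s3", endpoint_url="https://backup-storage.example")
    swift = Connection(authurl="https://swift.example/auth/v1.0",
                       user="mw:media", key="...")
    for row in find_backups(db, "commonswiki", title="Example.jpg"):
        recover(row, backup_s3, swift)
```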
Equally, if there was a full data availability outage (e.g. all or a large part of the files were lost), a "best effort" recovery would always be possible, where all files in the backup would be sent back to production as they were at the time of the last backup. This use case, of course, needs large-scale testing to make sure we could optimize for performance, not only by parallelizing the recovery, but also by ordering it, giving a priority to each file (e.g. by file type, by access statistics, by wiki usage, by state, etc.). While this needs further preparation, the way to move forward is relatively clear.
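A sketch of how such a prioritized, parallel restore could be driven follows; the priority criteria are purely illustrative and would have to be tuned during the large-scale testing mentioned above, and restore_one stands in for a per-file recovery function like the one sketched earlier:

```python
from concurrent.futures import ThreadPoolExecutor


def priority(f):
    """Hypothetical priority: lower tuples are recovered first.
    The real criteria (file type, access statistics, wiki usage, state)
    are still to be decided."""
    state_rank = 0 if f["state"] == "public" else 1
    wiki_rank = 0 if f["wiki"] == "commonswiki" else 1
    type_rank = {"original": 0, "thumbnail": 2}.get(f["kind"], 1)
    return (state_rank, wiki_rank, type_rank)


def full_recovery(files, restore_one, workers=32):
    """Restore every backed-up file, highest priority first, in parallel."""
    ordered = sorted(files, key=priority)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(restore_one, ordered))
```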
However, as with most recoveries, the case in between these two scenarios, where a substantial amount of files is lost, but not all, creates the most challenges. Unlike the media backups, the highly available media storage in production, in its logical model as used by MediaWiki, does not follow an append-only model, and in the opinion of the author of this document (and numerous other MediaWiki stakeholders) has clear inefficiencies and defects due to lack of maintenance over the years. Equally, a recovery of any physical, concrete set of bytes will not work without its matching metadata, and the metadata state also changes during the life of the file (it can be the latest version, an old version, deleted (with all its versions) or only one deleted revision of it), and depending on that state, its required metadata and recovery location change. There is no immutable identifier to consistently track a media file or media page, and the closest thing to it, a hash, uses the older SHA-1 algorithm, which has been demonstrated in practice to generate collisions rather easily.
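To make the previous point more concrete, here is a hedged sketch of how the state of a file determines where it would have to be restored; the container names and path layout are simplified placeholders, not the exact sharded layout used by MediaWiki's Swift backend:

```python
from enum import Enum


class FileState(Enum):
    PUBLIC = "public"      # latest version of a visible file
    ARCHIVED = "archived"  # an older, superseded version
    DELETED = "deleted"    # removed from the wiki, kept in a deleted zone


def recovery_location(wiki, state, title, sha1=None):
    """Illustrative only: the real container names, sharding and key scheme
    have to be looked up per state and per wiki at recovery time."""
    if state is FileState.PUBLIC:
        return (f"{wiki}-local-public", title)
    if state is FileState.ARCHIVED:
        return (f"{wiki}-local-public", f"archive/{title}")
    return (f"{wiki}-local-deleted", sha1)
```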
As the file changes, the question of how to recover it becomes non-trivial, and has the same issues as trying to partially recover a database: some recoveries will be trivial, while others will require decisions and trade-offs. For example, a file could be uploaded, then one of its revisions lost, but by the time it is recovered, a new file could have been uploaded as the new, latest version, or the file could have been renamed, or removed, or a combination of all of these. There is no easy way to automatically merge backups and further changes, and not all changes are even tracked or possible to back up easily (e.g. a file can be deleted and restored multiple times, logs can be unreliable or difficult to parse, etc.).
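The kind of decision logic a partial recovery would need is sketched below, purely as an assumption of what such a check could look like; both metadata rows are hypothetical dictionaries, and the conflicting cases end up queued for a human decision rather than being resolved automatically:

```python
def plan_restore(backup_row, current_row):
    """Decide what to do with one backed-up file version, comparing the
    backup metadata against the current production metadata (or None if
    the title no longer exists). Field names are illustrative."""
    if current_row is None:
        return "restore"              # title no longer exists: re-create it
    if current_row["sha256"] == backup_row["sha256"]:
        return "skip"                 # production already has this content
    if current_row["timestamp"] > backup_row["timestamp"]:
        return "manual_review"        # newer upload/rename/delete since backup
    return "restore_as_old_version"   # backup fills in a lost older revision
```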
While in an ideal world we would improve the logical storage model as needed (adding unique identifiers, using a hash other than SHA-1, moving towards a more append-only model, not requiring file changes after upload, etc.), backups have to work now, and cannot wait for a rearchitecture of production media hosting.
As a consequence, in the current state, while backing up every single file at once (full backups) or continuously (streaming backups) is very easy to implement (through Kafka events, or by monitoring the database at certain intervals), recovery will require some concessions. For example, following the model of databases, a snapshot of the metadata could be generated (every week? every day?) and we could recover a consistent state of metadata and files up to that incremental backup. We could also store those incremental metadata backups for 3 months. Under these assumptions, recovery could be done to those points in time, but no further. Theoretically, all files could be retrieved (losing at most those uploaded between the last backup and the incident), but there could be no automatic recovery of changes made since the latest metadata snapshot.
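Under those assumptions, a point-in-time recovery would start by choosing which metadata snapshot to restore from. A small sketch of that selection follows, using the 3-month (here, 90-day) retention suggested above and hypothetical snapshot records carrying a taken_at timestamp:

```python
from datetime import datetime, timedelta

# Assumed retention for incremental metadata backups (~3 months).
SNAPSHOT_RETENTION = timedelta(days=90)


def usable_snapshots(snapshots, now=None):
    """Keep only the metadata snapshots still within the retention window."""
    now = now or datetime.utcnow()
    return [s for s in snapshots if now - s["taken_at"] <= SNAPSHOT_RETENTION]


def snapshot_for(snapshots, target_time):
    """Pick the most recent metadata snapshot taken at or before the requested
    recovery point; anything uploaded or changed after it cannot be restored
    automatically."""
    candidates = [s for s in snapshots if s["taken_at"] <= target_time]
    if not candidates:
        raise ValueError("no metadata snapshot covers the requested time")
    return max(candidates, key=lambda s: s["taken_at"])
```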
Another potential workflow is implementing the recovery as just another MediaWiki user: recovery would happen not directly against Swift and the databases, but by generating new uploads (and new upload log entries), without breaking the MediaWiki workflow, even if that means that older references to certain files would be lost. For example, if an old version of a file had been lost, we would upload it again, generating a recent log entry, which would be visible to end users.
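In this workflow, a recovered file would be pushed through the standard MediaWiki Action API rather than written to Swift directly. A minimal sketch, assuming an already authenticated requests session (bot password or OAuth) against the target wiki's api.php:

```python
import requests

API = "https://commons.wikimedia.org/w/api.php"  # target wiki's Action API


def reupload(session, filename, local_path, comment):
    """Re-upload a recovered file as a regular MediaWiki upload, so a new
    upload log entry is generated instead of writing to Swift directly."""
    # Fetch a CSRF token for the authenticated session.
    token = session.get(API, params={
        "action": "query", "meta": "tokens", "type": "csrf", "format": "json",
    }).json()["query"]["tokens"]["csrftoken"]

    # Perform the upload; warnings (e.g. duplicate content) are ignored here,
    # which is itself a policy decision a real recovery would have to make.
    with open(local_path, "rb") as f:
        result = session.post(API, data={
            "action": "upload", "filename": filename, "comment": comment,
            "ignorewarnings": 1, "token": token, "format": "json",
        }, files={"file": f})
    return result.json()
```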
These two would, of course, be only two of many possible workflows: it is those who will respond to a media outage who will set the requirements and the preferred supported use cases for recovery, and the finalization of the backup workflow will be built around that. Several workflows are possible (backups were built with flexibility in mind), although probably not all will be supported in practice, or not with the same priority, always taking into account the finite backup resources available.