
Migrate SHA-1 hashes to SHA-256 (tracking)
Open, HighPublic

Description

SHA-1 collisions can now be manufactured semi-practically so we should start planning to migrate off SHA-1 to SHA-256.

The attack requires creating a common prefix and suffix that can be slipped into pairs of files, so cannot generally be used to create 'duplicates' of existing files, but can be used to create pairs or sets of files that can confound and confuse expected behavior once put into a common system.

Priority of adding better hashes is highest where the SHA-1 hash is used for unique addressing of user-submitted data, lowest where used to validate server-generated downloads.

Event Timeline

Deskana triaged this task as High priority. Edited Feb 24 2017, 7:00 PM
Deskana subscribed.

Reading the relevant publication, the priority seems high but not at the "drop everything" stage; although the new technique does make the attack several orders of magnitude more efficient, it is still fairly impractical in terms of the CPU/GPU time required.

I'm not sure it really needs to be a "High" priority task. SHA-1 has been known to be insecure since 2005; what is new is only that someone has actually demonstrated a break. However, as brion already wrote, the announced attack requires an identical prefix and suffix in both files, plus a block of (non-visible) data in between, to work. This is probably
(a) too cost-intensive to break functions inside MediaWiki
(b) not even possible that easily

So, from my point of view, we should evaluate where we use SHA-1 hashes and decide for each usage what priority applies to that specific use case (including how effectively it can be broken).

Please don't misunderstand this comment: I think the mid-term goal should be to migrate to a safer algorithm, but I don't think it _needs_ to be a high priority :)

Thoughts on whether it's worth banning the prefix of the SHAttered files on upload? The attack scenario seems very minor at this point, but it might help put users' minds at ease.
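For illustration only, a minimal sketch of such a check at upload time; `SHATTERED_PREFIX` is a placeholder (a generic PDF header), not the actual identical prefix from the published PDFs, which would have to be pasted in from the shattered.io files:

```
# Placeholder constant: the real identical prefix bytes from the published
# SHAttered PDFs would need to be substituted here.
SHATTERED_PREFIX = b"%PDF-"  # placeholder only, NOT the real prefix

def has_shattered_prefix(path: str) -> bool:
    """True if the uploaded file starts with the known colliding prefix."""
    with open(path, "rb") as f:
        return f.read(len(SHATTERED_PREFIX)) == SHATTERED_PREFIX
```

A more general option might be counter-cryptanalysis, e.g. the sha1collisiondetection library that Git adopted, which detects files produced with this class of attack rather than only the two published PDFs.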

Have we considered adding SHA-256 in addition to (not in place of) SHA-1? We were informally discussing some mitigation strategies on the infrastructure side, and checking several hashes may be more secure than replacing one with another: SHA-256 will presumably be broken at some point too, but finding an input that collides under both SHA-1 and SHA-256 will probably take an order of magnitude more effort (it may even be practically impossible). I was told this is not really a hack, but a technique used, for example, for checking Debian packages.
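As a rough illustration of the "check both, with AND logic" idea (hypothetical helper names, not actual MediaWiki code):

```
import hashlib

def file_digests(path: str, chunk_size: int = 1 << 20) -> tuple[str, str]:
    """Compute SHA-1 and SHA-256 of a file in a single streaming pass."""
    sha1, sha256 = hashlib.sha1(), hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
            sha256.update(chunk)
    return sha1.hexdigest(), sha256.hexdigest()

def matches_stored(path: str, stored_sha1: str, stored_sha256: str) -> bool:
    """AND logic: the file is only treated as identical if both digests match."""
    sha1, sha256 = file_digests(path)
    return sha1 == stored_sha1 and sha256 == stored_sha256
```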

As a bonus, old code that is not critical can stay the same, and I would guess that adding code and a column to the schema is easier than replacing them. Insecure API calls could continue being used while a full transition is done, and in the future the old column can simply be left unused.

As a downside, non-MediaWiki applications like Labs tools could be tempted not to update their code, keep using SHA-1, and remain vulnerable, although I am not sure how big an impact that would be.

I'm not an expert on crypto, so it's possible I'm misinterpreting the paper, but I believe that https://www.iacr.org/archive/crypto2004/31520306/multicollisions.pdf suggests such constructions aren't really much more secure than just using SHA-256.

If I understand the suggestion of @jcrespo correctly, it isn't meant to add more security on its own, but to make the transition to a new hash algorithm easier (as he wrote: replacing a hash algorithm is probably more complex than adding a new one). If I understood him correctly, he proposed something like the following (a rough sketch in code follows the list):

  • If a new file is uploaded, generate both a SHA-1 and a SHA-256 checksum, instead of just a SHA-1 or only a SHA-256 checksum

Let's assume someone uploads another (or a "bad") file:

  • The SHA-1 checksum is generated first and checked against all known SHA-1 checksums in the database

For matching files that also have a SHA-256 checksum:

  • Generate the SHA-256 checksum of the "new" file and check that, too

For "old" files, which only have a SHA-1 checksum:

  • Check whether the file still exists (it should) and generate its SHA-256 checksum on the fly (and, of course, save it in the database), then check it against the new file
  • If the file, for whatever reason, no longer exists, use the SHA-1 checksum alone to compare with the new file, as that is the only thing we can do in this case.
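A very rough sketch of that decision flow, reusing the file_digests helper from the earlier sketch; the row dicts and the load_original / save_sha256 callbacks are hypothetical stand-ins for MediaWiki's actual image table and file backend:

```
def find_duplicate(upload_path, candidate_rows, load_original, save_sha256):
    """candidate_rows: DB rows (dicts) whose stored SHA-1 equals the upload's SHA-1.
    load_original(row): local path of the stored original file, or None if it is gone.
    save_sha256(row, digest): persist a newly computed SHA-256 for an old row."""
    _, new_sha256 = file_digests(upload_path)

    for row in candidate_rows:
        if row.get("sha256"):
            # Already migrated: both checksums must agree.
            if row["sha256"] == new_sha256:
                return row
        else:
            original = load_original(row)
            if original is not None:
                # Old row: compute SHA-256 on the fly, backfill it, then compare.
                _, old_sha256 = file_digests(original)
                save_sha256(row, old_sha256)
                if old_sha256 == new_sha256:
                    return row
            else:
                # Original file is gone: the SHA-1 match is all we have.
                return row
    return None
```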

I'm not sure whether this can scale to other usages of SHA-1 checksums, and, of course, I'm also far from being an expert in this area, so I may be mistaken in thinking that this is at least possible to do and no less secure than having only a SHA-1 checksum.

Also: probably for some (or most?) cases where we use SHA-1 checksums, we still have the original content (file, text, whatever), so wouldn't it be possible to migrate to another hash algorithm given some amount of time and CPU power? If we do that, we should probably also implement another way of storing hashes: instead of having a column tied to one algorithm (like img_sha1), we should probably have one that can store any hash (like we already do for passwords), where the stored value itself indicates which algorithm was used to calculate it. This would at least minimize the database schema changes whenever we want to change the algorithm.
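For illustration, a self-describing storage format along those lines could look roughly like this (the ":<algorithm>:<hex digest>" layout is just an assumption for the example, not an existing MediaWiki convention):

```
import hashlib

# Hypothetical self-describing format: ":<algorithm>:<hex digest>"
SUPPORTED = {"sha1": hashlib.sha1, "sha256": hashlib.sha256}

def encode_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Store the algorithm name alongside the digest in a single column."""
    return f":{algorithm}:{SUPPORTED[algorithm](data).hexdigest()}"

def verify_digest(data: bytes, stored: str) -> bool:
    """Pick the algorithm from the stored value itself, then recompute and compare."""
    _, algorithm, digest = stored.split(":", 2)
    return SUPPORTED[algorithm](data).hexdigest() == digest
```

A later switch to yet another algorithm would then only need a new entry in the lookup table rather than another schema change.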

I'm not sure it really needs to be a "High" priority task

"High" was an educated guess based on the discussion. Feel free to change it if that's more accurate. :-)

I'm not an expert on crypto, so it's possible I'm misinterpreting the paper, but I believe that https://www.iacr.org/archive/crypto2004/31520306/multicollisions.pdf suggests such constructions aren't really much more secure than just using SHA-256.

That paper talks about iterating hashes, e.g. "hashing twice", something that, to me, quite logically doesn't help. I am not suggesting that; I am suggesting keeping the current hash and adding another one, checking both with AND logic, not chaining them.

I also mentioned that there are some downsides, like adding code or being tempted to continue using the old hash, so I will let the right people decide :-).

The paper is talking about concatenating (which is equivalent to using AND logic) different iterative hashes. An iterative hash function is a common type of hash function; MD5, SHA-1, and SHA-256 are all examples of iterative hash functions. It is not talking about applying the same hash function multiple times.

They are basically saying that the big-O time to find an input that collides in two different iterative hash functions is roughly the same as the time to find a collision in just the stronger hash function. (They do say that if both hash functions have shortcut attacks, whether "both" shortcut attacks can be used depends on the details of those attacks, but I think we should operate under the assumption that they can.)
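As a back-of-the-envelope illustration of that result (my own rough numbers, taking the generic multicollision cost from the Joux paper and ignoring SHA-1's known shortcut attacks), colliding SHA-1(x) || SHA-256(x) costs roughly:

```
% Joux-style attack on the concatenation: build a 2^128-way multicollision in
% SHA-1, then birthday-search that set for a SHA-256 collision.
\underbrace{\tfrac{256}{2}\cdot 2^{160/2}}_{\text{multicollision in SHA-1}}
\;+\;
\underbrace{2^{256/2}}_{\text{birthday search in SHA-256}}
\;\approx\; 2^{87} + 2^{128} \;\approx\; 2^{128}
```

which is about the same as the generic 2^128 bound for SHA-256 alone, rather than the 2^208 one might naively hope for from the concatenation.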

So we may want to store both hashes to ease the transition period, but we should not assume that it provides any more security than just using the new hash.

Right now, my plan is to hash all wiki media files with SHA-256 in order to track their backups, and to store the hashes in a database: T262668

These are operations outside of MediaWiki, but ping me in case my findings/experience would be helpful to anyone here.

As promised, I have downloaded and hashed 99.94% of Commons files with both SHA-1 and SHA-256, and stored them in a separate metadata database (for backups). Sadly, it is difficult to contribute that back, and even to keep it up to date, because of the lack of proper unique identifiers for blobs in MediaWiki, but I will soon document all my learnings about the operational challenges of the MediaWiki workflow, including hashing collisions, hoping that will be useful for someone else. Feel free to ping me in private if you want to know more.