colliding, fast and slow
I found it hard to locate a good reference explaining how various hash attacks apply to password hashing. Somebody might reasonably ask how the SHA1 collision, or an extension thereof, would apply to bcrypt. Can bcrypt have collisions? It’s a strange question if you know the answer, but knowing that much requires synthesizing a fair bit of knowledge that’s not all in one place.
Start with the usual crypto hashes. Classics like MD5, current standards like SHA2, new hotness like BLAKE2. All of them are supposed to be collision resistant, and it’s bad news when somebody finds that they’re not. A collision attack is pretty simple to understand. Two inputs have the same hash.
An example attack is Mallory generates two messages with identical hashes, “IOU $10” and “IOU $1000”, and borrows $1000 from Alice, who accepts SHA1(“IOU $1000“) = 0x65de12 as a contract. (Digital signatures usually involve signing a hash of the document.) Later, Mallory pays Alice $10 and produces SHA1(“IOU $10“) = 0x65de12 to prove the debt has been paid. Alice is out $990. This is a collision attack. The adversary has control over both messages and the hash.
What if Alice produces the contract and calculates the hash? Now Mallory needs to find a second message, but isn’t able to manipulate the first message or the hash value. This is called a second preimage attack, and it’s much harder to accomplish than a collision attack. Preimage is just a confusing crypto term for input. Once you know that, it’s easy to understand that a second input attack is exactly that: finding a second input that hashes to the same value as the first input. This is more difficult than a collision attack because, very loosely speaking, collisions are usually generated by finding special sequences of blocks that tickle the internal hash state, and you’re not likely to find them by accident.
There’s also something called a first preimage attack. Given only a hash, find an input that produces it. In essence, run the hash backwards to generate an input. (Technically, it’s not required to find exactly the same input. There are an infinite number of inputs that will produce any given hash.) This is a really hard attack. Not only does the attacker not have control over the hash value, they don’t even have a hint as to what input produced it. Just stabbing in the dark.
The difference between collision attacks and preimage attacks explains why MD5 is broken, but HMAC-MD5 is perhaps not. (This does not mean you should continue using HMAC-MD5.) The input to HMAC involves a secret key, unknown to the attacker, and so producing a forged message requires a first preimage attack to recover the key. This is also applicable to password hashing.
An attacker with access to a stolen user database has lots of hashes, but no inputs. It remains computationally infeasible to turn MD5(“password“) back into “password”. However, because MD5 is so fast, it is feasible to run a few hundred billion inputs through MD5 to see what comes out.
For this reason, a good password hash is slow. The popular How To Safely Store A Password covers this well, but doesn’t touch on the issue of collisions. Short answer: irrelevant. MD5 is a terrible password hash, but not because of collisions. Now, if somebody ever does find a first preimage attack on MD5, that would be an amazing breakthrough, but still probably irrelevant for passwords. The computation could well dwarf that required to simply bruteforce the limited input space of likely passwords.
Conclusion: password hashes aren’t broken by cryptanalysis. They’re rendered irrelevant by time (hardware advancements). What was once expensive is now cheap, what was once slow is now fast. The amount of work hasn’t been reduced, but the difficulty of performing it has. When used in a KDF (key derivation function) for something like disk or file encryption, attacks against hash functions are even less relevant. The attacker knows neither the input password nor the generated key.
Now, can bcrypt (or scrypt) even have collisions? Of course. Any function which reduces a large input space to a small output space must have collisions. However, bcrypt works a little differently internally. The password is actually used as a key for the blowfish cipher to encrypt a magic plaintext. The resulting ciphertext is the hash. Attacking this is directly equivalent to a known plaintext attack against the blowfish cipher. So look out for those, but don’t worry about hash collisions.
For completeness, blockchain blockchain blockchain. Collision attacks are relevant, because as in the contract example above, a smart contract may be replaced with a different version. Collision attacks are irrelevant, because as in the password example, shortcutting the proof of work requires a preimage attack.