Hacker News new | ask | show | jobs
by mrigor 3454 days ago
https://en.wikipedia.org/wiki/SipHash
1 comments

As someone who's not at all versed in cryptography, (yet) anyone want to put this in more laymen terms? Maybe compare it to an existing cryptographical-method I might understand or at least know of?
Hash tables are very important data structures in the computer science world. Hash tables allow you to have amortized O(1) access cost to arbitrary elements - like key/value (often called: map, dict).

In order to implement hash table one need to map key onto an index in the table - this is done with a hash function: hash(string) ---> number.

Here's a problem. If an attacker knows the hash function, she can produce many strings that will give the same number in return. This usually wasn't a problem, but in the web world it is. It is possible to flood the server (usually in python, ruby, perl) with such crafted requests that, for example, all headers will end up with precisely the same hash value: hash(any_given_header_in_request) ---> fixed value.

This is will result in hash table collision and is generally bad. Normal hash functions can't solve this. This problem of maliciously creating hash collisions is called "hash flooding".

Siphash is an attempt to solve the problem. It is more than a hash function - it's a crypto PRF function and that gives you more guarantees than dumb hash function. Most importantly it takes two values: a "string to hash" and a "crypto key": siphash(string, crypto_key) --> number.

The idea is to generate this "crypto_key" randomly on each program execution, to make sure the attacker can't predict it.

Crypto speaking hash functions may be reversible. There is nothing guaranteeing that they are not. But Siphash is a PRF, and in crypto-speach this means it's not reversible. If you can produce an efficient algorithm to reverse Siphash - ie: given crypto key and hash value predict input string - you can write a good paper and be famous.

> Here's a problem. If an attacker knows the hash function, she can produce many strings that will give the same number in return.

I may be misunderstanding you, but isn't the point of a (good) cryptographic hash function that you cannot produce the multiple plaintext which will give the same value, despite knowing the hash function?

In practice you take X lower bits of the hash value. For example 10 bits if the hash table size is 1024. It's trivial to find many input stings which hash(string) % 1024 == fixed value, for any, even best cryptographic hash function.
> Siphash is an attempt to solve the problem. It is more than a hash function - it's a crypto PRF function and that gives you more guarantees than dumb hash function. Most importantly it takes two values: a "string to hash" and a "crypto key": siphash(string, crypto_key) --> number.

That sounds suspiciously like an HMAC.

The difference is in the detail.

An HMAC gives you something very nice, but it's of its choosing. If you want a 256-entry cache, the security properties of a 256-bit HMAC result is not a good fit for the needs of a cache key.

If you want an infinitely big cache, an HMAC gives you zero collisions, but in this case what you want is an 8-bit result that spreads nicely over the 256 values, doesn't let an attacker fill one bucket, and doesn't let an attacker learn anything about the other users and their cache entries.

It still sounds like an HMAC albeit with a 64-bit output rather than a larger (say 256-bit) output.

From reading up on it a bit more, it is an MAC[1]. The difference between it and a generic HMAC construction (say with HMAC-SHA256) is that it's intended to specifically be an MAC which leads to more efficiency.

> An HMAC gives you something very nice, but it's of its choosing. If you want a 256-entry cache, the security properties of a 256-bit HMAC result is not a good fit for the needs of a cache key.

It'd be a fine fit from a cryptographic perspective. It's just slow relative to something like this which more catered to the specific problem.

> If you want an infinitely big cache, an HMAC gives you zero collisions, but in this case what you want is an 8-bit result that spreads nicely over the 256 values, doesn't let an attacker fill one bucket, and doesn't let an attacker learn anything about the other users and their cache entries.

The result of SipHash is 8-bytes (not bits). You'd still have to truncate or consolidate the bits to reduce it to 1-byte (8-bits).

[1]: https://en.wikipedia.org/wiki/SipHash

> 8-bit result that spreads nicely over the 256 values, doesn't let an attacker fill one bucket, and doesn't let an attacker learn anything about the other users and their cache entries.

Willing to be educated, but won't I get that by just masking off 8 bits off of say SHA-x?

[edit: h(s + secret-bits) & 0xff ]

Chopped HMAC-SHA will have similar properties to SipHash, however it is a lot slower.
People do that. I've done it myself. But all the math about SHA's security properties is about absence of collisions in the full result, not about relative frequency in a small part of the result.

Maybe SHA-8 works fine. I don't know. I've never seen any real mathematical investigation of that, and that's the point.

it is a lot like an HMAC. The advantage of something like SipHash is just that it's faster for the same security, not that it's fundamentally different from HMAC.
I've been looking at using this recently, and had a stupid question about using it -

Is it ok to use the same random key with siphash for lots of different hashes as long as the key is secret and mutates once per launch (i.e. generating once on app startup and use it for all hashes)?

The danger in your scheme is that an attacker who finds a way to see lots of keyed hash values in one hash table can flood a different hash table. You should keep a 16 byte key per hash table.
Thanks, I'll only have one set of hashes. My question was really is it ok to use one key per table of hashes, or does it have to be one key per hash. I think you're saying one per table is fine, but wanted to be sure I wasn't doing it wrong. I'll have another look for examples.
It's one key per table.
ELI5 (distilled from wikipedia page linked by GP, forgive me any errors):

Web servers use hash tables for storing per-request data. If an attacker knows the hash function (say, SHA1), they can create a few hundred requests that yield the same hash, giving hundreds of hash collisions and creating a Denial-of-Service attack with the same effect as millions of ordinary requests. It's a form of DoS amplification.

A keyed hash function fixes this by keeping part of the hash algorithm (the key) secret. You can turn e.g. SHA1 into a keyed hash function by e.g. HMAC, but that's computationally expensive. SipHash, being a "natively keyed" hash function, is much faster.

Except if the hash is SHA1 (or any other non-compromised cryptographic hash) they would have a very difficult time creating requests that yield the same hash, that aren't the same request, right?
I'm not sure, but I think if you're just interested in creating N collisions, as opposed to finding something that collides with given plaintext X, that the birthday paradox gives you a huge performance increase and makes it feasible. Also, I think many still use MD5 in applications where SipHash is intended (at least the Linux kernel does).