by zaroth 3178 days ago
Another form of "crypto anchor" is Blind Hashing which uses a large pool of random data to defend the hashes. An attacker would need to exfiltrate over 90% of the data before they could run an offline attack on hashes blinded by the data pool. The bigger the data pool, the more data an attacker would have to steal, and the more hashes/sec you can run.

So while iterative/computational hashing is only secure if it is slow and if the password is strong, Blind Hashing prevents offline attacks even against weak passwords and actually runs faster as you increase the cost factor.
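
A rough way to see why stealing part of the pool does not help much, assuming the 64 uniformly distributed reads per hash described further down this thread. The function name and the independence model are illustrative assumptions, not taken from the spec: a guess can only be checked offline if every one of its reads lands in the stolen portion.

```python
def checkable_fraction(stolen_fraction: float, reads_per_hash: int = 64) -> float:
    """Chance that all reads for one password guess fall inside the stolen data.

    Assumes reads are independent and uniform over the pool (illustrative model).
    """
    if not 0.0 <= stolen_fraction <= 1.0:
        raise ValueError("stolen_fraction must be between 0 and 1")
    return stolen_fraction ** reads_per_hash


if __name__ == "__main__":
    for f in (0.50, 0.90, 0.99):
        print(f"stolen {f:.0%}: {checkable_fraction(f):.3e} of guesses checkable")
```

Under this model, an attacker holding 90% of the pool can fully evaluate only about 0.1% of guesses, which is consistent with the "over 90%" claim above.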

In this case it's more like an actual anchor -- technically we call this the Bounded Retrieval Model -- the idea being that we size the network bandwidth so that it would take 300 days at full line rate to steal the data over the network. So it's a physical limitation rather than trusting a black box like an HSM to protect 256 bits.
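
The bandwidth sizing is simple arithmetic. Here is a minimal sketch; pairing the 300-day figure with the 16TB pool size mentioned elsewhere in this post is my assumption, and the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def days_to_exfiltrate(pool_bytes: float, line_rate_bits_per_s: float) -> float:
    """Days needed to copy the whole pool at a sustained line rate."""
    return pool_bytes * 8 / line_rate_bits_per_s / SECONDS_PER_DAY


def line_rate_for(pool_bytes: float, target_days: float) -> float:
    """Line rate (bits/s) at which copying the pool takes target_days."""
    return pool_bytes * 8 / (target_days * SECONDS_PER_DAY)


if __name__ == "__main__":
    pool = 16e12  # 16 TB, decimal units (assumption)
    rate = line_rate_for(pool, 300)
    print(f"line rate for 300 days: {rate / 1e6:.2f} Mbit/s")
    print(f"check: {days_to_exfiltrate(pool, rate):.1f} days")
```

For a 16 TB pool and a 300-day target this works out to a line rate of roughly 4.9 Mbit/s.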

If you're interested here's an intro [0], a tech spec [1], and an academic paper [2] by Moses Liskov at MITRE.

Disclaimer: I'm Founder/CTO of BlindHash.com which is basically Data Pool as a Service -- we provide an API into a geo-replicated 16TB (and growing) data pool.

[0] - https://s3.amazonaws.com/blindhash/BlindHash+Architecture+Gu...

[1] - https://docs.wixstatic.com/ugd/005c1c_5996c661899e4d09a28b9a...

[2] - https://eprint.iacr.org/2017/917.pdf


This looks like a pretty good technique, and that's coming from someone who has collected 240GB+ of user:password dumps.

I certainly wouldn't get 16TB of disks just for that if it were ever leaked.

Bummer (not for me :p) that you guys went the route of patenting it and keeping it proprietary & only available through an API.

I think it would be adopted in no time if it were open source, and I'd definitely like to see something like this available as a service on clouds like GCP/AWS/Azure/etc for my day job.

Thanks for the kind words and feedback.

The approach has an economy of scale where a shared pool can secure many sites' hashes at very low cost to individual sites, but where the sum-total can fund a very large data pool. I would love to grow this to 1PB and beyond. The idea behind the patent is to give us a chance to try to grow exactly that service.

Fundamentally the technique is quite simple and easy to copy, yet IMO it is better than computational/iterative hashing in every way -- cost, performance, scalability, and security. It seemed to me a perfect example of something worth patenting. If we're ultimately not successful in commercializing it, I would want to relinquish the patent to the public domain.

The most important part -- and what's kept me working at this for years now -- is that it protects even weak passwords after a company is breached. It takes the onus (and a lot of the blame) off the end user, and solves the usability problem with passwords.

By the way, the same technique works equally well for adding BlindHash to your KDF used to decrypt your SSH key, or your laptop or your TrueCrypt volume. We can also add additional checks when running the BlindHash call for a given AppID to enforce things like;

1. Must first reply to an SMS or enter a TOTP code

2. Request must come from a certain IP range or during certain hours

3. Request only valid after date X (time lock)

So this can be used to shore up password-based encryption as well in some very interesting ways.
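
To make the per-AppID checks concrete, here is a hypothetical sketch of how such a policy might be evaluated before a lookup is served. `AppPolicy`, `request_allowed`, and every field name are invented for illustration and are not part of the BlindHash API; every configured check is treated as mandatory, which is also an assumption.

```python
import ipaddress
from dataclasses import dataclass
from datetime import datetime, time, timezone
from typing import Optional


@dataclass
class AppPolicy:
    """Hypothetical per-AppID policy; None means the check is disabled."""
    allowed_network: Optional[ipaddress.IPv4Network] = None
    allowed_hours: Optional[tuple] = None   # (start, end) as datetime.time, UTC
    not_before: Optional[datetime] = None   # time lock
    require_second_factor: bool = False     # SMS reply or TOTP code


def request_allowed(policy: AppPolicy, client_ip: str, now: datetime,
                    second_factor_ok: bool = False) -> bool:
    """Return True only if every configured check passes."""
    if policy.require_second_factor and not second_factor_ok:
        return False
    if (policy.allowed_network is not None
            and ipaddress.ip_address(client_ip) not in policy.allowed_network):
        return False
    if policy.allowed_hours is not None:
        start, end = policy.allowed_hours
        if not (start <= now.time() <= end):
            return False
    if policy.not_before is not None and now < policy.not_before:
        return False
    return True


if __name__ == "__main__":
    policy = AppPolicy(
        allowed_network=ipaddress.ip_network("10.0.0.0/8"),
        allowed_hours=(time(8, 0), time(18, 0)),
        not_before=datetime(2030, 1, 1, tzinfo=timezone.utc),
    )
    print(request_allowed(policy, "10.1.2.3",
                          datetime(2030, 6, 1, 12, 0, tzinfo=timezone.utc)))
```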

This can't be too hard to build:

1. Generate 16TB of random data, backup/replicate many times

2. Think of data as 16 billion 1k pieces

3. Generate 64 random piece addresses using hashA(key) as seed

4. Concatenate the 64 pieces into one 64k chunk, and store hashB(chunk)
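
The four steps above can be sketched at toy scale as follows. The choice of SHA-256 for hashA and hashB, the domain-separation prefixes, and SHAKE-256 as the address expander are my assumptions; the pool here is a few hundred pieces rather than 16 billion.

```python
import hashlib
import os

PIECE_SIZE = 1024       # "1k pieces"
PIECES_PER_HASH = 64    # 64 random piece addresses


def make_pool(num_pieces: int) -> bytes:
    """Step 1 (toy scale): random data, viewed as fixed-size pieces (step 2)."""
    return os.urandom(num_pieces * PIECE_SIZE)


def piece_addresses(key: bytes, num_pieces: int) -> list:
    """Step 3: expand hashA(key) into 64 piece indices (slight modulo bias)."""
    seed = hashlib.sha256(b"A" + key).digest()
    stream = hashlib.shake_256(seed).digest(PIECES_PER_HASH * 8)
    return [int.from_bytes(stream[i:i + 8], "big") % num_pieces
            for i in range(0, len(stream), 8)]


def pool_hash(pool: bytes, key: bytes) -> bytes:
    """Step 4: concatenate the 64 pieces into one 64k chunk and return hashB(chunk)."""
    num_pieces = len(pool) // PIECE_SIZE
    chunk = b"".join(pool[a * PIECE_SIZE:(a + 1) * PIECE_SIZE]
                     for a in piece_addresses(key, num_pieces))
    return hashlib.sha256(b"B" + chunk).digest()


if __name__ == "__main__":
    pool = make_pool(256)
    print(pool_hash(pool, b"example key").hex())
```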

Our algorithm is close to that;

1. We don't partition it into fixed size blocks, but rather index directly into the array

2. The site calculates a salted hash and sends us just the hash. We recommend at least a 32 byte CS-PRNG salt

3. We HMAC the hash with a 64-byte site-specific token (AppID) to produce the seed

4. We generate 64 uniformly distributed locations from the seed and perform 64 reads of 64 bytes each to form a 4096 byte buffer which we HMAC with the AppID to produce a second salt.

5. The site uses this second salt to HMAC their original hash, and store that.

This design allows multiple sites to securely share a single data pool and also means that our service a) does not see usernames or passwords, b) does not know if a login is valid/invalid, c) cannot do anything to make an invalid login look valid to the site.
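
A minimal sketch of steps 2 to 5, split into site-side and service-side functions to show what each party sees. Assumptions not stated in the thread: SHA-256 everywhere (a real site would likely use a proper password hash for step 2), AppID as the HMAC key in steps 3 and 4, the second salt as the HMAC key in step 5, and SHAKE-256 to expand the seed into offsets. All function names are illustrative; the tech spec [1] is the authority on the real construction.

```python
import hashlib
import hmac
import os

READS = 64
READ_SIZE = 64  # 64 reads x 64 bytes = 4096-byte buffer


def site_hash(password: bytes, salt: bytes) -> bytes:
    """Step 2 (site): salted hash; this is all the service ever receives."""
    return hashlib.sha256(salt + password).digest()


def service_blind(pool: bytes, app_id: bytes, hash1: bytes) -> bytes:
    """Steps 3-4 (service): derive the second salt from the data pool."""
    seed = hmac.new(app_id, hash1, hashlib.sha256).digest()
    stream = hashlib.shake_256(seed).digest(READS * 8)
    span = len(pool) - READ_SIZE + 1  # index directly into the array, no blocks
    buf = bytearray()
    for i in range(0, len(stream), 8):
        offset = int.from_bytes(stream[i:i + 8], "big") % span
        buf += pool[offset:offset + READ_SIZE]
    return hmac.new(app_id, bytes(buf), hashlib.sha256).digest()


def site_final(hash1: bytes, salt2: bytes) -> bytes:
    """Step 5 (site): the value that gets stored."""
    return hmac.new(salt2, hash1, hashlib.sha256).digest()


if __name__ == "__main__":
    pool = os.urandom(1 << 20)   # 1 MiB stand-in for the real pool
    app_id = os.urandom(64)      # 64-byte site-specific token
    salt = os.urandom(32)        # 32-byte CS-PRNG salt
    h1 = site_hash(b"correct horse", salt)
    stored = site_final(h1, service_blind(pool, app_id, h1))
    attempt = site_hash(b"correct horse", salt)
    check = site_final(attempt, service_blind(pool, app_id, attempt))
    print(hmac.compare_digest(stored, check))
```

Note that the comparison happens only on the site side, which matches the point that the service never learns whether a login was valid.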

There are some additional details to handle upgrading hashes as the data pool grows, and also to provide virtual private data pools for each site (so I can give you a copy of your data pool if you ever want to self-host). This is all detailed in [1] above.

That is an excellent idea. But why 16TB of random data? Why not encrypt some high entropy value (digits of pi, whatever) with a 100 character password and generate 16TB like that? You then use the 16TB as a password but you could regenerate and recover using a scrap of paper.
You can do either. But if you generate the data pool from a seed that you retain, then you're back to trying to protect a 256-bit value from leaking.

Generating the data pool with constantly cycled and discarded keys (i.e. /dev/urandom) means the only way to have the pool is to go and get every single bit of it.

We went the second route because I like sleeping at night and it just felt like retaining a seed would defeat the whole purpose of bounded retrieval.
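
The two options can be contrasted in a few lines. This is only a toy-scale illustration: SHAKE-256 stands in for the "encrypt a high entropy value with a password" idea, and the function names are made up.

```python
import hashlib
import os


def pool_from_seed(seed: bytes, size: int) -> bytes:
    """Deterministic pool: whoever holds `seed` can rebuild every byte."""
    return hashlib.shake_256(seed).digest(size)


def pool_from_urandom(size: int) -> bytes:
    """Pool drawn from the OS CSPRNG: no short secret reproduces it."""
    return os.urandom(size)


if __name__ == "__main__":
    seed = os.urandom(32)  # the 256-bit value that would have to be protected
    print(pool_from_seed(seed, 4096) == pool_from_seed(seed, 4096))  # reproducible
    print(pool_from_urandom(4096) == pool_from_urandom(4096))        # not reproducible
```

With the seeded variant, leaking 32 bytes is equivalent to leaking the whole pool, which is the trade-off being argued here.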

Sure, but that's a 256-bit value that does not have to be present at the use point. So it's a lightweight anchor! It's extremely heavy when someone else tries to move it, and yet when you move it yourself, it easily fits in your wallet on the tiniest of SD cards, or even on a scrap of paper.
How about this? Take the old Blowfish block encryption algorithm and eliminate the key expansion and expand it so that the s-boxes and p-array take up 16TB of data? What you'd wind up with is a block cipher that has a 16TB key. Since Blowfish is clearly "prior art," and is unencumbered by patents, this might make this approach harder to attack using patent law.
I don't think it's a bummer that they patented it. In October of 2037 (assuming they received it today), it will be available for the whole world to use. Until then, it will still be available for the whole world to use, just for a small licensing fee. In the meantime alternatives can also be developed.

This technique could have been invented and promoted starting in 1997 (20 years ago) but only through the protectionism of the patent regime do you have this beautiful write-up and promotion of it by researchers pushing it forward: it's the patent regime working in action.

It works EVEN WITH WEAK PASSWORDS. That is pretty amazing if you ask me.

I am glad they patented it and are promoting it.

"But wait, it's so simple".

Let me give you an example of a $684.23B company that you've heard of that is making a mistake in security that even a small child could detect and correct, but for which there is no proprietary solution in the space pushing them forward.

The company is Google and their silly security mistake is that when I give out "jsmith543+weeklytechupdate@gmail.com" where my true address is jsmith543@gmail.com, and I'm signing up for the Weekly Tech Update newsletter but I'm afraid they could start spamming me, or sell my address for any number of third parties to start spamming me, then this allows the creation of a gmail inbox that tags the incoming mail with "weeklytechupdate". Pretty clever. Only the issue is that it is possible to strip the +____ and spammers actually do that. Here are examples of HN people saying they actually do that: https://news.ycombinator.com/item?id=15396446

>I’ve run a fair amount of email campaigns where we strip out the + if gmail is the domain to ensure it doesn’t end up in some weird filter.

The solution is extremely simple. Allow me to specify a key-value pair from the GMail interface that generates a high-entropy key, and pairs it to a value I choose. Deliver all mail addressed to that key to my inbox, tagged with the value I chose, until I start marking it as spam. Very easy. Example: I go to gmail, I click "generate rescindable read address", I am given affj3fjd and I assign it "weeklytechupdate". I see that affj3fjd@gmail.com gets assigned to weeklytechupdate and if I need to give my email address to that web site in the future I can always look it up in some list. Easy. Gmail doesn't do it, and its spam solution is broken.

The only thing is: nobody has come up with something clever enough to patent in this space, and then promote the @#$# out of. If they had, I could give my email addresses out in confidence to whoever I want.

Actually I made a full gmail email address dedicated only for spam. The problem is I can NEVER read the stuff that goes there as I just don't even look. I just looked. The last piece of spam that I got delivered to it occurred 7 days ago. There are just 2 pieces of mail in my inbox.

That means Google's spam filter is very, very, very good. Wait, what? So good that it silently filters spam that I expect to get, that I explicitly give out my email address for? (Okay, I just looked, and there are 2 messages from 4 days ago - nothing more recent - in the "promotions" tab).

No. It's not what it means. It means that some of these sites I give my address out to aren't able to email me at all. They're just not getting through, because GMail's spam filters are too draconian.

When I give out "jsmith543+weeklytechupdate@gmail.com" I expect ALL of the mail sent to there to go through - not to be caught by the spam filter. Instead, presumably what happens is gmail throws away most mail that isn't sent to an individual by an individual.

Sorry to rant on this aside, I just wanted to show, in action, the difference between a patented solution that a company promotes, versus an EASY solution that would WORK, that GMail doesn't do. It actively does something broken. Nobody has come up with and promotes some fancy solution that works, so instead they don't use the weak solution that works; they use nothing, only a broken non-working security through obscurity solution that you can see HN'ers actively strip out in order to be able to spam effectively.

And this is Google. So this is a question as clear as day for why I don't mind patented novel algorithms with companies behind them licensing and promoting them. I kind of mind when it's a race to the patent office with new technology, but grandparent poster's technique is one that could have been done in 1997 so I don't really buy that excuse. I like that they're patenting it and promoting it. It's a good way to get companies to use better solutions. Companies just don't do it by themselves, as my Google example shows.

>Until then, it will still be available for the whole world to use, just for a small licensing fee

They don't appear to be selling right to use licenses. Most of the text on their site suggests a cloud based service, which I suspect will be usage based.

All that to say it is perhaps too soon to judge the end user cost as small. Maybe it will be, maybe not.

But my point in this case is that if they hadn't patented it and been pushing it we wouldn't even be talking about this. It promotes it OR alternatives.

The impact on consumers is positive even if they only get meager access for 20 years. (For example the patent owner could just be bad at economics and set their price too high, thinking they would get more profit than via wide adoption: they might not set it at the monopolist's profit-maximizing price point.)

Even so, everyone gets it after a while (20 years.)

Simply filter all email to a non-plus address to spam, and then only give out random oh_sigh+aslkdfjslkdjf@gmail.com addresses. Now, if a spammer strips it off, they just get put directly into your spam box. Where stupid regexes don't like the plus, you have the . allowance for gmail, where foobar@gmail, f.oobar@, f.o.obar@, foob.a.r@, etc all get routed to the first address. gmail lets you have up to 30 character user names, and a 30-character name has 29 gaps that can each hold a dot, so you can encode 2^29 = ~537M unique emails into that. But those sites are very rare.
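
The dot trick can be enumerated directly. A small sketch, assuming at most one dot between any two adjacent characters (the helper names are illustrative): a local part of n characters has n - 1 gaps, so 2^(n - 1) variants.

```python
from itertools import product


def dot_variants(local_part: str):
    """Yield every way of placing single dots between adjacent characters."""
    if not local_part:
        raise ValueError("local_part must not be empty")
    first, rest = local_part[0], local_part[1:]
    for flags in product((False, True), repeat=len(rest)):
        yield first + "".join(("." if dot else "") + ch
                              for dot, ch in zip(flags, rest))


def variant_count(local_part: str) -> int:
    """Number of variants: one on/off choice per gap between characters."""
    return 2 ** (len(local_part) - 1)


if __name__ == "__main__":
    variants = list(dot_variants("foobar"))
    print(len(variants), variants[:4])
```
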
1. Out of curiosity, do you actually do this? (The first part you propose.)

2. As a theoretical solution it is a bit weaker than the "simple" solution I think Google should obviously do, because under your proposal different spammers can coordinate, invalidating your privacy. (You didn't tell two different unrelated sites that you're the same person, but actually you are, which they could build into a targeted profile if they coordinate or, for example, are owned by the same parent company.) Granted this is a theoretical concern but it is there.

Not exactly - a little more complex actually.

I have two emails: super_private@gmail.com which is only handed out to people I know in real life...I've had this one since 2004 and I still get zero unwanted emails on that address. Then, I have another address, super_public@gmail.com which mass forwards all mail to my super_private email, which then filters it according to the rules I've set up.

The reason I have the extra layer of indirection is because it wouldn't be very user friendly to force someone you know to email you with a plus sign and then some junk. This way I can give a 'normal' email address to normal people, and my filtering email address to auto signups and things like that.

2. You're right - I guess I'm not too worried about a profile being built for me, but this definitely would not handle that issue. I also use anonymous remailers like getnada.com if I am signing up for something which I think is particularly embarrassing if it gets out, but that is rare.

thanks. I also have set up forwarding on some gmails. it's a bit of a pain.
You cannot patent the use of a really long salt. That's like patenting hashing of any string longer than 2000 chars. It's a trivial operation. They may think they have a patent but I trust it not to hold up. Go build your own 12 TB pool of data to use for salting hashes. I trust they'll never find out or have any ground to sue you if they do.
Their patent is meaningless. A patent for using 16TB of data as a salt is trivial.

password + salt + password or salt + password + salt are known and trivial patterns in hashing. Unpatentable and even if a patent was somehow gotten, unenforceable.

If your salt is 5 characters it can certainly be 500000000 characters instead without the patent overlords having any slimy grounds to indemn you

A similar technique can be used in embedded systems to enhance the speed that password derived keys become unrecoverable from memory after power off: instead of storing the key directly, store it as a value that must be xored with a hash of, say, 4k of random values that are only stored in memory. Then your key is fully unrecoverable after any 256 bits of the 4k bytes have decayed as long as the RNG used to generate the random bytes is suitable and the executed code (including the OS if there is one) is verified to not store temporary values that could be recovered.
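
The blinding arithmetic looks like this. The sketch below only shows the XOR-with-hash step, assuming a 32-byte key and SHA-256; Python gives no control over where values live in memory, so a real embedded implementation would be written very differently. Function names are illustrative.

```python
import hashlib
import os

PAD_SIZE = 4096  # "4k of random values" that live only in volatile memory


def blind_key(key: bytes):
    """Return (blinded_key, pad); the key is recoverable only with the intact pad."""
    if len(key) != 32:
        raise ValueError("this sketch assumes a 32-byte key")
    pad = os.urandom(PAD_SIZE)
    mask = hashlib.sha256(pad).digest()
    blinded = bytes(k ^ m for k, m in zip(key, mask))
    return blinded, pad


def unblind_key(blinded: bytes, pad: bytes) -> bytes:
    """Recompute the mask from the pad and XOR it back out."""
    mask = hashlib.sha256(pad).digest()
    return bytes(b ^ m for b, m in zip(blinded, mask))


if __name__ == "__main__":
    key = os.urandom(32)
    blinded, pad = blind_key(key)
    print(unblind_key(blinded, pad) == key)      # intact pad recovers the key
    decayed = bytearray(pad)
    decayed[0] ^= 0x01                            # simulate one decayed bit
    print(unblind_key(blinded, bytes(decayed)) == key)
```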

For password authentication, IMO a much better solution is to generate strong random passwords (21 character base64) for users and tell them to write them down and/or use a password manager (I think web browser based storage of generated passwords can be done without the user needing to see the password at all). You can still memorize a small number of those over a few weeks if necessary and there is no good reason to memorize a bunch of passwords.
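
Generating such a password takes a few lines with Python's `secrets` module. A minimal sketch, assuming the URL-safe base64 alphabet (64 symbols, so 6 bits per character and 126 bits for 21 characters); `generate_password` is an illustrative name.

```python
import secrets
import string

# 64 symbols: the URL-safe base64 alphabet (an assumption; any 64-symbol set works)
ALPHABET = string.ascii_letters + string.digits + "-_"


def generate_password(length: int = 21) -> str:
    """Pick each character independently from a CSPRNG (6 bits per character)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(generate_password())
```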

Even though I adore the concept (remember the original posts by J.Spilman in 2012 and kept rolling it in my head for a while), this introduces a new remote SPOF for the authentication process, doesn't it?
Very flattering that you remember :-) It's still me.

One nice thing about the design is that since the data pool isn't actually storing hashes, it doesn't change over time (except when you want to grow it), so it's easy to have multiple data centers that operate completely independently.

Different copies of the data pool, different networks, different DNS, etc. The client library will retry/fail-over between data centers. So while yes, you do have to make a successful API call, it's not a SPOF.

It's very easy to replicate / add redundancy when there's no active sync required between sites. The only inter-site communication we have currently is when new accounts are created, to distribute the AppID, and to aggregate usage stats, which is batched and, when it fails, will just pick up where it left off once the network is back up.
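
The retry/fail-over behaviour described for the client library could look roughly like this. This is a hypothetical sketch, not the actual client: `call_with_failover`, the `send` callable, and the endpoint names are all invented, and the transport is injected so the example does not depend on any real network API.

```python
from typing import Callable, Sequence


class AllDataCentersFailed(Exception):
    """Raised when every endpoint has been tried without success."""


def call_with_failover(endpoints: Sequence[str],
                       send: Callable[[str, bytes], bytes],
                       payload: bytes,
                       attempts_per_endpoint: int = 2) -> bytes:
    """Try each independent data center in order, retrying before moving on."""
    errors = []
    for endpoint in endpoints:
        for _ in range(attempts_per_endpoint):
            try:
                return send(endpoint, payload)
            except OSError as exc:  # includes ConnectionError and timeouts
                errors.append((endpoint, exc))
    raise AllDataCentersFailed(errors)


if __name__ == "__main__":
    def fake_send(endpoint: str, payload: bytes) -> bytes:
        # Stand-in transport: the first data center is "down".
        if endpoint == "dc1.example":
            raise ConnectionError("unreachable")
        return b"salt2-for-" + payload

    print(call_with_failover(["dc1.example", "dc2.example"], fake_send, b"hash"))
```

Because every data center holds the same static pool, any endpoint that answers returns the same second salt, so the order in which they are tried does not matter for correctness.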