How efficient is it now? The last time I checked, FHE required minutes of computation and gigabytes of memory to store tiny amounts of data, and since it does not hold IND-CCA, I could not find any use cases.
Very inefficient. Like wildly so. Specifically if you have a very small database and you preprocess it with their techniques, the resulting database is petabytes in size. But the results are very beautiful.
There are no obvious ways to improve on this work, so it is not a matter of engineering. We really do need another breakthrough result to get us closer to practicality.
Pretty efficient! E.g. a recent paper describes a system to do fully private search over the common crawl (360 million web pages) with an end to end latency of 2.7 seconds: https://dl.acm.org/doi/10.1145/3600006.3613134
It used 145 core-seconds of compute, according to the paper. That's not 100% unless they were mostly 1 core servers.
However, the search used almost 50MiB of traffic before the query was even sent by the client. I imagine that will add up fast.
You’re right, I did my math with 1 core assumption. Still, it’s 800ms of 100% utilization of all 4 vCPUs across the entire cluster. That’s 40% capacity if the 2s latency was purely CPU bound and the client was next to the server, but we know the initial phase involves some round trips to communicate that 50mib so the 40% is a floor. My point remains that this is several orders of magnitude off in terms of cost still (as you mention network bandwidth required is off by similar orders of magnitude as cpu).
Many many orders of magnitude. It really depends on the search algorithm. If you're looking up a single term, there are many standard ways to index this, but for instance using a trie or hash table, the lookup cost would be a number of operations proportional to the search term length (e.g. for "foo", it's a of length 3, times some small number of operations for each letter). Plus then if you want to combine the results and sort by frequency, to estimate the cost of this, add up the number of pages from each term, p_1 + p_2 + ... = P. Then the number of operations might be P*log(P). This dominates the cost. So if your search terms appear in 1,000,000 of these pages, you're talking about on the order of ~6,000,000 small operations. An implementation would be able to execute this many operations on a single CPU in 10s or 100s of milliseconds maybe, for an elementary unoptimized algorithm.
The benchmarks listed there are approximately 100 times slower than the original Intel 8088 microprocessor released 1979 on which the original IBM PC was based.
That microprocessor was efficient enough for many applications of general purpose computing, but we still need Moore's law to give us a 100-fold increase in compute power to reach this level.
This level of performance seems comparable to (and IMHO slower than) the very first electronic stored-program computer, the 1948 Manchester Baby.
In other words, this can be compared to do the computation manually, but with the guarantee that whoever is performing the computation cannot glean anything from it.
This can be OK for some relatively rare and important computations, which you for some reason must run in a software-untrusted environment. I can imagine using it for some high-stakes, small-scale automated voting. All the intermediate results may be posted in the encrypted form as an audit trail. After the process is completed and results are declared, the encryption keys are released, and any party can check that every step of the computation determining the winner was correct, there was no stuffing.
Not an answer, but a question that I hope someone can answer. Is the lack of speed because of a lack of optimization or compute? Is this something that could be fixed by an accelerator?
It's often hard to compare things in an accurate way. I mean many current methods might already have hardware specific acceleration and researchers aren't always good at optimization (why should they be?). So is the problem of FHE a bigger lack of just not enough people (specialists) putting in time to optimize it or is it so computationally intensive that we need hardware to get better before it becomes practical? I mean look at llama.cpp and how much faster it is or stable diffusion? Sure, both kinda slow (diffusion's no GAN and SD's not straight diffusion) but very usable for many applications.
Based on my recollection of a conversation with the authors after their STOC talk: the RAM scheme is not efficient enough to be competitive with circuit-based FHE schemes; for problems with low RAM usage, existing circuit-based methods are more efficient, and problems with higher RAM usage are infeasible to run on any FHE method right now.
They were 50/50 on whether or not this technique could be made feasible in the same way that, say, the CKKS scheme is.
There are no obvious ways to improve on this work, so it is not a matter of engineering. We really do need another breakthrough result to get us closer to practicality.