Hacker News new | ask | show | jobs
by ankrgyl 3193 days ago
You're right, but I don't think the use case you mentioned (as a competitor to B+ trees for direct lookups) is the target use case for a hash index.

A hash index can be a big win for nested loop joins, especially at high concurrency. It is quite common to build a hash table over a subset of the inner table of an equijoin. This is (a) slow to construct and (b) memory-intensive (especially if many of these queries are run concurrently).

With a hash index, a lot of cases that required building a hash table to speed up a query can just use the hash index directly. Furthermore, every concurrent instance of the query can use the same hash index. This is a big win for both performance of a single query (latency) and query scalability.

3 comments

That's a hash join which has existed in Postgres for many years. Hash indexes are unrelated.

The primary use case of hash indexes is situations where the indexes fields are very large. The hash index uses a fixed size regardless of the width in bytes of the key so it "wins" storage wise when the key is wide.

> That's a hash join which has existed in Postgres for many years. Hash indexes are unrelated.

I think the OP ankrgyl is aware of that. To quote:

>> It is quite common to build a hash table over a subset of the inner table of an equijoin. This is (a) slow to construct and (b) memory-intensive (especially if many of these queries are run concurrently).

Ah I think I misread that section with the "nested loop join" stuck in my mind.
I think the win for that kind of scenario is smaller than you make it out to be. I agree on memory usage, but the concurrency argument seems largely faulty because you at the same time increase page-level contention on a shared resource (postgres' buffer cache).

I personally want a lazily filled hash-join / something like a index nested loop join that has a fixed size cache for both positive and negative lookups.

> I don't think the use case you mentioned (as a competitor to B+ trees for direct lookups) is the target use case for a hash index.

Exactly – my comment is directed at those who think they might be, because O(1) sounds way better than O(log(n)). When really, except for data sets in the 10s of billions of rows, there are more subtle reasons to use them (as you point out).