| From article "Imagine that we run this query over a few hundred million rows. This means at least a few hundred million hash table operations. As you might imagine, a slow hash table would make for a slower query. A faster hash table? Faster queries!" I'll read the article properly after this, this is just a quick skim, but I can't see this quote can be correct. Unless I'm missing something, hashing function is fast compared to random bouncing around inside ram – very much faster then random memory accesses. So I can't see how it make a difference. Okay, I'll read the article now… Edit: "If you insert "John" and then "Jane" string keys into a FastMap, then that would later become the iteration order. While it doesn't sound like a big deal for most applications, this guarantee is important in the database world. If the underlying table data or index-based access returns sorted data, then we may want to keep the order to avoid having to sort the result set. This is helpful in case of a query with an ORDER BY clause. Performance-wise, direct iteration over the heap is also beneficial as it means sequential memory access." but "...if the underlying table data or index-based access returns sorted data..." Then you've got sorted data, in which case use a merge join instead of a hash join surely. |
In a GROUP BY, you may have a few hundred million rows, but only a few hundred groups within them. A slow function would slow down things dramatically in that case since the hash table remain small and data access is potentially linear.
> Then you've got sorted data, in which case use a merge join instead of a hash join surely.
This property is beneficial for GROUP BY which includes a timestamp or a function over timestamp. QuestDB organizes data sorted by time, so relying on insertion order may help to avoid redundant sorting if there is an ORDER BY clause with the timestamp column.
As for merge join, we also use it in ASOF join: https://questdb.io/docs/reference/sql/join/#asof-join