| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by macobo 1546 days ago

In practice product requirements can get in the way of technically ideal solutions. One example of this is that analytics products allow users to pass in and analyze arbitrary number of user properties to do analysis on - more than even a columnar database can handle. The current solution of storing JSON as a column indeed has a very significant performance trade-off, but it's also needed to power queries users need to run. This is also why we're really excited about the new Object data type that landed in 22.3 as it handles these cases gracefully by creating dynamic subcolumns.

On JOINs - again, requirements bite us in different ways. Product analytics tool data ingestion pipelines can get quite complicated due to needing to handle merging anonymous and signed in users and user properties changing over time. Handling that via JOINs is as a go-to-market helps avoid that upfront cost by centralising the logic in SQL, but indeed does come with a significant cost in scalability. Delaying in turn allows you to be building tools users need. That said every loan needs to be paid at some point and we're currently knee deep in re-architecting everything to avoid these joins.

Also note that JOINs don't work the way you described from our experience - rather the right hand side of the join gets loaded into memory. The bottleneck there is memory pressure rather than I/O with a good ORDER BY on the table.

All that said, what a great summary of all the different things to keep an eye on. Thanks for reading and sharing your thoughts!

1 comments

barrkel 1545 days ago

I talked about memory latency with respect to joins - that the random access in the hash table is much slower per "row" than vectorized operations on columns. I didn't say I/O would be slow. That I said hash table implies that one side is fully loaded into RAM.

link