by gopalv 2024 days ago

> Accumulating more stored data doesn’t make it slower

That is a valid theory when we talk about readers which look at recent data or when you are trying to append data to the existing system.

But in practice, the accumulation of cold data on a local disk is where this starts to hurt, particularly if it has to serve read traffic that starts from the beginning of time (i.e. your queries don't start with a timestamp range).
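A rough sketch of why the unbounded read hurts, assuming a time-ordered log with a sorted timestamp index. The function `records_scanned` and the integer timestamps are made up for illustration, not any real broker's API:

```python
from bisect import bisect_left

def records_scanned(timestamps, start_ts=None):
    """Count the records a reader touches in a time-ordered log.

    `timestamps` is a sorted list standing in for the log's time index
    (a hypothetical simplification). With no start_ts the reader has to
    walk the log from the very first record.
    """
    if start_ts is None:
        return len(timestamps)
    # Binary search to the first record at or after start_ts.
    return len(timestamps) - bisect_left(timestamps, start_ts)

# One million accumulated events, one per time unit.
log_index = list(range(1_000_000))

recent = records_scanned(log_index, start_ts=999_000)  # tail reader
full = records_scanned(log_index)                      # no timestamp range

print(recent, full)
```

The tail reader's cost stays flat as the log grows, while the reader without a timestamp range pays for every cold record ever stored.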

KSQL transforms do help reduce the depth of the traversal by building flatter versions of the data set, but you need to repartition the same data on every lookup key you want. So if you had a video game log trace, you'd need multiple materializations for (user), (user, game), (game), etc.
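To make the cost of one materialization per lookup key concrete, here is a minimal sketch in plain Python rather than KSQL. The event fields and the `materialize` helper are invented for illustration:

```python
from collections import defaultdict

# Hypothetical video game log trace.
events = [
    {"user": "alice", "game": "chess", "score": 10},
    {"user": "alice", "game": "go", "score": 7},
    {"user": "bob", "game": "chess", "score": 3},
]

def materialize(events, key_fields):
    """Group the same events under one lookup key (a stand-in for a repartitioned view)."""
    view = defaultdict(list)
    for event in events:
        view[tuple(event[f] for f in key_fields)].append(event)
    return dict(view)

# One full copy of the data per lookup key you want to serve.
lookup_keys = [("user",), ("user", "game"), ("game",)]
views = {key: materialize(events, key) for key in lookup_keys}

# Every event is stored once per materialization.
total_copies = sum(len(rows) for view in views.values() for rows in view.values())
print(total_copies)
```

Each lookup is now a single dictionary hit instead of a scan, but storage and write amplification scale with the number of keys you materialize.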

And on the local storage part, EBS is expensive if it just holds cold data, and on top of that you replicate it to maintain availability during a node recovery. EBS is like a 1.5x redundant store, better than a single node. I liked the Druid segment model of shoving it off to S3 and still being able to read off it (i.e. not just streaming to S3 as a dumping ground).

When Pravega came out, I liked it a lot for the same reason, but it hasn't gained enough traction.