| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by spathak 3714 days ago

Sumedh from Citus Data here.

> Does Citus keeps it's performance over tables with tens of billions of records?

Citus essentially shards the data across machines, and queries these in parallel. You can thus scale out your cluster and CPU cores as you add more data and maintain performance.

> Also, how fast it is for ad-hoc queries over data coming from streams (Kafka/Kinesis) that has not been cached?

By 'cached', do you mean OS or database caching in-memory? Query performance for on-disk data is as fast as you can get with regular PostgreSQL, since each data node is essentially a PostgreSQL node, and each shard a regular PostgreSQL table. Standard tuning like indexes and Postgres configuration parameters will apply here.

1 comments

Tharkun 3714 days ago

Not every query is parallelizable. Maintaining performance is a lie. An easy to grasp is example is computing a median. And I mean an exact median, not an approximation.

link

spathak 3714 days ago

@Tharkun: You are right that not every query is immediately parallelizable. Distinct count's are another example. In some cases data can be re-partitioned so we can calculate exact values and push down computation in parallel. This may provide better performance than a single large table, so there are still benefits to it. Ultimately though there will be tradeoffs to moving to an entirely distributed environment, but depending on the use-case the value may offset those.

link

mtanski 3714 days ago

I'm not sure why folks are downvoting you because most database systems that provide the full array of relational operations (joins, groupby, groupby cube, etc) do not scale linearly (maybe past a handful of nodes). Mixing OLTP / OLAP using current technologies is hard.

link