Hacker News new | ask | show | jobs
by buremba 3735 days ago
I think that it would be better for you to position CitusDB by comparing it to other products in terms of use cases.

If the data is big and I need to run analytic queries then I think I have to use a columnar storage format because row-oriented formats cause too much overhead for aggregation queries that usually need to process single column efficiently. If I use CitusDB as an analytical database, then it's comparable with Redshift, Hive etc. As you said, they're suitable for offline data but Can I use cstore_fdw in CitusDB and able to take advantage of real-time nature of Postgresql? Maybe I can push hot data to a table that use row-oriented format and move the data periodically to another table that uses cstore_fdw and execute queries that fetches data from both cold storage and hot storage tables? If CitusDB makes it easy for me, then I think this is huge.

I guess another use case is using CitusDB as distributed data store and executing filter queries such as "SELECT * FROM table WHERE partition_key = x and predicate1 = y ...". Instead of using multiple Postgresql instances and routing the queries in application level, I can just use CitusDB that takes care of replication && query routing && sharding etc. I think it can also be comparable to databases such as Cassandra, Mongo (using jsonb) since they also have similar use-cases.

Or should I think CitusDB as distributed Postgresql?

1 comments

(Marco from Citus Data)

> If I use CitusDB as an analytical database, then it's comparable with Redshift, Hive etc.

A particular difference is in response times and concurrency. Data warehouses and Hive are great for reporting use-cases, but not for use-cases that require fast responses and have many users like analytical dashboards. This is a use-case for which Citus is particularly well-suited (see for example the CloudFlare dashboard).

> Can I use cstore_fdw in CitusDB and able to take advantage of real-time nature of Postgresql?

Yes, since cstore_fdw and Citus are both developed by Citus Data we made sure they're fully integrated. We've even seen some deployments that use a mixture of columnar- and row-based storage in a single distributed table.

We find that row-based storage generally has better ingestion performance and more indexing possibilities. Citus can do very fast execution of analytical queries by parallelizing over row-based shards and using the indexes on each of them. However, if you only need a small number of columns and have analytical queries that are not very selective, you can use columnar storage just as easily and even mix and match (might require some support).

> I guess another use case is using CitusDB as distributed data store

Yep, Citus can definitely be used for that by using hash-partitioned tables.