| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by umur 3736 days ago

(Part 3/3 - please see two comments below as the starting point) Aside from the different use-cases they address, there is one other, important difference between Citus and Redshift (and any other distributed database in the world, for that matter). Citus does not fork the underlying database, PostgreSQL. Instead, Citus extends PostgreSQL to transform it into a parallel processing, distributed database. We use PostgreSQL's powerful extension APIs to accomplish this (you simply CREATE EXTENSION Citus on PostgreSQL's latest version, 9.5, to get your distributed PostgreSQL database).

While this might appear as an implementation piece at first, it has important product implications, and might even impact how you might want to think about your database stack. By not forking the core database, you are choosing to always stay with the core PostgreSQL product. For starters, you get the uber-cool (and uber-fast) JSONB type that came with 9.4, or the recently checked in UPSERTs, or the popular PostGIS extension for geospatial capabilities. More philosophically, the moment you use forks of database, you know you'll be diverging over time. And when you introduce new databases and/or piece together many different ones to build one application, your development cycles will only get costlier and more complex over time.

This was a long answer to a short question, but hopefully useful. Let me know if you have questions, or any feedback using Citus – would love to hear your thoughts!

1 comments

buremba 3736 days ago

I think that it would be better for you to position CitusDB by comparing it to other products in terms of use cases.

If the data is big and I need to run analytic queries then I think I have to use a columnar storage format because row-oriented formats cause too much overhead for aggregation queries that usually need to process single column efficiently. If I use CitusDB as an analytical database, then it's comparable with Redshift, Hive etc. As you said, they're suitable for offline data but Can I use cstore_fdw in CitusDB and able to take advantage of real-time nature of Postgresql? Maybe I can push hot data to a table that use row-oriented format and move the data periodically to another table that uses cstore_fdw and execute queries that fetches data from both cold storage and hot storage tables? If CitusDB makes it easy for me, then I think this is huge.

I guess another use case is using CitusDB as distributed data store and executing filter queries such as "SELECT * FROM table WHERE partition_key = x and predicate1 = y ...". Instead of using multiple Postgresql instances and routing the queries in application level, I can just use CitusDB that takes care of replication && query routing && sharding etc. I think it can also be comparable to databases such as Cassandra, Mongo (using jsonb) since they also have similar use-cases.

Or should I think CitusDB as distributed Postgresql?

mslot 3736 days ago

(Marco from Citus Data)

> If I use CitusDB as an analytical database, then it's comparable with Redshift, Hive etc.

A particular difference is in response times and concurrency. Data warehouses and Hive are great for reporting use-cases, but not for use-cases that require fast responses and have many users like analytical dashboards. This is a use-case for which Citus is particularly well-suited (see for example the CloudFlare dashboard).

> Can I use cstore_fdw in CitusDB and able to take advantage of real-time nature of Postgresql?

Yes, since cstore_fdw and Citus are both developed by Citus Data we made sure they're fully integrated. We've even seen some deployments that use a mixture of columnar- and row-based storage in a single distributed table.

We find that row-based storage generally has better ingestion performance and more indexing possibilities. Citus can do very fast execution of analytical queries by parallelizing over row-based shards and using the indexes on each of them. However, if you only need a small number of columns and have analytical queries that are not very selective, you can use columnar storage just as easily and even mix and match (might require some support).

> I guess another use case is using CitusDB as distributed data store

Yep, Citus can definitely be used for that by using hash-partitioned tables.