(Part 3/3 - please see two comments below as the starting point) Aside from the different use-cases they address, there is one other, important difference between Citus and Redshift (and any other distributed database in the world, for that matter). Citus does not fork the underlying database, PostgreSQL. Instead, Citus extends PostgreSQL to transform it into a parallel processing, distributed database. We use PostgreSQL's powerful extension APIs to accomplish this (you simply CREATE EXTENSION Citus on PostgreSQL's latest version, 9.5, to get your distributed PostgreSQL database).
While this might appear as an implementation piece at first, it has important product implications, and might even impact how you might want to think about your database stack. By not forking the core database, you are choosing to always stay with the core PostgreSQL product. For starters, you get the uber-cool (and uber-fast) JSONB type that came with 9.4, or the recently checked in UPSERTs, or the popular PostGIS extension for geospatial capabilities. More philosophically, the moment you use forks of database, you know you'll be diverging over time. And when you introduce new databases and/or piece together many different ones to build one application, your development cycles will only get costlier and more complex over time.
This was a long answer to a short question, but hopefully useful. Let me know if you have questions, or any feedback using Citus – would love to hear your thoughts!
I think that it would be better for you to position CitusDB by comparing it to other products in terms of use cases.
If the data is big and I need to run analytic queries then I think I have to use a columnar storage format because row-oriented formats cause too much overhead for aggregation queries that usually need to process single column efficiently. If I use CitusDB as an analytical database, then it's comparable with Redshift, Hive etc. As you said, they're suitable for offline data but Can I use cstore_fdw in CitusDB and able to take advantage of real-time nature of Postgresql? Maybe I can push hot data to a table that use row-oriented format and move the data periodically to another table that uses cstore_fdw and execute queries that fetches data from both cold storage and hot storage tables? If CitusDB makes it easy for me, then I think this is huge.
I guess another use case is using CitusDB as distributed data store and executing filter queries such as "SELECT * FROM table WHERE partition_key = x and predicate1 = y ...". Instead of using multiple Postgresql instances and routing the queries in application level, I can just use CitusDB that takes care of replication && query routing && sharding etc. I think it can also be comparable to databases such as Cassandra, Mongo (using jsonb) since they also have similar use-cases.
Or should I think CitusDB as distributed Postgresql?
> If I use CitusDB as an analytical database, then it's comparable with Redshift, Hive etc.
A particular difference is in response times and concurrency. Data warehouses and Hive are great for reporting use-cases, but not for use-cases that require fast responses and have many users like analytical dashboards. This is a use-case for which Citus is particularly well-suited (see for example the CloudFlare dashboard).
> Can I use cstore_fdw in CitusDB and able to take advantage of real-time nature of Postgresql?
Yes, since cstore_fdw and Citus are both developed by Citus Data we made sure they're fully integrated. We've even seen some deployments that use a mixture of columnar- and row-based storage in a single distributed table.
We find that row-based storage generally has better ingestion performance and more indexing possibilities. Citus can do very fast execution of analytical queries by parallelizing over row-based shards and using the indexes on each of them. However, if you only need a small number of columns and have analytical queries that are not very selective, you can use columnar storage just as easily and even mix and match (might require some support).
> I guess another use case is using CitusDB as distributed data store
Yep, Citus can definitely be used for that by using hash-partitioned tables.
(Part 2) Now comes the storage engine, and cstore_fdw as it relates to PostgreSQL. Built by the Citus Data team, cstore_fdw is entirely a separate component from the Citus product above. It enables columnar storage for your vanilla, single-node PostgreSQL to provide data compression for faster analytics. As such, cstore_fdw does not come with any of the parallelism I've described above that Citus (or Redshift, Vertica etc.) provides.
Precisely because cstore_fdw is built for PostgreSQL, and Citus is PostgreSQL (see Part 3), however, you can still choose to use cstore_fdw as the storage engine for your Citus cluster. Citus will still parallelize the queries as you'd expect it to, but instead of hitting row- based tables, they will hit columnar ones. cstore_fdw has certain limitations, importantly it is not updatable; so we don't consider it as an alternative to a data warehouse. Rather, it is useful if you are archiving your quickly growing timeseries / event data on PostgreSQL or Citus.
To load or append data into a cstore table, you have two options:
You can use the COPY command to load or append data from a file, a program, or STDIN.
You can use the INSERT INTO cstore_table SELECT ... syntax to load or append data from another table.
Note: We currently don't support updating table using DELETE, and UPDATE commands. We also don't support single row inserts.
So I think you can certainly mutate tables, but the focus is on bulk-inserts, rather than individual append actions.
Umur from Citus here. For purposes of this question, I’ll bucket traditional data warehousing (DWH) solutions like Redshift, Vertica, Greenplum together, although there are many nuances among each of them of course.
First, Citus is not a traditional data warehouse. We position Citus as the real-time, scalable database that serves your application under a mix of high- concurrency short requests and ad-hoc SQL analytics (i.e. think both random and sequential scans for a customer-facing analytics app). The default storage engine for Citus is the PostgreSQL storage engine, which is row-based. This is in contrast to many data warehouses, which often use a column store and/or batch data loads, and are focused purely on analytics. The trade-offs you get are:
- Citus vs. DWH performance: DWH and Citus both have a similar parallelization for analytics queries (multi-core, multi-machine), but most data warehouses typically use a columnar storage engine instead of a row-based one. Columnar storage is designed for faster analytics queries, so that makes columnar DWH generally faster on longer running analytics queries. However, this comes at the expense of (1) concurrency and (2) short-request performance (think simple lookups, updates, real-time data ingest) vs. Citus' row-based storage. If you've tried having 10s of concurrent connections to Redshift for short lookups, or performing 100s/1000s of inserts/updates to power your application, these limitations will be familiar. This is to be expected, as Redshift is not designed as a real-time operational database, but an offline data warehouse.
In essence, the two classes of products are more complimentary than substitutes, even while they have some overlaps in their analytic capabilities. Something like Redshift will give you fast offline analytics, after you move your data in batch (via S3); Citus will directly power your analytic apps in real-time; without ETL'ing your event/user data back and forth between separate OLTP and OLAP databases. Both can be extremely fast: Redshift can run complex data warehousing queries that take an hour in a few minutes, Citus can scan and aggregate 100 million records in a few seconds, while simultaneously ingesting your events in real-time.
I hope that provides some clarification on the workloads. There is a lot more, including columnar storage and product approach (re: implications of extending Postgres 9.5 vs. forking Postgres 8.x), and I’ll dive into those in separate comments as well.
Thank you for the answers, Umur. I've used both Vertica and ParAccel in production environments for the traditional Data Warehousing projects and have come to appreciate both good and the bad that analytic RDBMS engines bring to the table.
Currently, my favorite is Vertica, but I do have concerns about its future under the stewardship of HPE.
I'm quite interested in what Citus brings to the market and will be following its progress closely. Once you have a more rounded story for the traditional Data Warehousing purposes, I can recommend it to my clients for evaluation purposes.
In terms of a sweet spot for you, here's a free tip for your sales: target customers of Unica (well, IBM Unica now). That's one application that would definitely benefit from your Operational Analytics positioning - lots of data ingested throughout the day, lots of queries to run for the analytics.
"Currently, my favorite is Vertica, but I do have concerns about its future under the stewardship of HPE."
HPE employee here (Not a vertica team member, though)
Many teams within HP are very excited to use Vertica. Many more teams are looking to use Vertica for our own product offerings. There's no reason, in the short term, for HPE to shift away from Vertica. On the other hand, I'd say that HPE will invest more into Vertica.
While this might appear as an implementation piece at first, it has important product implications, and might even impact how you might want to think about your database stack. By not forking the core database, you are choosing to always stay with the core PostgreSQL product. For starters, you get the uber-cool (and uber-fast) JSONB type that came with 9.4, or the recently checked in UPSERTs, or the popular PostGIS extension for geospatial capabilities. More philosophically, the moment you use forks of database, you know you'll be diverging over time. And when you introduce new databases and/or piece together many different ones to build one application, your development cycles will only get costlier and more complex over time.
This was a long answer to a short question, but hopefully useful. Let me know if you have questions, or any feedback using Citus – would love to hear your thoughts!