Hacker News
by umur 3733 days ago
Umur from Citus here. For purposes of this question, I’ll bucket traditional data warehousing (DWH) solutions like Redshift, Vertica, and Greenplum together, although there are of course many differences among them.

First, Citus is not a traditional data warehouse. We position Citus as the real-time, scalable database that serves your application under a mix of high-concurrency short requests and ad-hoc SQL analytics (i.e. think both random and sequential scans for a customer-facing analytics app). The default storage engine for Citus is the PostgreSQL storage engine, which is row-based. This is in contrast to many data warehouses, which often use a column store and/or batch data loads, and are focused purely on analytics. The trade-offs you get are:

- Citus vs. DWH performance: DWH and Citus parallelize analytics queries in a similar way (multi-core, multi-machine), but most data warehouses use a columnar storage engine instead of a row-based one. Columnar storage is designed for faster analytics queries, which makes a columnar DWH generally faster on longer-running analytics queries. However, this comes at the expense of (1) concurrency and (2) short-request performance (think simple lookups, updates, real-time data ingest) compared with Citus' row-based storage.

If you've tried having tens of concurrent connections to Redshift for short lookups, or performing hundreds or thousands of inserts/updates to power your application, these limitations will be familiar. This is to be expected, as Redshift is designed not as a real-time operational database but as an offline data warehouse.
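To make the row vs. column trade-off a bit more concrete, here is a toy Python sketch. It is not Citus, PostgreSQL, or Redshift code, and it ignores compression, disk pages, and indexes; every name in it (`events`, `row_store`, `column_store`, and the helper functions) is made up for illustration. It only shows why a row layout suits short lookups and single-record writes, while a column layout suits scans over one attribute.

```python
# Toy illustration only: not Citus, PostgreSQL, or Redshift code.
# All names below are hypothetical and exist only for this sketch.

events = [
    {"user_id": 1, "country": "US", "amount": 10},
    {"user_id": 2, "country": "TR", "amount": 25},
    {"user_id": 3, "country": "US", "amount": 5},
]

# Row layout: each record is kept together, keyed by user_id.
row_store = {e["user_id"]: dict(e) for e in events}

# Column layout: each attribute is kept in its own array.
column_store = {
    "user_id": [e["user_id"] for e in events],
    "country": [e["country"] for e in events],
    "amount": [e["amount"] for e in events],
}


def lookup_row(user_id):
    """Short request on rows: one access returns the whole record."""
    return row_store[user_id]


def lookup_column(user_id):
    """Same lookup on columns: one read per column to rebuild the record."""
    i = column_store["user_id"].index(user_id)
    return {name: values[i] for name, values in column_store.items()}


def total_amount_rows():
    """Analytic scan on rows: every full record is visited for one field."""
    return sum(record["amount"] for record in row_store.values())


def total_amount_columns():
    """Analytic scan on columns: only the 'amount' array is read."""
    return sum(column_store["amount"])


def insert_row(event):
    """Row insert: a single write of the whole record."""
    row_store[event["user_id"]] = dict(event)


def insert_column(event):
    """Column insert: one append per column."""
    for name, values in column_store.items():
        values.append(event[name])
```

Both layouts return the same answers; what differs is how much data each operation has to touch, which is the trade-off described above.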

In essence, the two classes of products are more complementary than substitutes, even though they overlap somewhat in their analytic capabilities. Something like Redshift will give you fast offline analytics after you move your data in batch (via S3); Citus will directly power your analytic apps in real time, without ETL'ing your event/user data back and forth between separate OLTP and OLAP databases. Both can be extremely fast: Redshift can take complex data warehousing queries that would otherwise run for an hour and finish them in a few minutes, while Citus can scan and aggregate 100 million records in a few seconds while simultaneously ingesting your events in real time.

I hope that provides some clarification on the workloads. There is a lot more, including columnar storage and product approach (re: implications of extending Postgres 9.5 vs. forking Postgres 8.x), and I’ll dive into those in separate comments as well.


Thank you for the answers, Umur. I've used both Vertica and ParAccel in production environments for traditional data warehousing projects and have come to appreciate both the good and the bad that analytic RDBMS engines bring to the table.

Currently, my favorite is Vertica, but I do have concerns about its future under the stewardship of HPE.

I'm quite interested in what Citus brings to the market and will be following its progress closely. Once you have a more rounded story for traditional data warehousing, I can recommend it to my clients for evaluation.

In terms of a sweet spot for you, here's a free tip for your sales: target customers of Unica (well, IBM Unica now). That's one application that would definitely benefit from your Operational Analytics positioning - lots of data ingested throughout the day, lots of queries to run for the analytics.

"Currently, my favorite is Vertica, but I do have concerns about its future under the stewardship of HPE."

HPE employee here (not a Vertica team member, though). Many teams within HP are very excited to use Vertica, and many more are looking to use Vertica for our own product offerings. There's no reason, in the short term, for HPE to shift away from Vertica. On the contrary, I'd say that HPE will invest more in Vertica.