Hacker News new | ask | show | jobs
by danmaz74 4392 days ago
This looks so good, that a question arises: Where's the catch? In other words, in which situations is a columnar DB a bad solution?
4 comments

We use Vertica for our analytics, and basically, INSERT, UPDATE and DELETE are performance risks, as are trying to SELECT large amounts of entities.

As other commenters have said, Vertica, like other columnar DBs, is optimised for reads, and it does so basically through materialized views (called projections) across multiple nodes - so when underlying data is updated, those updates have to be replicated across the system. It's not a huge concern in a reporting / analysis environment though.

Retrieving whole entities is not particularly performant, but if you want to group and aggregate columns, it's blistering fast.

I ran some numbers - our largest stats table in Postgres is 64GB, 379 million rows, and summing a column can take up to 4.5 minutes. Our largest table in Vertica is 500GB, 2.7 billion rows, and summing a column can take up to 100 milliseconds.

I'd have to say that the worst thing about Vertica is that it's not free as in beer or as in speech. Licencing is done on a raw data basis, and the high costs of upgrading licences often drive our technical decisions. And likewise, trying to use the Vertica JDBC library to stream entities from Postgres to Vertica was an interesting exercise (in fairness to Vertica, it was caused by poor implementation of some code in the Postgres JDBC library), because I had only the compiled JARs to work with, and so debugging was difficult.

Other catches - constraints are not enforced - a performance decision. The other issue is that projection design is a very fine art and really determines how much performance you'll get. That said, Vertica ships with a projection designer that designs projections to answer specified queries fast, and the designer is suitable for about 98% of your needs. For the last 2%, you can often find yourself trying to interpret the mind of the query planner when you want it to use your new projection.

...This leads to another catch - as projection structure is strongly tied to query structure, queries that are changed and ad-hoc queries may suffer performance issues, and they do so silently. Vertica does make excellent records of query metrics, but nothing will blow up if an existing projection can no longer answer a query.

The last caveat is that importing data into Vertica requires some thought, and how you do it is typically determined by Vertica's design constraints. The documentation is good, and the training is worthwhile, but there's still a lot of thinking specific to the implementation required.

"basically through materialized views... summing a column can take up to 100 milliseconds"

That's a key point worth expanding on. This isn't so much a columnar/row distinction as it is a point about materializing results (or partial results) in a clever way that allows for fast processing for those kinds of queries.

That moves the problem: queries are faster (as long as they are in the sweet spot of queries that you expected), but you have extra cost at insert/update/delete time and extra effort when designing the "projections" (as you say).

Columnar makes a similar trade in that writes are slower and a class of "typical" grouping and aggregation queries are generally faster.

By combining the two strategies (materialization and columnar), Vertica is essentially doubling-down on that trade. So queries that match what you designed the projections for, and that follow typical patterns of grouping and aggregation, are blazingly fast (or so I've heard). But it's more restrictive about the kinds of queries that will work well, and it's sensitive to new kinds of queries that you might not have planned for when designing the projections (again: or so I've heard). In other words, it's more specialized.

Personally, I don't find ultra-specialized databases very interesting (speaking generally here, not necessarily saying that Vertica is ultra-specialized). I basically see two reasons to use something specialized: (1) replicate data into it and run queries (i.e. an accelerator); and (2) put lots of low value data in it, that you'd otherwise throw away, and try to extract some signal from it. (There are a couple other cases, like real-time systems, but that is outside of what I ordinarily deal with.)

Disclaimer: I work for a big data company that is not Vertica. Our products offer row or columnar as storage options, but that is not central to the products' value propositions. Opinions above are my own.

> Personally, I don't find ultra-specialized databases very interesting (speaking generally here, not necessarily saying that Vertica is ultra-specialized). I basically see two reasons to use something specialized: (1) replicate data into it and run queries (i.e. an accelerator); and (2) put lots of low value data in it, that you'd otherwise throw away, and try to extract some signal from it. (There are a couple other cases, like real-time systems, but that is outside of what I ordinarily deal with.)

Pretty much. We had a specific business requirement of near real-time statistics and historical statistics available at an entity / day level of granularity for advertisers and publishers using our platform.

Vertica occupies a very sweet spot for us in that regard, but it requires a lot of specialised approaches. I've managed to break its tuple mover a couple of times when loading historical raw data in large quantities.

Thanks for the really detailed explanation!
Have you tried monetdb (since your largest table is 500GB/1node?)?
I haven't, but I'll definitely look into it, cheers. :) Our largest table is only that size due to licensing issues, mind.
Its a bad solution for OLTP. C-stores are read optimized while a typical (row-oriented) rdbms is write optimized. Read optimized storage is typically accomplished via a write buffer that is flushed to a read store at a certain epoch.

Basically, if you need to be able to immediately read the data you just wrote, a cstore is a bad choice.

There's also in betweens. Cassandra for example is a bit of a hybrid. It writes quite well even though it has some columnar / big-table aspects.
Ben from CitusData here.

At a high level columnar stores are good for analytical workloads and not great for OLTP workloads. Some reasons for this:

* cstore_fdw currently only supports modifying data via the COPY command. This means that UPDATE and DELETE operations aren't yet possible which limits the utility for OLTP workloads.

* The skiplist indexing system works well for analytical workloads where you are aggregating or retrieving a significant number of rows but wouldn't be great for retrieving single rows.

Great, thanks. We have some OLAP datamarts in our app, I try your solution soon.
Anything non-OLAP. Using a columnar store for your transactional master, or as a document database would probably be disastrously bad, for example.