by tlipcon 3915 days ago
Yep, that's correct. HDFS+Parquet is more accurate but doesn't fit quite as well on slides or in short descriptions.

The idea is to get the analytic scan performance of Parquet while still allowing for in-place updates and row-by-row access like HBase.

HDFS (with Parquet or other formats) will still be better for unstructured or fully immutable datasets. HBase will still be better when your top priority is ingest rate, random access, and semi-structured data. Kudu should be good when you've got tabular data as described above.


Impala has an in-memory columnar format on its roadmap for 2016. Is that format being designed with Kudu in mind?

Edit: I understand that the formats, while both columnar, serve different purposes. I am more curious about overlap if any between the two.

Yep, I've been taking part in those design discussions. We hope to have Kudu tablet servers support generating this in-memory format in shared memory as the result of scans, so the Impala server (the client, from Kudu's perspective) can operate directly on the data. We're expecting a 20-30% speed boost from this for some queries, though we haven't yet tested the prototype at scale.
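
For anyone who wants a feel for the shared-memory handoff described above, here's a rough Python sketch. To be clear, this is not Kudu's or Impala's actual format or API. It only assumes a single int64 column laid out contiguously, and the names publish_column and sum_column are made up for illustration. The producer stands in for a tablet server writing scan results, and the consumer stands in for the query engine reading them in place.

```python
from multiprocessing import shared_memory

INT64_SIZE = 8  # bytes per value in this toy column (assumed layout)


def publish_column(values):
    """Producer (stand-in for a tablet server): write one int64 column
    contiguously into a new shared-memory segment."""
    n = len(values)
    shm = shared_memory.SharedMemory(create=True, size=max(n * INT64_SIZE, 1))
    # View the first n * 8 bytes as native int64 values and fill them in place.
    with shm.buf[: n * INT64_SIZE].cast("q") as column:
        for i, v in enumerate(values):
            column[i] = v
    return shm, n


def sum_column(segment_name, n):
    """Consumer (stand-in for the query engine): attach to the segment by
    name and aggregate over the column directly from the shared buffer."""
    shm = shared_memory.SharedMemory(name=segment_name)
    try:
        with shm.buf[: n * INT64_SIZE].cast("q") as column:
            return sum(column)
    finally:
        shm.close()  # detach only; the producer owns the segment


if __name__ == "__main__":
    segment, count = publish_column([3, 1, 4, 1, 5])
    try:
        print(sum_column(segment.name, count))
    finally:
        segment.close()
        segment.unlink()  # free the segment once the consumer is done
```

The point is just that the consumer never deserializes rows: it attaches to the same memory and scans the column as-is. A real design would also need a schema, null handling, and some way to signal when a batch is ready.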