
by lobster_johnson 3738 days ago

Surprised Elasticsearch isn't mentioned in the article.

Unlike several of the databases mentioned, it has a data model particularly appropriate for analytics. It only looks schemaless: there is a schema, but it is extensible (no need to pre-declare it), and by default every field is indexed. That means there's no extra work on the client to make sure indexes exist for new fields.

More importantly, it does complex, nested, distributed aggregations (top-K, date histograms, etc.) out of the box, and is incredibly fast at it, owing to the columnar-store-like Lucene index model. You can do complicated aggregations across millions of values over several dimensions in milliseconds.
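To make "nested aggregations" concrete, here is a rough sketch of a request body that buckets documents by day and then computes the top-K terms inside each bucket. The field names (`timestamp`, `page`) and the helper name `top_k_per_day` are made up for illustration. The `interval` key is the 2.x-era spelling; newer releases use `calendar_interval` or `fixed_interval` instead, so check against your version.

```python
import json

def top_k_per_day(term_field, date_field, k=10, interval="day"):
    """Build a search body: a date histogram with a top-K terms
    sub-aggregation nested inside every time bucket.

    Hypothetical helper; it only builds the JSON body and sends nothing.
    """
    return {
        "size": 0,  # skip the hits, return aggregation results only
        "aggs": {
            "per_interval": {
                # Outer dimension: one bucket per time interval.
                "date_histogram": {"field": date_field, "interval": interval},
                "aggs": {
                    # Inner dimension: the k most frequent terms per bucket.
                    "top_terms": {"terms": {"field": term_field, "size": k}}
                },
            }
        },
    }

body = top_k_per_day("page", "timestamp", k=5)
print(json.dumps(body, indent=2))
```

The body would be sent to an index's `_search` endpoint. Nesting another `aggs` level under `top_terms` adds a further dimension in the same way.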

Elasticsearch has consistency issues, though, and even with 2.x and the recent translog support you should probably never use it as a primary data store.

Some of the other databases mentioned (Cassandra, Riak and so on) are useful mostly as primary datastores that get processed into something that can do aggregations. For example, Cassandra -> Elasticsearch is probably a great combo.
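As a sketch of what the Cassandra -> Elasticsearch hop could look like, the function below turns rows (plain dicts, as you might get from any Cassandra client) into the newline-delimited body the Elasticsearch bulk API expects: one action line followed by one document line per row. The function name, the `events` index, and the `id` key are assumptions for illustration, and reading from Cassandra and POSTing to `_bulk` are left out. Older releases such as 2.x also want a `_type` in the action line, hence the optional `doc_type` argument.

```python
import json

def to_bulk_body(rows, index, id_field="id", doc_type=None):
    """Convert an iterable of dict rows into a bulk-API request body.

    Hypothetical helper; does no I/O.
    """
    lines = []
    for row in rows:
        action = {"_index": index, "_id": str(row[id_field])}
        if doc_type is not None:
            action["_type"] = doc_type  # needed on older versions only
        lines.append(json.dumps({"index": action}))
        # default=str is a blunt fallback for values like UUIDs or datetimes.
        lines.append(json.dumps(row, default=str))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"

rows = [{"id": 1, "page": "/a"}, {"id": 2, "page": "/b"}]
bulk_body = to_bulk_body(rows, "events")
print(bulk_body)
```

Using the primary store's key as `_id` keeps re-indexing idempotent: replaying the same rows overwrites documents rather than duplicating them, which fits the idea of Elasticsearch being a derived store you can rebuild.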