
by simonw 2755 days ago

If Elasticsearch is at the end of an ETL pipeline, does that mean that if Elasticsearch gets corrupted you can rebuild it by re-running the pipeline?

If so, I wouldn't call this a "primary data store", since durability isn't critical.

The article says:

> After drafting many blueprints, we went for a Java service backed by Elasticsearch as the primary storage! This idea brought shivers to even the most senior Elasticsearch consultants hired

I'd shiver if Elasticsearch corruption meant irreversibly losing data, but if the index can be rebuilt from another source I don't see any problem with it at all.

2 comments

KennyCason 2755 days ago

Agreed.

We’ve been running large Elasticsearch clusters as our primary search/analytics engine. While it’s overall very stable, stuff does occasionally happen that requires an index rebuild. We use HBase as our primary store and index via MapReduce or Spark batch jobs.

As much as I love Elasticsearch, I definitely wouldn’t be able to sleep at night knowing it was the primary datastore.


kn7 2755 days ago

We store the real-time content stream in a separate bulk storage unit (e.g., BigQuery) with a certain retention window, but the ETL'ed documents are always on ES. Given that a plain event (i.e., not an ETL'ed document) is not of much value for search, I would not call the stream storage the primary storage. It just helps us rebuild the ETL state in case of an emergency.
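
To make the rebuild idea concrete, here is a minimal sketch of replaying retained events through an ETL step to regenerate search documents. Everything in it is an assumption for illustration: the `Event` shape, the `etl` and `rebuild_index` helpers, and the in-memory dict standing in for the Elasticsearch index are hypothetical, not the pipeline described in the article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """Hypothetical plain event: one field update for one document."""
    doc_id: str
    field: str
    value: str
    at: datetime


def etl(events):
    """Fold plain events into documents keyed by doc_id; later events win."""
    docs = {}
    for e in sorted(events, key=lambda e: e.at):
        docs.setdefault(e.doc_id, {"id": e.doc_id})[e.field] = e.value
    return docs


def rebuild_index(stream, now, retention):
    """Rebuild the derived index from events still inside the retention window."""
    cutoff = now - retention
    retained = [e for e in stream if e.at >= cutoff]
    return etl(retained)


now = datetime(2024, 1, 10)
stream = [
    Event("p1", "title", "Red shoe", now - timedelta(days=40)),  # older than the window
    Event("p1", "price", "49.99", now - timedelta(days=2)),
    Event("p2", "title", "Blue hat", now - timedelta(days=1)),
]

# The dict returned here stands in for a freshly rebuilt search index.
index = rebuild_index(stream, now, timedelta(days=30))
for doc in index.values():
    print(doc)
```

In this toy model, a document whose events are older than the retention window comes back only partially (`p1` loses its title), so how complete a rebuild is depends on how the window compares to the age of the data. Whether that applies to the setup described above is not stated in the thread.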
