Hacker News new | ask | show | jobs
by r888888888 3545 days ago
It's not that simple.

If you're using a secondary data store like Elasticsearch presumably you're combining it with a traditional relational database like Postgres or MySQL. And in most use cases, you're going to require additional constraints on top of full text, like restricting to a specific customer or topic, etc. This means data duplication. Any experienced developer should be breaking out in hives at this point because this means (1) keeping two data stores in sync; (2) adding another service to the deployment process; (3) adding another service to development environments; (4) adding another service to debug when problems crop up; (5) large scale schema changes and migrations are complicated because of two data stores; (6) operational overhead from maintaining Elasticsearch like separate backup scripts and deployment scripts and sharding config (across production, qa, and dev); (7) queries are potentially complicated because you may have to aggregate information from two data stores (pagination and delayed updates makes this even more hellish); (8) if your company is anal about devs accessing production data then ES just introduces another layer of requesting permissions and every time the schema changes you have to ask again.

And maybe the advantages of Elasticsearch far outweigh these negatives, but this is a conversation you're going to have with your devs because the cost of using ES is not zero. There are tradeoffs, and maybe just using Postgres full text search is adequate for your use case given sufficient additional constraints.

4 comments

Data duplication virtually always happens as the organization gets larger. Even if you stick to RDBMS-only, you're going to get it once you setup master/slave replication, batch jobs, data science, tape backups, etc.

It's not a serious problem as long as you designate one data source as primary and every other data source as derived. That way, if you ever get data corruption issues, you can re-create the derived data and blow away the old copy. You also don't face complex data-sync issues, because all data-transfer is unidirectional. You write into Postgres or your other source-of-truth, and then you write from Postgres to ElasticSearch. If there's a difference between them, the Postgres version is authoritative, so copy its data over ElasticSearch.

Most of your other points are common to adding any new major feature, and you should always weigh the user benefits against the maintenance costs of a large new dependency. Presumably, if you're looking at ElasticSearch, it's because full-text search in your RDBMS is not good enough (which won't always be the case, but has been my experience and apparently the experience of several other people here).

For those thinking that using Elasticsearch is "trivial", please read the parent comment. And read it again. This is pure wisdom.
Not only do some of us no break out in hives at data duplication we actually design with it in mind!

http://martinfowler.com/bliki/CQRS.html

The devops issues aren't a valid counter-point. Tuning an RDBMS is an art.

It's much easier to scale a ephemeral search service than a transactional database. Likely even one search node will support a higher query throughput.