Hacker News
by jungletime 2435 days ago
3 years ago, I made a simple calendar app in Django, and I wanted to use Elasticsearch so users could search for and find an event, and to use it to populate an upcoming events list. There are only about 10,000 events in the database.

I quickly realized what a pain it is to use Elasticsearch for a simple app like mine.

Pain points:

1) You have to set up and recreate part of your database in Elasticsearch. So you essentially end up with two databases, which you now have to keep in sync.

2) I was getting unpredictable query results from Elasticsearch, which, after a few days and much head scratching, turned out to be because I was running out of memory.

3) When a user added a new event, it was not being added to the Elasticsearch index automatically. I could not figure out how to do this reliably; it only worked reliably after a sync of the entire Elasticsearch index. But since I only wanted to sync the index once a day, this made it next to useless for the Upcoming Events List, and users were confused as to why their event was not showing up. So I gave up and just ended up implementing the Upcoming Events List directly from my database in Python.

4) Elasticsearch came with some security settings not enabled by default, and after a few months it was hacked. I had to download a new version and wasted more time.

I still use Elasticsearch, but only for search, and not the upcoming event list. And I don't think it was worth the complexity that it added to my project.
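
For the upcoming events list, a plain database query is enough at this scale. A minimal sketch, assuming SQLite from the Python standard library as a stand-in for the app's real database; the `events` table, its columns, and the `upcoming_events` helper are hypothetical names, not taken from the original app:

```python
import sqlite3
from datetime import date


def upcoming_events(conn, today, limit=10):
    """Return (title, starts_on) rows on or after `today`, soonest first."""
    # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
    return conn.execute(
        "SELECT title, starts_on FROM events "
        "WHERE starts_on >= ? ORDER BY starts_on LIMIT ?",
        (today.isoformat(), limit),
    ).fetchall()


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, title TEXT, starts_on TEXT)"
)
conn.execute("CREATE INDEX events_starts_on ON events (starts_on)")
conn.executemany(
    "INSERT INTO events (title, starts_on) VALUES (?, ?)",
    [
        ("Past meetup", "2019-01-10"),
        ("Jazz night", "2019-06-02"),
        ("Book fair", "2019-05-20"),
    ],
)

print(upcoming_events(conn, date(2019, 5, 1)))
```

Because the list reads from the same table users write to, a new event shows up immediately, with no separate index to sync.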

9 comments

This is a mistake many people make. Elasticsearch is probably overkill for your particular use case.

It’s similar to bringing an F1 car to a go-kart race and then being surprised you aren’t able to finish the race because you don’t have a pit crew able to maintain your vehicle.

I’ve built and owned large Elasticsearch clusters at Fortune 50 companies, providing log search as well as document search. Like anything, administering an ES cluster requires planning, engineering, and process/documentation.

I wouldn’t consider using ES if I didn’t have a dedicated ops team to help in its administration unless possibly using a managed service like the one AWS provides.

It’s a very powerful tool; it was a mistake to think you can just casually throw it in your stack without fully understanding its complexity.

Elasticsearch isn't a database, it's a search engine built on top of Lucene. Although some may use it as a document-store, and their own marketing claims it's ok, it never ends up being a good choice: https://www.quora.com/Why-shouldnt-I-use-ElasticSearch-as-my...
I have been put off by Elasticsearch's complexity a number of times. Can I ask why, with only a limited amount of text to search, you didn't just use Postgres' full-text search?
I'm not the person you're replying to, but does Postgres nowadays have a straightforward way to do tf-idf or BM25-style information retrieval?
Not op but not that I know of.

I commented below - I highly recommend Xapian for small projects to test the waters. It’s the SQLite of search.

Or you could use the FTS extension of SQLite. https://sqlite.org/fts3.html
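
To make the SQLite suggestion concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes your SQLite build includes the newer FTS5 extension (the link above documents FTS3/FTS4), which provides a `bm25()` ranking function; the table name and sample rows are made up:

```python
import sqlite3


def search(conn, query, limit=10):
    """Return (title, score) rows matching `query`, best match first."""
    # FTS5's bm25() returns smaller (more negative) values for better
    # matches, so an ascending sort puts the best hit on top.
    return conn.execute(
        "SELECT title, bm25(events_fts) AS score FROM events_fts "
        "WHERE events_fts MATCH ? ORDER BY score LIMIT ?",
        (query, limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
# Raises sqlite3.OperationalError if FTS5 is not compiled in.
conn.execute("CREATE VIRTUAL TABLE events_fts USING fts5(title, description)")
conn.executemany(
    "INSERT INTO events_fts (title, description) VALUES (?, ?)",
    [
        ("Jazz night", "Live jazz quartet downtown"),
        ("Farmers market", "Fresh produce every Saturday"),
        ("Jazz brunch", "Brunch with a piano trio"),
    ],
)

for title, score in search(conn, "jazz"):
    print(title, score)
```

The index lives in the same database file as the data, so there is no second system to keep in sync.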
I was new to Python, Django and Postgres. I was looking up how to do search and stumbled on articles how to use Elasticsearch in Django. So I went with it.
Should you find yourself there again I recommend Django Haystack and Xapian.
10k? Hell, any vanilla RDBMS can handle this (including SQLite).
We use zombodb https://www.zombodb.com/ to keep ES and PostgreSQL in sync. It has its flaws, but when it works it works perfectly. That at least helps with 1 and 3, which are indeed major annoyances.
Kafka streams can solve this use case fairly well - though setting up & managing infra may be a bit more than what you'd want to deal with for a hobby project
+1, or use any other log-based replication mechanism (e.g. Logstash). The point is that instead of having two independent systems that can easily go out of sync (if not using distributed transactions) and become permanently inconsistent with each other, you'll now have the database as the primary source of data (commonly referred to as the system of record) and Elasticsearch as a secondary, eventually consistent search index. This approach sacrifices read-your-writes consistency, but for a search index this can be tolerated.
If the database is the primary source of data, how do you get the data from there into the log-based replication method? I assumed the OP meant you'd write to Kafka, and the messages would be processed twice: once to write to the DB, and once to ElasticSearch.

Not wanting to do that for a small project, but wanting a better architecture than I've got, I'm curious about your proposed approach.

Two possibilities: either the app writes to both the database and Kafka (ideally using an atomic commit), or CDC is set up in Kafka to read the database's transaction log (this is faster).

> you'd write to Kafka, and the messages would be processed twice: once to write to the DB, and once to ElasticSearch

This would be equivalent to using a message queue, which (in contrast to log-ordered replication) does not provide the same consistency guarantees: in this case, (1) RYW for database writes and (2) the database always being at least as up-to-date as the search index.

It requires a lot of pampering, but I quite enjoy using it and discovering its possibilities. I am using it for the startup project I work on, which has a search feature very similar to Instagram's. Do you have any other suggestions for a search engine that is flexible enough? I don't want to couple things too tightly by using zombodb and similar stuff.
If you really need search, I think it's still the clear winner. I don't think it's terribly hard to manage/operate, just that you do need to do some initial planning, otherwise it will balloon out of control.

Dynamic mappings can mess things up really easily, so it's best to disable them in favor of a pre-defined static mapping for the type of documents you will be ingesting. What I've encountered in the past that usually causes things to break is when 90% of your documents contain a field called "Date" that holds an ANSI date, but the other 10% contain "Null" (a string instead of an ANSI date). Since those documents don't match the dynamically generated mapping, they fail to be indexed.
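
As a rough illustration of the static-mapping advice, here is a sketch of an index body with dynamic mapping set to `strict`, plus a small pre-flight check that catches the "Null" string case before indexing. The field names and the `is_indexable` helper are hypothetical, the body follows the typeless mapping layout of recent ES versions (older versions nest it under a type name), and "ANSI date" is assumed to mean `YYYY-MM-DD`:

```python
import json
from datetime import date

# Index body you might send when creating the index.
# "dynamic": "strict" makes ES reject documents with unmapped fields
# instead of guessing a type for them.
EVENT_INDEX_BODY = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "title": {"type": "text"},
            "Date": {"type": "date"},
        },
    }
}


def is_indexable(doc):
    """Local check: only mapped fields, and Date parses as YYYY-MM-DD."""
    allowed = set(EVENT_INDEX_BODY["mappings"]["properties"])
    if set(doc) - allowed:
        return False  # unmapped field, strict mapping would reject it
    value = doc.get("Date")
    if value is None:
        return True  # this sketch lets a missing or None date through
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False  # e.g. the literal string "Null"
    return True


print(json.dumps(EVENT_INDEX_BODY, indent=2))
print(is_indexable({"title": "Jazz night", "Date": "2019-06-02"}))
print(is_indexable({"title": "Broken import", "Date": "Null"}))
```

Running a check like this in the ingest path lets you log or repair the bad 10% instead of silently losing them at index time.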

Shard management is also critical and this largely depends on the type of data you are indexing. If the data in ES is unique (not just a copy of a database you already have), you will want to have some sort of cross-region/DC replication strategy as well as a backup strategy.

Fortunately both of these are pretty easy. ES has a mechanism of using tags that allows you to define things like regions, data centers, really whatever you want, and shards can be routed based on rules defined over these tags.

A setup I've used in the past is to have 5 nodes in the LAX DC and 5 nodes in the LAS DC; any data that is ingested into LAX is replicated into shards in LAS and vice versa.
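
The tag mechanism described here sounds like shard allocation awareness. A rough sketch of the settings involved, written as Python dicts; the attribute name `dc`, the values `lax` and `las`, and the node names are placeholders, and the exact keys should be checked against the docs for your ES version:

```python
import json

# Per-node attribute, normally set in each node's elasticsearch.yml.
# "dc" is an arbitrary attribute name chosen for this example.
NODE_SETTINGS = {
    "lax-node-1": {"node.attr.dc": "lax"},
    "las-node-1": {"node.attr.dc": "las"},
}

# Body for the cluster settings API (PUT _cluster/settings).
# With awareness on "dc", ES tries to keep a shard's primary and its
# replicas on nodes with different "dc" values.
CLUSTER_SETTINGS_BODY = {
    "persistent": {
        "cluster.routing.allocation.awareness.attributes": "dc",
    }
}

print(json.dumps(CLUSTER_SETTINGS_BODY, indent=2))
```

With at least one replica per shard, this is the kind of rule that would give the LAX/LAS layout above, where each DC holds a copy of the other's data.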

Backup to S3 is rather trivial now thanks to the built-in export options in the newer versions of ES.

With a little bit of planning ES can be a great addition to your stack, just be sure you do the initial engineering so you can avoid a big headache in the future.

Thanks, very insightful.
That's why whenever I need to have a good full text search I pick https://ravendb.net/ . Literally close to 0 maintenance required and it's very feature rich.
Here it is in glorious action, if anyone is curious: https://www.pincalendar.com/search/events/?q=