| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jillesvangurp 434 days ago

You are right that this is the hardest, and most important, thing in search to get right. It's usually referred to as ETL. Extract, transform, load. Load: for each thing, put it somewhere for processing. Transform: for each thing process it by applying some algorithm/algorithms in one or more steps. Load: for each thing, shove it into your store. It's the transform part that is important. Extract and Load are kind of trivial to implement usually. I've seen decent implementations of only a few lines of code. Transform is application specific business logic. E and L are just simple plumbing.

What you query on is not the same as what you store in your DB. And it can be expensive to calculate and re-calculate. Especially at scale. And iterating over all your stuff can be challenging too. It requires IO, memory, CPU, etc. Your application server is the wrong place. And so is your main application database.

The challenge with search is that querying just gets a lot easier if you calculate all the expensive stuff at index time rather than at query time. Different tokenization strategies for different languages, calculating things like page rank, normalization, tokenization, semantic vectors, enriching data with other data (including denormalizing things from other data sources), etc. There are a lot of tricks you can use to make stuff easier to find.

Foregoing all of that indeed makes things simpler and faster. But your search quality will probably suffer. And if you aren't measuring that to begin with, it is probably not great. Doing all these things on write in your main database schema is going to cause other issues (slow writes, lots of schema migrations, complicated logic around CRUD, etc.). The rookie mistake with ETL is just joining the three steps into one thing that then becomes hard to run, evolve, and scale. I see that with a lot of my clients. This is textbook "doing it wrong". It's usually neither fast nor very good at search.

Even if you are going to use postgresql as your main search index, you are probably doing it wrong if your search table/schema isn't decoupled from your main application database via some ETL pipeline. That logic has to live somewhere. Even if it is a bit of a simplistic/limited "do everything on INSERT" kind of thing. That's going to hold back your search quality until you address it. There is no magic feature in postgresql that can address that. Nor in Elasticsearch (though it comes with way more features for this).

I've worked with postgresql's FTS a few times. It's pretty limited compared to Elasticsearch. Anybody running performance benchmarks should be running quality benchmarks instead. Being fast is easy if you skip all the difficult stuff. Being high quality and fast is a big challenge. And it's a lot easier with proper tools and a proper ETL pipeline.

And indeed engineering that such that the two stay in sync requires knowing how to engineer that properly. I usually start with that when I consult clients looking to level up their home grown search solutions to something a bit better.

Of course if you do ETL properly, having your DB and search index in the same place stops making sense. And if you are going to separate them, you might as well pick something more optimal for the job. There are a lot of decent solutions out there for this.