Hacker News new | ask | show | jobs
by yoelhacks 610 days ago
I used to work on an app that very heavily leaned on Elasticsearch to do advanced text querying for similarities between a 1-2 sentence input and a corpus of paragraph+ length documents.

It was fascinating how much tokenization strategies could affect a particular subset of queries. A really great example is a "W-4" or "W4" Standard tokenization might split on the "-" or split on letter / number boundaries. That input now becomes completely unidentifiable in the index, when it otherwise would have been a very rich factor in matching HR / salary / tax related content.

Different domain, but this doesn't shock me at all.

2 comments

The trained embedding vectors for the token equivalents of W4 and W-4 would be mapped to a similar space due to their appearance in the same contexts.
The point of the GP post is that the "w-4" token had very different results from ["w", "-4"] or similar algorithms where the "w" and "4" wound up in separate tokens.
Yes, used to work on a system that has elasticsearch and also some custom Word2Vec models. What had the most impact on the quality of the search is ES and on the quality of our W2V model were tokenization and a custom ngrams system.