
by inkyoto 596 days ago

> The key word being “almost”. Yes, you can get similarity matches between job requirements and candidate resumes, but those matches are not useful for the task of finding an optimal candidate for a job.

Text embeddings are not about matching; they are about extracting semantic topics and semantic context. Matching comes next, if required.

If an LLM is used to generate the text embeddings, it effectively «expands» the semantic context of each keyword. E.g. «GenAI» would be expanded into directly and loosely related semantic topics, say «LLM», «NLP» (with lesser relevance, though), «artificial intelligence», «statistics» (more distant) and so forth. The generated embeddings carry a much richer semantic context, which allows for straightforward similarity search as well as exploratory radial search. It also works well across languages, provided the LLM was trained on a sufficiently large and linguistically diverse corpus.
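To make the two search modes concrete, here is a toy sketch. It is not the client implementation: the vectors are hand-made stand-ins for real LLM embeddings, the names `knn` and `radial` are made up for illustration, and «radial search» is read here as «everything within a similarity threshold of the query».

```python
import math

# Hand-made toy vectors standing in for real LLM embeddings (hypothetical values).
EMBEDDINGS = {
    "GenAI":      [0.90, 0.80, 0.10],
    "LLM":        [0.95, 0.70, 0.05],
    "NLP":        [0.60, 0.90, 0.20],
    "statistics": [0.20, 0.50, 0.90],
    "gardening":  [0.00, 0.05, 1.00],
}

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def ranked(query):
    """All other terms, most similar to the query first."""
    q = EMBEDDINGS[query]
    scored = [(cosine(q, v), term) for term, v in EMBEDDINGS.items() if term != query]
    return sorted(scored, reverse=True)

def knn(query, k):
    """Similarity search: the k nearest terms."""
    return [term for _, term in ranked(query)[:k]]

def radial(query, min_similarity):
    """Radial search: every term within the given similarity threshold."""
    return [term for score, term in ranked(query) if score >= min_similarity]

print(knn("GenAI", 2))       # ['LLM', 'NLP']
print(radial("GenAI", 0.5))  # ['LLM', 'NLP', 'statistics']
```

Widening or narrowing the threshold is what makes the radial variant useful for exploration: the closely related topics come back first, the more distant ones only once the radius grows.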

Fun fact: I have recently delivered an LLM-assisted (to generate the text embeddings) k-NN similarity search for a client of mine. For the hell of it, we searched for «the meaning of life» in Cantonese, English, Korean, Russian and Vietnamese.

It pulled up the same top search result across the entire dataset for the query in English, Korean and Russian. Effectively, it turned into a Babelfish of search.

The Cantonese and Vietnamese versions diverged and were less relevant, as the LLM did not have a substantial corpus in either language. This can be easily fixed in the future: once a new LLM version trained on a better corpus in both Cantonese and Vietnamese is available, the text embeddings for the dataset can simply be regenerated. The implementation won't have to change.
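A rough sketch of why the implementation can stay put, under loud assumptions: `VectorIndex`, `make_embedder` and the two vocabularies are hypothetical, and the «models» are tiny word-to-concept lookup tables rather than real LLMs. The point is only that the search code depends on an embedding function, so swapping in a better one and re-embedding the dataset is the whole upgrade.

```python
import math

DIMENSIONS = 3  # toy concept axes: 0 = meaning, 1 = life, 2 = tax

# Hypothetical "model v1": knows English and Russian words only.
VOCAB_V1 = {"meaning": 0, "life": 1, "tax": 2, "filing": 2,
            "смысл": 0, "жизни": 1}
# Hypothetical "model v2": same as v1, plus some Vietnamese coverage.
VOCAB_V2 = {**VOCAB_V1, "nghĩa": 0, "sống": 1}

def make_embedder(vocab):
    """Build a toy embedder: count known words per concept axis."""
    def embed(text):
        vec = [0.0] * DIMENSIONS
        for word in text.lower().split():
            if word in vocab:
                vec[vocab[word]] += 1.0
        return vec
    return embed

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

class VectorIndex:
    def __init__(self, docs, embed):
        self.docs = list(docs)
        self.reindex(embed)

    def reindex(self, embed):
        """Regenerate every stored embedding with the given embedder."""
        self.embed = embed
        self.vectors = [embed(doc) for doc in self.docs]

    def search(self, query, k=1):
        """Return the k best (score, document) pairs, best first."""
        q = self.embed(query)
        scored = [(cosine(q, v), doc) for doc, v in zip(self.docs, self.vectors)]
        return sorted(scored, reverse=True)[:k]

DOCS = ["the meaning of life", "tax filing guide"]
index = VectorIndex(DOCS, make_embedder(VOCAB_V1))

print(index.search("meaning of life"))    # English query finds the essay
print(index.search("смысл жизни"))        # Russian query finds the same document
print(index.search("ý nghĩa cuộc sống"))  # v1 has no Vietnamese: score is 0.0

index.reindex(make_embedder(VOCAB_V2))    # new model, same search code
print(index.search("ý nghĩa cuộc sống"))  # now lands on the same document
```

In a real setup the reindex step is the expensive part (every document goes through the model again), but nothing in `search` has to be touched.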