| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by 0x202020 1197 days ago

I’ve been really impressed with OpenAI embeddings + vector searches for some full document searches compared to something like standard elasticsearch over a large body of text. Something I think I’m personally missing from the ChatGPT/GPT4/LLM conversation with regards to information retrieval is nested/graph hierarchies.

An example from a previous job where we used a hand tooled NLP system was querying for doctors/dentists/optometrists and being able to take something like “dentist near me who is available in the afternoons and speaks spanish”. We would parse this user query into a few different queries that would run against a search cluster and database to return the filtered result set, or the closest output.

What would be the ideal way to prepare or tokenize this data for querying with an LLM? It’s partially text (match dentist, speak Spanish), partially geographic (near me, doing a geo radius of N miles from providers location, and part filtering (who meets all those criteria and has availability in a time frame). Is this a use case for large token sizes to be able to take in all possible providers? Or parsing a query more easily from human language -> SQL/other data store query language? Or perhaps figuring out another way to encode this data?