Hacker News new | ask | show | jobs
by bcherry 609 days ago
It's kind of interesting because I think most people implementing RAG aren't even thinking about tokenization at all. They're thinking about embeddings:

1. chunk the corpus of data (various strategies but they're all somewhat intuitive)

2. compute embedding for each chunk

3. generate search query/queries

4. compute embedding for each query

5. rank corpus chunks by distance to query (vector search)

6. construct return values (e.g chunk + surrounding context, or whole doc, etc)

So this article really gets at the importance of a hidden, relatively mundane-feeling, operation that occurs which can have an outsized impact on the performance of the system. I do wish it had more concrete recommendations in the last section and code sample of a robust project with normalization, fine-tuning, and eval.