by bcherry
609 days ago
It's kind of interesting, because I think most people implementing RAG aren't thinking about tokenization at all. They're thinking about embeddings:

1. chunk the corpus of data (various strategies, but they're all somewhat intuitive)
2. compute an embedding for each chunk
3. generate the search query or queries
4. compute an embedding for each query
5. rank corpus chunks by distance to the query (vector search)
6. construct return values (e.g. chunk + surrounding context, or the whole doc, etc.)

So this article really gets at the importance of a hidden, relatively mundane-feeling operation that can have an outsized impact on the performance of the system.

I do wish it had more concrete recommendations in the last section, and a code sample of a robust project with normalization, fine-tuning, and eval.
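For what it's worth, the six steps above can be sketched in a few lines of Python. This is a toy under loud assumptions, not a real system: `embed` is a hashed bag-of-words stand-in for an actual embedding model, `tokenize` is a deliberately naive lowercase-and-split normalizer, and every name here (`chunk`, `tokenize`, `embed`, `search`, `DIM`) is made up for illustration. The point is only to show where the tokenization step hides inside the embedding call.

```python
import math
import re
import zlib

DIM = 1024  # size of the toy embedding vector (arbitrary assumption)

def chunk(text, size=8):
    """Step 1: split the corpus into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def tokenize(text):
    """The hidden step: normalize (lowercase) and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text):
    """Steps 2 and 4: toy hashed bag-of-words vector, L2-normalized.

    A real pipeline would call an embedding model here; its tokenizer
    plays the role that tokenize() plays in this sketch.
    """
    vec = [0.0] * DIM
    for tok in tokenize(text):
        # crc32 is deterministic across runs, unlike the built-in hash()
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(a, b):
    """Dot product of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

def search(query, chunks, top_k=2):
    """Steps 3 to 6: embed the query, rank chunks, return chunk + neighbors."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(c)), i) for i, c in enumerate(chunks)),
        reverse=True,
    )
    results = []
    for score, i in scored[:top_k]:
        lo, hi = max(0, i - 1), min(len(chunks), i + 2)
        results.append({
            "score": score,
            "chunk": chunks[i],
            "context": " ".join(chunks[lo:hi]),  # chunk plus surrounding context
        })
    return results
```

Calling something like `search("TOKENIZATION tokens", chunks)` only matches lowercase text in the corpus because `tokenize` lowercases both sides. Drop the `.lower()` and the same query quietly stops matching, which is roughly the kind of failure the article is describing.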