by davidsainez
209 days ago
> We tried multiple vectorization and classification approaches. Our data was heavily imbalanced and skewed towards negative cases. We found that TF-IDF with 1-gram features paired with XGBoost consistently emerged as the winner.

https://www.anthropic.com/engineering/contextual-retrieval
They also found improvements from augmenting the chunks: Haiku adds a short summary to each chunk, based on extra context from the rest of the document.
That seems to benefit both the keyword search and the embeddings by acting as a form of keyword expansion. (Though it's unclear to me whether they tried actual keyword expansion and how that would fare.)
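To make the keyword-expansion point concrete, here is a toy sketch. The chunk, the context sentence, the query, and the helper names are all made up for illustration; a real setup would use BM25 and an LLM-written summary rather than raw term overlap.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens as a set (toy tokenizer, not BM25)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_overlap(query, chunk):
    """Count distinct query terms that literally appear in the chunk."""
    return len(tokens(query) & tokens(chunk))

# Hypothetical data: a chunk that lost its context when the document was split,
# and the kind of summary sentence a model might prepend to it.
chunk = "Revenue grew 3% over the previous quarter."
context = "This chunk is from ACME Corp's Q2 2023 SEC filing."
query = "ACME Q2 2023 revenue growth"

plain = keyword_overlap(query, chunk)
augmented = keyword_overlap(query, context + " " + chunk)
print(plain, augmented)
```

The bare chunk only matches on "revenue". With the summary prepended it also matches on the company name and the period, which is the same effect keyword expansion would aim for.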
---
Anyway, what stands out to me most here is what a Rube Goldberg machine it is. Embeddings, keywords, fusion, contextual augmentation, reranking... each adding marginal gains.
But then the whole thing somehow works really well together (~1% failure rate on most benchmarks, worse for code retrieval).
I have to wonder how this would look if it weren't a bunch of existing solutions taped together, but a fully integrated system.
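For a sense of how small each taped-on piece is, here is a minimal sketch of the fusion step using reciprocal rank fusion. That is one common way to merge an embedding ranking with a keyword ranking; I'm assuming it here, not claiming it's the exact variant the linked post uses. The function name, the `k=60` default, and the document IDs are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from the two retrievers, best match first.
embedding_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([embedding_hits, keyword_hits])
print(fused)
```

Documents that both retrievers return end up ahead of documents only one of them found, and a reranker would then reorder the top of this fused list.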