| HN Mirror



	by davidsainez 257 days ago
	> We tried multiple vectorization and classification approaches. Our data was heavily imbalanced and skewed towards negative cases. We found that TF-IDF with 1-gram features paired with XGBoost consistently emerged as the winner.
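For anyone who wants to see what the "TF-IDF with 1-gram features" half of that looks like, here is a minimal stdlib-only sketch. It is not the quoted authors' code: the whitespace tokenizer, the smoothed IDF formula, and the helper names (`fit_idf`, `tfidf_vector`, `pos_weight`) are all assumptions for illustration. The classifier stage is left out. `pos_weight` only computes the negative-to-positive ratio that is commonly passed to XGBoost's `scale_pos_weight` for imbalanced data; the quote does not say whether the authors did that.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace split. Each token is one 1-gram feature.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(doc, idf):
    # Raw term count times IDF, then L2-normalised.
    # Terms never seen during fit_idf are dropped.
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def pos_weight(labels):
    # Ratio of negatives to positives (labels are 0/1, at least one positive).
    pos = sum(labels)
    return (len(labels) - pos) / pos
```

Rare terms end up with larger weights than terms that appear in most documents, which is the property a tree model can then split on.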

2 comments

andai 257 days ago

Anthropic found a similar result for retrieval: embeddings + BM25 keyword search (a variant of TF-IDF) produced significantly better results than embeddings alone.

https://www.anthropic.com/engineering/contextual-retrieval

They also found improvements from augmenting the chunks with Haiku by having it add a summary based on extra context.

That seems to benefit both the keyword search and the embeddings by acting as keyword expansion. (Though it's unclear to me if they tried actual keyword expansion and how that would fare.)
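A rough sketch of the keyword + embedding combination, under stated assumptions: the BM25 below is the standard Okapi formula with common defaults (`k1=1.5`, `b=0.75`), the embedding ranking is taken as a precomputed list because no embedding model is included here, and the two rankings are merged with reciprocal rank fusion (`k=60`). RRF is one common fusion choice, picked for illustration, not necessarily what the linked post uses. Function names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Okapi BM25 over whitespace-tokenised, lowercased documents.
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter()
    for d in tokenized:
        df.update(set(d))  # document frequency per term
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def ranking(scores):
    # Document indices ordered from best to worst score.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def rrf(rankings, k=60):
    # Reciprocal rank fusion: each ranking contributes 1 / (k + rank).
    fused = Counter()
    for r in rankings:
        for rank, doc_id in enumerate(r, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda i: fused[i], reverse=True)
```

Usage would be something like `rrf([ranking(bm25_scores(query, docs)), embedding_ranking])`, where `embedding_ranking` is whatever order a vector search returns for the same documents.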

---

Anyway what stands out to me most here is what a Rube Goldberg machine it is. Embeddings, keywords, fusion, contextual augmentation, reranking... each adding marginal gains.

But then the whole thing somehow works really well together (~1% fail rate on most benchmarks, worse for code retrieval).

I have to wonder how this would look if it wasn't a bunch of existing solutions taped together, but actually a fully integrated system.


davidsainez 257 days ago

Thanks for sharing! I am working on a RAG engine and that document provides great guidance.

And, agreed, each individual technique seems marginal but they really add up. What seems to be missing is some automated layer that determines the best way to chunk documents for embedding. My use case is mostly normalized, mostly technical documents, so I have a pretty clear idea of how to chunk to preserve semantics. But I imagine that for generalized documents it is a lot trickier.


killerstorm 257 days ago

Well, "vectorization" can be anything. BERT is in the same capability class as GPT, very different from the LSA people did in the 1980s...
