	by andy99 966 days ago
	What is the use case for an 8k token embedding? My (somewhat limited) experience with long context models is they aren't great for RAG. I get the impression they are optimized for something else, like writing 8k+ tokens rather than synthesizing responses. Isn't the normal way of using embeddings to find relevant text snippets for a RAG prompt? Where is it better to have coarser retrieval?

3 comments

dragonwriter 965 days ago

> What is the use case for an 8k token embedding?

Calculating embeddings on larger documents than smaller-window embedding models.

> My (somewhat limited) experience with long context models is they aren't great for RAG.

The only reason they wouldn't be great for RAG is that they aren't great at using information in their context window, which is possible (ISTR that some models have a strong recency bias within the window, for instance) but I don't think is a general problem of long context models.

> Isn't the normal way of using embedding to find relevant text snippets for a RAG prompt?

I would say the usual use is for search and semantic similarity comparisons generally. RAG is itself an application of search, but it's not the only one.

3abiton 965 days ago

I wonder how the performance fares when context size is increased. Intuitively it should be higher, but some quantized models I've tested showed noticeably worse performance.

Kubuxu 965 days ago

Your KV cache size is linear in context size, which might leave you tight on memory. There is also the increased cost of recalculating the KV cache for the context window when the window has to move, but this is close to being solved with streaming LLMs.
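
To put a rough number on the linear growth, here is a minimal sketch of the usual back-of-the-envelope estimate for a decoder-style model. The layer count, head count, and head size below are hypothetical values picked for illustration, not the dimensions of any model discussed in this thread.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for the cached keys and values of one sequence, in bytes."""
    # Factor of 2: each layer stores one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, chosen only for illustration (16-bit values).
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for tokens in (2048, 4096, 8192):
    gib = kv_cache_bytes(tokens, **shape) / 1024**3
    print(f"{tokens:>5} tokens -> {gib:.1f} GiB")
```

With these assumed dimensions the cache costs 0.5 MiB per token, so doubling the context doubles the memory.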

woadwarrior01 965 days ago

BERT style encoder-only models, like the embedding model being discussed here, don't need a KV cache for inference. A KV cache is only needed for efficient inference with encoder-decoder and decoder-only (aka GPT) models.

teaearlgraycold 966 days ago

You could get a facsimile of a summary for a full article or short story. Reducing an 8k token article to a summary using a completions model would cost far more. So if you need to search through collections of contracts, scientific papers, movie scripts, etc. for recommendations/clustering then bigger input sizes can do that in one shot.

Think of it like skipping the square root step in Euclidean distance. Perfectly valid as long as you don’t want a distance so much as a way to compare distances. And doing so skips the most computationally expensive operation.
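
The square-root analogy can be checked directly. This is a small sketch with made-up points; the function names are just for illustration. Because the square root is monotonic, ranking by squared distance gives the same order as ranking by true distance.

```python
import math

def squared_distance(a, b):
    """Sum of squared coordinate differences (Euclidean distance without the root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean_distance(a, b):
    """True Euclidean distance: the square root of the squared distance."""
    return math.sqrt(squared_distance(a, b))

def rank_by(distance_fn, query, points):
    """Return indices of points sorted from nearest to farthest."""
    return sorted(range(len(points)), key=lambda i: distance_fn(query, points[i]))

query = (0.0, 0.0)
points = [(3.0, 4.0), (1.0, 1.0), (6.0, 8.0)]

# Both calls produce the same nearest-to-farthest ordering.
print(rank_by(squared_distance, query, points))
print(rank_by(euclidean_distance, query, points))
```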

refulgentis 966 days ago

I think I'm missing something: like, yeah, it's vector search for bigger text chunks. But arguably vector search with bigger text chunks is _definitively_ worse -- this isn't doing summarization, just turning about 25 pages of text into 1024 floats, which you can then compare with cosine similarity to measure the semantic similarity to other text.

I'd much rather know what paragraph to look in than what 25 pages to look in
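
For reference, the cosine similarity comparison mentioned above is just a normalized dot product. A minimal pure-Python sketch follows; real embeddings would have on the order of 1024 dimensions, and the short vectors here are toy values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors: same direction, orthogonal, and opposite.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

The score depends only on direction, not length, which is why it says nothing about where inside a 25-page input the match came from.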

simonw 966 days ago

I imagine it's more useful for finding related articles and clustering things than for semantic search, which will work much better against smaller chunks - especially if you're implementing Retrieval Augmented Generation.

rolisz 966 days ago

I think the point is: if you compress 25 pages of text into 1024 floats, you will lose a ton of information, regardless of what the use case is, so you're probably still better off with chunking.

TeMPOraL 965 days ago

> if you compress 25 pages of text into 1024 floats, you will lose a ton of information

Sure, but then if you do it one page at a time, or one paragraph at a time, you lose a ton of meaning - after all, individual paragraphs aren't independent of each other. And meaning is kind of the whole point of the exercise.

Or put another way, squashing a ton of text loses you some high-frequency information, while chunking cuts off the low-frequency parts. Ideally you'd want to retain both.

kordlessagain 965 days ago

I think the assumption that you lose a ton of (low-frequency) meaning by embedding separate chunks is less likely to hold than the assumption that you lose high-frequency meaning by embedding the whole document at once. As you say, doing both is probably a good strategy, and I think that's why we see a lot of "summarize this text" approaches.

I use a multi-pronged approach to this based on a special type of summarization. I chunk on sentences using punctuation until they are just over 512 characters, then I embed them. After embedding, I ask a foundation model to summarize (or ask a question about the chunk) and then generate keyterms for it. Those keyterms are stored along with the vector in the database. During search, I use the user's input to do a vector search for matching chunks, then pull their keyterms in. Using those keyterms, I do set operations to find related chunks. I then run a vector search against these and combine the results with the top matches from the original vector search to assemble new prompt text.

This strategy is based on the idea of a "back of the book index". It is entirely plausible to look for "outliers" in the keyterms and consider throwing those chunks with those keyterms in there to see if it nets us understanding of some "hidden" meaning in the document.

There is also a means to continue doing the "keyterm" extraction trick as the system is used. Keyterms from answers as well as user prompts may be added to the existing index over time, thus helping improve the ability to return low frequency information that may be initially hidden.
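
A rough sketch of two of the steps described above: grouping sentences into chunks of just over 512 characters, and using keyterm set operations to find related chunks. This is not the commenter's actual code. The function names, the regex used for sentence splitting, and the sample keyterms are all assumptions, and the embedding and summarization calls are left out.

```python
import re

def chunk_sentences(text, min_chars=512):
    """Group whole sentences into chunks, closing each chunk once it exceeds min_chars."""
    # Assumed splitting rule: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        current = f"{current} {sentence}".strip()
        if len(current) > min_chars:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)  # trailing chunk may be shorter than min_chars
    return chunks

def related_chunks(hit_ids, keyterms_by_chunk):
    """Find chunks that share at least one keyterm with the vector-search hits."""
    hit_terms = set().union(*(keyterms_by_chunk[i] for i in hit_ids))
    return sorted(
        chunk_id
        for chunk_id, terms in keyterms_by_chunk.items()
        if chunk_id not in hit_ids and terms & hit_terms
    )

# Hypothetical keyterm index: chunk id -> keyterms generated for that chunk.
keyterms = {
    0: {"embedding", "vector"},
    1: {"vector", "index"},
    2: {"cooking"},
}
print(related_chunks([0], keyterms))
```

Here chunk 1 is pulled in because it shares the keyterm "vector" with the hit, even if its own embedding did not rank highly for the query.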

simonw 966 days ago

I've been getting great results for related documents by embedding entire blog posts, e.g. here: https://til.simonwillison.net/gis/pmtiles#related

I'm not sure how I would do that after chunking.

thomasahle 965 days ago

Did you compare with simple baselines like bag-of-words and word vectors?

imranhou 965 days ago

Good point. I wonder how different it is to use a large context here versus having some other model summarize an 8k article into a short paragraph and embedding that paragraph instead, where such a large context wouldn't be necessary.

teaearlgraycold 965 days ago

Ever read the back of a book?

TeMPOraL 965 days ago

You mean the marketing blurb? Those tend to carry low information value, sometimes even negative - as in, if you didn't know anything else about the book, reading the blurb will make you even more wrong about it than you were. This is a common feature of marketing copy.

scotty79 965 days ago

Isn't it up to 8k? So you can index your documents by paragraphs if you prefer?

antman 965 days ago

you could do both

kristopolous 966 days ago

Is this what you mean by RAG? https://www.promptingguide.ai/techniques/rag?

simonw 966 days ago

I have an explanation of RAG in the context of embeddings here: https://simonwillison.net/2023/Oct/23/embeddings/#answering-...

Grimburger 965 days ago

You could just sum it up for us all rather than do a divert to your blog?

It's Retrieval Augmented Generation btw.

To quote:

> The key idea is this: a user asks a question. You search your private documents for content that appears relevant to the question, then paste excerpts of that content into the LLM (respecting its size limit, usually between 3,000 and 6,000 words) along with the original question.

> The LLM can then answer the question based on the additional content you provided.
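
The flow in that quote can be sketched in a few lines. This is a toy illustration under loud assumptions: `build_rag_prompt` is a made-up name, retrieval is faked with word overlap instead of embeddings, and the prompt layout is arbitrary. The final call to an LLM is left out.

```python
def build_rag_prompt(question, documents, max_words=3000):
    """Pick the most relevant excerpts within a word budget and build a prompt."""
    q_terms = set(question.lower().split())
    # Stand-in for embedding search: rank by the number of shared words.
    ranked = sorted(
        documents,
        key=lambda d: len(q_terms & set(d.lower().split())),
        reverse=True,
    )
    excerpts, used = [], 0
    for doc in ranked:
        n_words = len(doc.split())
        if used + n_words > max_words:
            continue  # skip excerpts that would exceed the size limit
        excerpts.append(doc)
        used += n_words
    context = "\n\n".join(excerpts)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

docs = ["Embeddings map text to vectors.", "Bananas are yellow."]
print(build_rag_prompt("What do embeddings map text to?", docs, max_words=5))
```

With the tiny budget of 5 words, only the most relevant excerpt fits, so the unrelated one is left out of the prompt.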

simonw 965 days ago

> You could just sum it up for us all rather than do a divert to your blog?

Why? Have links gone out of fashion?

I even linked directly to the relevant section rather than linking to the top of the page.

The paper that coined the term used the hyphen, though I think I prefer it without: https://arxiv.org/abs/2005.11401

Grimburger 965 days ago

> Have links gone out of fashion?

Yes.

You wrote far more words than needed to answer the comment, so I did it for you instead.

simonw 965 days ago

One of the reasons I write so much stuff is so I can provide links to things I've written to answer relevant questions.

gkbrk 965 days ago

"Links have gone out of fashion" is an odd thing to write on a Link Aggregator website.

kristopolous 965 days ago

You know you're responding to a programmer famous enough to have a Wikipedia page, right?

https://en.m.wikipedia.org/wiki/Simon_Willison

teaearlgraycold 966 days ago

Yes