Hacker News
by EGreg 1040 days ago
Yeah, but the big question I kept having, and never found an answer to, is:

How do you encode the private data into the vectors? It is a bunch of text, but how do you choose the vector values in the first place? What software does that? Isn’t that basically an ML task with its own weights? That’s what classifiers do!

I was surprised everyone had been writing about that but neglecting to explain this piece. Like math textbooks that “leave it as an exercise to the reader”.

Claude with its 100k context window doesn’t need to do this vector encoding. Is there anything like that in open source AI at the moment?

4 comments

It's possible to extend the effective context window of many OSS models using various techniques. For the Llama-related models and others, there's a technique called "RoPE scaling" which allows you to run inference over a longer context window than the model was originally trained for. (This reddit post helps highlight this fact: https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkawar...)
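To make the idea a bit more concrete, here is a minimal sketch of what RoPE scaling changes, in plain Python. It assumes the usual rotary-embedding frequency formula with base 10000. The function names and the scale factor are made up for the example, and the NTK-aware variant uses the commonly cited rule of multiplying the base by scale^(dim/(dim-2)), which real implementations may refine.

```python
import math

def rope_inv_freq(dim, base=10000.0):
    """Per-pair rotation frequencies used by rotary position embeddings."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def rope_angles(position, dim, base=10000.0, linear_scale=1.0):
    """Rotation angles for one token position.

    linear_scale > 1 squeezes positions together (linear scaling, also called
    position interpolation), so a position beyond the trained range is mapped
    back onto angles the model saw during training.
    """
    return [(position / linear_scale) * f for f in rope_inv_freq(dim, base)]

def ntk_scaled_base(dim, scale, base=10000.0):
    """NTK-aware variant (assumed formula): stretch the base, not the positions."""
    return base * scale ** (dim / (dim - 2))

if __name__ == "__main__":
    dim, scale = 128, 4.0
    # With linear scaling by 4, position 8192 reuses the angles of position 2048.
    print(rope_angles(8192, dim, linear_scale=scale)[:3])
    print(rope_angles(2048, dim)[:3])
    # With the NTK-aware base, the fastest frequency is untouched and the
    # slowest one is slowed down by the scale factor.
    print(rope_inv_freq(dim, ntk_scaled_base(dim, scale))[-1])
```

Either way, the model still sees angles in roughly the range it was trained on, which is why this can work without retraining, though quality usually degrades somewhat without fine-tuning.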

But even at 100K, you do eventually run out of context. You would with 1M tokens too. 100K tokens is the new 64K of RAM, you're going to end up wanting more.

So techniques like RAG that others have mentioned become necessary at some point, at least with models that look like the ones we have today.

The most straightforward way, but of course you can fiddle around a lot:

You use sentence transformers (https://www.sbert.net/).

You use a strong baseline like all-MiniLM-L6-v2. (Or you get more fancy with something from the Massive Text Embedding Benchmark, https://huggingface.co/spaces/mteb/leaderboard)

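You break your text into sentences or paragraphs that fit within the model's maximum sequence length, as counted by the sentence transformers tokenizer (512 tokens for many models; all-MiniLM-L6-v2 truncates at 256 word pieces by default).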

You embed all your texts and insert them into your vector DB.
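The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the helper names (`chunk_paragraphs`, `build_index`, `sbert_embedder`) are hypothetical, the "vector DB" is just an in-memory list, and the default token counter is a whitespace word count standing in for the real model tokenizer. The only real library calls are `SentenceTransformer(...)` and `model.encode(...)` from sentence-transformers, and they are isolated in one function so the rest runs without the package.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
EmbedFn = Callable[[Sequence[str]], List[Vector]]

def chunk_paragraphs(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Split on blank lines, then pack paragraphs into chunks of at most max_tokens.

    count_tokens defaults to a whitespace word count, which only approximates
    the model tokenizer. A single paragraph longer than max_tokens is kept
    whole here; a real pipeline would split it further.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = count_tokens(para)
        if current and used + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def build_index(chunks: Sequence[str], embed: EmbedFn) -> List[Tuple[Vector, str]]:
    """Embed every chunk and keep (vector, text) pairs: a toy in-memory 'vector DB'."""
    vectors = embed(chunks)
    return list(zip(vectors, chunks))

def sbert_embedder(model_name: str = "all-MiniLM-L6-v2") -> EmbedFn:
    """Return an embed function backed by sentence-transformers.

    Requires `pip install sentence-transformers`; the model is downloaded on first use.
    """
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(list(texts), normalize_embeddings=True).tolist()
```

Usage would look something like `index = build_index(chunk_paragraphs(my_text, max_tokens=200), sbert_embedder())`, with `max_tokens` set to whatever the chosen model accepts. A real vector DB replaces the list, but the shape of the data (one vector per chunk, stored next to the text) stays the same.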

There are many ways to answer this question, but in essence: word vectorization assigns a numerical representation to a specific word. Embeddings, on the other hand, also turn words into numerical representations, but keep semantically similar words close together in the vector space.
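A tiny sketch of that difference, in plain Python. The three-word vocabulary and the 2-d "embedding" values are invented for this example only; in a real model the values are learned from data and have hundreds of dimensions.

```python
import math

vocab = ["cat", "dog", "car"]

def one_hot(word):
    """Plain word vectorization: each word gets its own axis and nothing more."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hand-picked toy embeddings (hypothetical values, not from any model).
toy_embedding = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "car": [0.1, 0.9],
}

def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

if __name__ == "__main__":
    # One-hot: every pair of distinct words is equally far apart.
    print(distance(one_hot("cat"), one_hot("dog")))
    print(distance(one_hot("cat"), one_hot("car")))
    # Toy embeddings: "cat" sits closer to "dog" than to "car".
    print(distance(toy_embedding["cat"], toy_embedding["dog"]))
    print(distance(toy_embedding["cat"], toy_embedding["car"]))
```

The one-hot vectors carry no notion of meaning, while the dense vectors encode it as geometry, which is what makes similarity search possible.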

Yes, turning words into vectors is its own class of machine learning. You can learn a lot from the NLP course on Hugging Face https://huggingface.co/learn/nlp-course/chapter1/1 (and on YouTube).

One way to do it is to use cosine similarity[0]. The reason to do this is to get around the context window limitation: you retrieve only the text chunks most similar to your question and hope that they contain the correct information to answer it.
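A minimal sketch of that retrieval step, in plain Python. It assumes you already have one vector per chunk stored as (vector, text) pairs; the function names and the prompt layout are hypothetical, and a real vector DB would do the ranking for you.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either one is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=3):
    """Return the texts of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[0]),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]

def build_prompt(question, retrieved_chunks):
    """Put only the retrieved chunks into the prompt, not the whole corpus."""
    context = "\n\n".join(retrieved_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The query vector has to come from the same embedding model as the chunk vectors, otherwise the similarity scores are meaningless.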

How do you know that Claude doesn't do this? If you have multiple books, you end up with more than 100k context, and running the model with full context takes more time so it is more expensive as well.

[0] https://en.wikipedia.org/wiki/Cosine_similarity