by ccozan 1040 days ago
OK, since this is all running privately, how can I add my own private data? For example, I have a 20+ year email archive that I'd like to have ingested.
4 comments

The simplest way, as rdedev is describing, is to do Retrieval Augmented Generation (RAG) in your prompting. This would require the addition of a vector database and a text embedding model. There are many open source / local / private options for that.

The steps would then be:
1. Embed your private data in chunks and store the resulting embeddings in a vector database.
2. In your prompting workflow, when a user queries the chat model, embed their query using the same embedding model.
3. Retrieve the most similar chunks of text from your vector database based on cosine similarity.
4. In the prompt for the chat response, provide the model with those chunks of text as context.

For example, if you asked "who have I discussed Ubuntu with?", it might retrieve emails that have similar content. Then the model will be able to answer informed by that context.
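
The four steps above can be sketched in plain Python. This is only a toy under stated assumptions: `embed` is a hypothetical stand-in (a normalized bag-of-words vector) for a real text embedding model, a Python list stands in for the vector database, and the sample emails are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: L2-normalized word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a, b):
    # Vectors from embed() are unit length, so the dot product is the cosine.
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


# Step 1: embed private data in chunks and store the embeddings.
chunks = [
    "From Alice: the Ubuntu upgrade on the build server went fine.",
    "From Bob: lunch on Friday? The new place downtown is open.",
    "From Carol: I switched my laptop from Ubuntu to Fedora last week.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query, k=2):
    # Step 2: embed the query. Step 3: rank stored chunks by cosine similarity.
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query):
    # Step 4: give the chat model the retrieved chunks as context.
    context = "\n".join(f"- {c}" for c in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


prompt = build_prompt("Who have I discussed Ubuntu with?")
print(prompt)
```

A real setup would swap `embed` for an actual embedding model and `store` for a vector database; the control flow stays the same.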

Looks like this is not easy at all for a non-ML expert. And the required computing power is probably still out of reach for mere mortals. I have a similar use case to the parent: technical books. I'd love to be able to ask where a certain topic is discussed in my PDF archive and have the AI reply with references and possibly a significant piece of the relevant articles, with images (or local links to them).
This is already a feature of Adobe PDF reader professional (called index mode). There’s also an app on macOS called “pdf search” which does quite a good job. I use it for the exact reasons you describe; I’ve got a collection of technical books on AWS and Azure, and I reference them all the time through these apps, which serve as my local search engine.
The computing power is definitely not out of reach of mere mortals. I'm working on software that does this for emails and common documents, generating a hybrid semantic (vector) and keyword search system over all your data, locally.

The computing power we require is simply what's available in any M1/M2 Mac, and the resource usage for indexing and search is negligible. Even that isn't a hard requirement; any modern PC could index all your emails and do the local hybrid search part.
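
As a rough illustration of what a hybrid semantic and keyword search can look like, the sketch below blends a keyword-overlap score with a vector-similarity score. Everything in it is an assumption for illustration: the `alpha` weight, the scoring functions, and the tiny hand-written vectors standing in for a real embedding model. It is not the software described in this comment.

```python
import math

# Hypothetical 3-dimensional "embeddings", hand-written for this example.
# A real system would get these from an embedding model.
docs = {
    "invoice": ("Invoice for the March hosting bill", [0.9, 0.1, 0.0]),
    "receipt": ("Payment receipt for cloud servers", [0.8, 0.2, 0.1]),
    "picnic": ("Team picnic next Saturday", [0.0, 0.1, 0.9]),
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    # Fraction of query words that appear verbatim in the document.
    q_words = set(query.lower().split())
    d_words = set(text.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


def hybrid_search(query, query_vec, alpha=0.5):
    # alpha weights the semantic score; (1 - alpha) weights the keyword score.
    scored = []
    for doc_id, (text, vec) in docs.items():
        semantic = cosine(query_vec, vec)
        keyword = keyword_score(query, text)
        scored.append((alpha * semantic + (1 - alpha) * keyword, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# "hosting bill" matches "invoice" by keyword; the query vector is also close
# to "receipt", which shares no keywords but is semantically related.
results = hybrid_search("hosting bill", [1.0, 0.0, 0.0])
print(results)
```

The keyword score rescues exact matches (names, invoice numbers) that embeddings can blur, while the vector score surfaces related documents that share no words with the query.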

Running the local LM is what requires more resources, but as this project shows it's absolutely possible.

Of course, getting it to work *well* for certain use cases is still hard. Simply searching for closely matching sections of papers and injecting them into the prompt, as others have mentioned, doesn't always provide enough context for the LM to give a good answer. Local LMs aren't great at reasoning over large amounts of data yet, but they're getting better every week, so it's just a matter of time.

(If you're curious my email is in my profile)

It’s not as difficult as you think with libraries like LlamaIndex and LangChain.

Both have extensive examples in their documentation for almost identical use cases to the above.

Thanks for all the replies. It would probably be worth creating a HOWTO or something like that aimed at non-ML experts or complete AI illiterates like myself, to help put together something that works in simple steps (assuming this is possible), from procuring hardware that meets the minimum requirements to organizing data in a way that can be used for training, and finally using the right tools for the job.
There is zero training involved. The workflow is really no different from normal search. Compare it to the high-level flow of implementing Elasticsearch. The only difference is that you are encoding the data as vectors, using a model that best meets your criteria. There are tons of howtos out there already on generating embeddings and search for LLMs. I think even OpenAI has some cookbooks for this in their docs.
Definitely accessible to anyone who can write code. Very little ML knowledge is necessary. And all of this can be done reasonably on a laptop depending on how large your corpus material is.
Yeah, but the big question I kept having, and never saw answered, is:

How do you encode the private data into the vectors? It is a bunch of text, but how do you choose the vector values in the first place? What software does that? Isn’t that basically an ML task with its own weights? That’s what classifiers do!

I was surprised everyone had been writing about that but neglecting to explain this piece. Like math textbooks that “leave it as an exercise to the reader”.

Claude with its 100k context window doesn’t need to do this vector encoding. Is there anything like that in open source AI at the moment?

It's possible to extend the effective context window of many OSS models using various techniques. For the Llama-related models and others, there's a technique called "RoPE scaling" which allows you to run inference over a longer context window than the model was originally trained for. (This Reddit post helps highlight this: https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkawar...)
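
To make the idea concrete, here is a small sketch of how RoPE scaling changes the rotation angles assigned to positions. It shows two variants under stated assumptions: linear position interpolation (divide positions by a factor) and an NTK-aware variant (enlarge the rotary base), using the commonly cited base adjustment `base * factor ** (dim / (dim - 2))`. The function names and the defaults `dim=128` and `base=10000.0` are illustrative choices, not taken from any particular model's code.

```python
def rope_frequencies(dim=128, base=10000.0):
    # One rotation frequency per pair of dimensions: base^(-2i/dim).
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def angles_linear(position, factor, dim=128, base=10000.0):
    # Linear scaling ("position interpolation"): squeeze positions by `factor`
    # so a longer sequence maps back into the position range seen in training.
    return [(position / factor) * f for f in rope_frequencies(dim, base)]


def angles_ntk(position, factor, dim=128, base=10000.0):
    # NTK-aware scaling: keep positions as they are but enlarge the base, which
    # leaves the highest frequency unchanged and slows the lowest by `factor`.
    scaled_base = base * factor ** (dim / (dim - 2))
    return [position * f for f in rope_frequencies(dim, scaled_base)]


plain = angles_linear(2048, factor=1.0)
linear = angles_linear(8192, factor=4.0)
ntk = angles_ntk(8192, factor=4.0)

# With 4x linear scaling, position 8192 gets the angles position 2048 had.
print(max(abs(a - b) for a, b in zip(plain, linear)))
# NTK-aware: the first (fastest) component is unscaled, while the last
# (slowest) component matches what 4x linear scaling would give.
print(ntk[0], ntk[-1], linear[-1])
```

Either way, the model sees rotation angles within the range it was trained on even when the sequence is longer than the training context; quality at the extended length still has to be checked empirically.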

But even at 100K, you do eventually run out of context. You would with 1M tokens too. 100K tokens is the new 64K of RAM, you're going to end up wanting more.

So techniques like RAG that others have mentioned become necessary at some point, at least with models that look like the ones we have today.

The most straightforward way, but of course you can fiddle around a lot:

You use sentence transformers (https://www.sbert.net/).

You use a strong baseline like all-MiniLM-L6-v2. (Or you get more fancy with something from the Massive Text Embedding Benchmark, https://huggingface.co/spaces/mteb/leaderboard)

You break your text into sentences or paragraphs with no more than 512 tokens (according to the sentence transformers tokenizer).

You embed all your texts and insert them into your vector DB.
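
A minimal sketch of the chunking step, assuming paragraphs are separated by blank lines and approximating the token count with whitespace-separated words. A real pipeline would count tokens with the embedding model's own tokenizer, and the exact limit depends on the model. The `chunk_text` helper and its `max_tokens` parameter are hypothetical names for illustration. The resulting chunks would then be passed to the embedding model, for example via `SentenceTransformer("all-MiniLM-L6-v2").encode(chunks)` from the sentence-transformers library.

```python
import re


def chunk_text(text, max_tokens=512):
    """Split text into paragraph-based chunks of at most max_tokens words."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        words = paragraph.split()
        # Long paragraphs are cut into consecutive windows of max_tokens words.
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks


sample = "First paragraph about Ubuntu.\n\nSecond paragraph, " + "word " * 10
chunks = chunk_text(sample, max_tokens=5)
print(chunks)
```

Splitting on paragraph boundaries first keeps related sentences together, which tends to give the retriever more coherent chunks than fixed-size windows alone.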

There are many ways to approach this question, but in essence: word vectorization assigns a numerical representation to a specific word. Embeddings, on the other hand, also turn words into numerical representations, but keep semantically similar words close together in the vector space.

Yes, turning words into vectors is its own class of machine learning. You can learn a lot from the NLP course on Hugging Face https://huggingface.co/learn/nlp-course/chapter1/1 (and on YouTube).

One way to do it is to use cosine similarity[0]. The reason to do this is to get around the context window limitation, in the hope that whatever text chunks the similarity function returns contain the correct information to answer your question.
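
For reference, cosine similarity itself is only a few lines of code. The sketch below uses made-up three-dimensional vectors in place of real embeddings, and `top_k` is a hypothetical helper name for picking the chunks to put into the prompt.

```python
import math


def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 means same direction, 0.0 orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    # Return the indices of the k chunks most similar to the query.
    order = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return order[:k]


chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # same direction
print(top_k([1.0, 0.1, 0.0], chunk_vecs))
```

Because the score depends only on direction and not on vector length, a long email and a short one about the same topic can still rank as similar.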

How do you know that Claude doesn't do this? If you have multiple books, you end up with more than 100k context, and running the model with full context takes more time so it is more expensive as well.

[0] https://en.wikipedia.org/wiki/Cosine_similarity

A simple way would be to do some form of retrieval on those emails and add the results to the original prompt.
I imagine this means you’d need to come up with your own model, even if it's based on an existing one.
And is that hard? Sorry if this is a newbie question, I'm really out of the loop on this tech. What would be required? Computing power and tagging? Or can you like improve the model without much human intervention? Can it be done incrementally with usage and user feedback? Would a single user even be able to generate enough feedback for this?
Yes, this would be quite hard. Fine-tuning an LLM is no simple task. The tools and guidance around it are very new, and arguably not meant for non-ML Engineers.
What are some ways for working adults to get familiar with machine learning engineering?
That would require custom training. This project only does inference.
No need to fine tune the model. The model could be augmented with retrieved context (as discussed in my sibling comment).