by ipsi 701 days ago
So here's something I've been wanting to do for a while, but have kinda been struggling to figure out _how_ to do it. txtai looks like it has all the tools necessary to do the job, I'm just not sure which tool(s), and how I'd use them.

Basically, I'd like to be able to take PDFs of, say, D&D books, extract that data (this step is, at least, something I can already do), and load it into an LLM to be able to ask questions like:

* What does the feat "Sentinel" do?

* Who is Elminster?

* Which God(s) do Elves worship in Faerûn?

* Where can I find the spell "Crusader's Mantle"?

And so on. Given this data is all under copyright, I'd probably have to stick to using a local LLM to avoid problems. And, while I wouldn't expect it to have good answers to all (or possibly any!) of those questions, I'd nevertheless love to be able to give it a try.

I'm just not sure where to start - I think I'd want to fine-tune an existing model since this is all natural language content, but I get a bit lost after that. Do I need to pre-process the content to add extra information that I can't fetch relatively automatically? E.g., page numbers are simple to add in, but would I need to mark out things like chapter/section headings, or in-character vs out-of-character text? Do I need to add all the content in as a series of questions and answers, like "What information is on page 52 of the Player's Handbook? => <text of page>"?

11 comments

Use RAG.

Fine-tuning will bias a model toward returning specific answers. It's great for tone and classification. It's terrible for information. If you get info out of it, it's because it's a consistent hallucination.

Embeddings will turn the whole thing into a bunch of numbers. So something like Sentinel will probably match with similar feats. Embeddings are perfect for searching. You can convert images and sound to these numbers too.
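
To make the "bunch of numbers" idea concrete, here is a minimal sketch of similarity search. It assumes a toy bag-of-words vector in place of a real embedding model, and the document strings are made-up placeholders rather than real rules text; `embed`, `cosine` and `most_similar` are hypothetical helper names.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query, documents):
    """Return the document whose vector is closest to the query vector."""
    q = embed(query)
    return max(documents, key=lambda d: cosine(q, embed(d)))

# Placeholder snippets, not real book content.
docs = [
    "Sentinel feat: placeholder rules text about the feat",
    "Crusader's Mantle spell: placeholder rules text about the spell",
    "Elminster: placeholder background text about the character",
]
print(most_similar("What does the feat Sentinel do?", docs))
```

A real embedding model would also match on meaning rather than only on shared words, which is why the choice of model matters.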

But these numbers can't be stored in any regular DB. Most of the time they're held somewhere in memory, then thrown out. I haven't looked deeply into txtai, but that looks like what it does. This is okay, but it's a little slow and wasteful as you're running the embeddings each time. So that's what vector DBs are for. But unless you're running this at scale where every cent adds up, you don't really need one.

As for preprocessing, many embedding models are already good enough. I'd say try it first, try different models, then tweak as needed. Generally proprietary models do better than open source, but there's likely an open source one designed for game books, which would do best on an unprocessed D&D book.

However it's likely to be poor at matching pages afaik, unless you attach that info.
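
One way to "attach that info" is to carry the page number alongside each chunk of text, so a matched chunk can be cited. This is only a sketch: it assumes the text has already been extracted one string per page, and `Chunk`, `chunks_from_pages` and `cite` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str    # the extracted text that gets embedded and searched
    source: str  # book title, supplied by the caller
    page: int    # 1-based page number the text came from

def chunks_from_pages(pages, source):
    """Turn a list of page texts (page 1 first) into chunks, skipping blank pages."""
    return [
        Chunk(text=text.strip(), source=source, page=number)
        for number, text in enumerate(pages, start=1)
        if text.strip()
    ]

def cite(chunk):
    """Format a human-readable reference for a matched chunk."""
    return f"{chunk.source}, p. {chunk.page}"
```

The same record could hold other labels, such as a section heading or an in-character flag, if the extraction step can supply them.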

Based on what you're looking to do, it sounds like Retrieval Augmented Generation (RAG) should help. This article has an example on how to do that with txtai: https://neuml.hashnode.dev/build-rag-pipelines-with-txtai

RAG sounds sophisticated but it's actually quite simple. For each question, a database (vector database, keyword, relational etc) is first searched. The top n results are then inserted into a prompt and that is what is run with the LLM.
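
The steps just described can be sketched in a few lines. This is not txtai's API: it assumes a toy keyword-overlap search in place of a real database and takes the LLM as any callable that maps a prompt string to a reply; `search`, `build_prompt` and `rag` are hypothetical names.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def search(question, documents, n=2):
    """Step 1: return the top n documents by number of words shared with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(doc)), doc) for doc in documents]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:n]]

def build_prompt(question, context):
    """Step 2: insert the search results into a prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{joined}\n"
        f"Question: {question}\n"
        "Answer:"
    )

def rag(question, documents, llm, n=2):
    """Step 3: run the assembled prompt with the LLM."""
    return llm(build_prompt(question, search(question, documents, n)))
```

Swapping `search` for a vector or keyword database and `llm` for a local model keeps the overall shape the same.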

Before fine-tuning, I'd try that out first. I'm planning to have another example notebook out soon building on this.

Ah, that's very helpful, thanks! I'll have a dig into this at some point relatively soon.

An example of how I might provide references with page numbers or chapter names would be great (even if this means a more complex text-extraction pipeline). As would examples showing anything I can do to indicate differences that are obvious to me but that an LLM would be unlikely to pick up, such as the previously mentioned in-character vs out-of-character distinction. This is mostly relevant for asking questions about the setting, where in-character information might be suspect ("unreliable narrator"), while out-of-character information is generally fully accurate.

Tangentially, is this something that I could reasonably experiment with without a GPU? While I do have a 4090, it's in my Windows gaming machine, which isn't really set up for AI/LLM/etc development.

Will do, I'll have the new notebooks published within the next couple weeks.

In terms of a no GPU setup, yes it's possible but it will be slow. As long as you're OK with slow response times, then it will eventually come back with answers.

Thanks, I'd really appreciate it! The blog post you linked earlier was what finally made RAG "click" for me, making it very clear how it works, at least for the relatively simple tasks I want to do.

Glad to hear it. It's really a simple concept.

Where can we follow up on this when you're done--do you have a blog or social media?

All the links for that are here - https://neuml.com

All the people saying "don't use fine-tuning" don't realize that most of traditional fine-tuning's issues are due to modifying all of the weights in your model, which causes catastrophic forgetting.

There are tons of parameter-efficient fine-tuning methods, e.g. LoRA, "soft prompts", ReFT, etc., which are actually good to use alongside RAG and will likely supercharge your solution compared to "simply using RAG". The fewer parameters you modify, the more knowledge is "preserved".

Also, look into the Graph-RAG/Semantic Graph stuff in txtai. As usual, David (author of txtai) was implementing, years ago, code for things that the market only just now cares about.

Thanks for the great insights on fine-tuning and the kind words!

You can actually do this with LLMStack (https://github.com/trypromptly/LLMStack) quite easily in a no-code way. I put together a guide last week on using LLMStack with Ollama for local models - https://docs.trypromptly.com/guides/using-llama3-with-ollama. It lets you load all your files as a datasource and then build a RAG app over it.

For now it still uses OpenAI for embedding generation by default. We are updating that in the next couple of releases so it can use a local model to generate embeddings before writing to a vector DB.

Disclosure: I'm the maintainer of the LLMStack project.

I did something similar to this using RAG except for Vampire rather than D&D. It wasn't overwhelmingly difficult, but I found that the system was quite sensitive to how I chunked up the books. Just letting an automated system prepare the PDFs for me gave very poor results all around. I had to ensure that individual chunks had logical start/end positions, that tables weren't cut off, and so on.
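
A minimal sketch of chunking on logical boundaries follows. It assumes paragraphs are separated by blank lines in the extracted text and only ever breaks between paragraphs; `chunk_paragraphs` and `max_chars` are hypothetical names, and keeping tables intact would need extra handling that is not shown.

```python
def chunk_paragraphs(text, max_chars=1000):
    """Group whole paragraphs into chunks of roughly max_chars characters.

    A paragraph is never split, so a single paragraph longer than
    max_chars becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```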

I wouldn't fine-tune, that's too much cost/effort.

Yeah, that's about what I'd expected (and WoD books would be a priority for me to index). Another commenter mentioned that Knowledge Graphs might be useful for dealing with the limitations imposed by RAG (e.g., having to limit results because the context window is relatively small), which might be worth looking into as well. That said, properly preparing this data for a KG, ontologies and all, might be too much work.

RAG is all you need*. This is a pretty DIY setup, but I use a private instance of Dify for this. I have a private Git repository where I commit my "knowledge", a Git hook syncs the changes with the Dify knowledge API, and then I use the Dify API/chat for querying.

*it would probably be better to add a knowledge graph as an extra step, which first tells the system where to search. RAG by itself is pretty bad at summarizing and combining many different docs due to the limited LLM context sizes, and I find that many questions require this global overview. A knowledge graph or other form of index/meta-layer probably solves that.

From a quick search, it seems like Knowledge Graphs are particularly new, even by AI standards, so it's harder to get one up off the ground if you haven't been following AI extremely closely. Is that accurate, or is it just the integration points with AI that are new?

First I would calculate the number of tokens you actually need. If it's less than 32k, there are plenty of ways to pull this off without RAG. If more (millions), you should understand that RAG is an approximation technique and results may not be as high quality. If way more (billions), you might actually want to fine-tune.
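
A rough way to run that calculation is sketched below. It assumes about 4 characters per token, which is only a rule of thumb for English text (a real tokenizer gives exact counts), and the cut-off between "millions" and "billions" is an assumption chosen for illustration; `estimate_tokens` and `suggest_approach` are hypothetical names.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate; chars_per_token is a rule-of-thumb assumption."""
    return int(len(text) / chars_per_token)

def suggest_approach(token_count, context_limit=32_000, finetune_floor=1_000_000_000):
    """Map a token count onto the three cases described in the comment above."""
    if token_count < context_limit:
        return "fits in context, no RAG needed"
    if token_count < finetune_floor:
        return "use RAG"
    return "consider fine-tuning"
```
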
Fine-tuning is almost certainly the wrong way to go about this. It's not a good way of adding small amounts of new knowledge to a model because the existing knowledge tends to overwhelm anything you attempt to add in the fine-tuning steps.

Look into different RAG and tool usage mechanisms instead. You might even be able to get good results from dumping large amounts of information into a long context model like Gemini Flash.

No fine-tuning is necessary. You can use something reasonably good at RAG that's small enough to run locally, like the Command-R model run by Ollama, and a small embedding model like Nomic. There are dozens of simple interfaces that will let you import files to create a RAG knowledge base to interact with as you describe; AnythingLLM is a popular one. Just point it at your locally running LLM or tell it to download one using the interface. Behind the scenes, these tools store everything in LanceDB or similar and perform the searching for you when you submit a prompt in the simple chat interface.

Don't have anything to add to the others. Just sharing a way of thinking for deciding between RAG and fine-tuning:

(A) RAG is for changing content

(B) fine-tuning is for changing behaviour

(C) see if few-shot learning or prompt engineering is enough before going to (A) or (B)

It's a bit simplistic but I've found it helpful so far.

Very easy to do with Milvus and LangChain. I built a private Slack bot that takes PDFs, chunks them into Milvus using PyMuPDF, then uses LangChain for recall. It's surprisingly good for what you describe and took maybe 2 hours to build and run locally.

Seems like using txtai would also be very easy?

Yes, this article is a good place to start: https://neuml.hashnode.dev/build-rag-pipelines-with-txtai

I learned about txtai later and it definitely seems cool, maybe I'll rewrite it later.

Typical HN response here, but do you have a blog post or a guide on how you did this? Would love to know more.

I used AI, go feed it my comment.