Hacker News
by icyfox 959 days ago
My general rule of thumb at the moment is:

- Tasks requiring knowledge synthesis across multiple datapoints (in your case, wiki pages) call for fine-tuning, so the model is able to do some basic chain of reasoning to reach a new conclusion. Often just fine-tuning on the text itself, rather than on (query, text) pairs, is sufficient for basic memorization and therefore lookup. The con with this approach is that you don't know where a given answer came from: the pretrained model, your wiki, or a hallucination.

- Tasks that require some source of truth for reliability benefit more from the RAG approach, since the summarization layer can explicitly reference the input sources. I use markdown annotations for this output format since they provide inline and easily parsable references to the retrieved content.
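
To make the inline-reference idea concrete, here is a minimal sketch. The `[n]` marker format, the prompt wording, and the helper names `build_prompt` and `cited_sources` are all assumptions for illustration, not necessarily the exact annotation scheme described above. The model call itself is left out; `answer` stands in for whatever the model returns.

```python
import re

def build_prompt(question, passages):
    """Number each retrieved passage so the model can cite it as [n]."""
    sources = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Answer using only the sources below. "
        "Cite each claim with its source number, e.g. [1].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

# Assumed citation format: a bare number in square brackets.
CITATION = re.compile(r"\[(\d+)\]")

def cited_sources(answer, passages):
    """Map [n] markers in the answer back to the retrieved passages."""
    cited = {}
    for match in CITATION.finditer(answer):
        n = int(match.group(1))
        if 1 <= n <= len(passages):  # skip markers that point at no passage
            cited[n] = passages[n - 1]
    return cited

passages = [
    "The deploy script lives in tools/deploy.sh.",
    "Deploys run every Tuesday.",
]
prompt = build_prompt("When do deploys run?", passages)
answer = "Deploys run every Tuesday [2]."  # hypothetical model output
refs = cited_sources(answer, passages)
```

Markers that don't resolve to a retrieved passage are dropped here; in practice you may want to flag them instead, since they are a cheap signal that the answer drifted away from the sources.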

RAG is effectively a layer on top of classic information retrieval, like what happens with search engines. The retrieval itself could be semantic and embedding-based (using embeddings like the ones you get back from OpenAI), TF-IDF-based, or some other heuristic approximation.
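
As a rough sketch of the retrieval step, here is TF-IDF ranking in plain NumPy. The whitespace tokenizer, the smoothed IDF formula, and the toy documents are assumptions chosen to keep it short.

```python
import numpy as np

def tokenize(text):
    # Assumption: lowercase + whitespace split is good enough for a first pass.
    return text.lower().split()

def normalize(m):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(m, axis=-1, keepdims=True)
    return m / np.maximum(norms, 1e-12)

def fit_tfidf(docs):
    """Return (vocab, idf, unit-length TF-IDF matrix with one row per doc)."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    vocab = {tok: i for i, tok in enumerate(terms)}
    tf = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for tok in tokenize(doc):
            tf[row, vocab[tok]] += 1
    df = (tf > 0).sum(axis=0)  # number of docs containing each term
    idf = np.log((1 + len(docs)) / (1 + df)) + 1  # smoothed IDF
    return vocab, idf, normalize(tf * idf)

def embed_query(query, vocab, idf):
    """Project a query into the same TF-IDF space; unknown words are ignored."""
    vec = np.zeros(len(vocab))
    for tok in tokenize(query):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return normalize(vec * idf)

def top_k(query_vec, doc_matrix, k=2):
    """Indices of the k most similar docs, best first, plus all scores."""
    scores = doc_matrix @ query_vec
    return np.argsort(-scores)[:k], scores

docs = [
    "how to reset your vpn password",
    "expense report deadlines and policy",
    "vpn setup guide for new laptops",
]
vocab, idf, doc_matrix = fit_tfidf(docs)
idx, scores = top_k(embed_query("reset vpn password", vocab, idf), doc_matrix, k=2)
```

`top_k` only needs unit-length vectors, so swapping TF-IDF for embedding vectors from an API means replacing `fit_tfidf` and `embed_query` while the ranking step stays the same.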

If you have a few initial queries of what people are looking for, I'd start simple with a Jupyter notebook, the OpenAI fine-tuning API, and NumPy before jumping to off-the-shelf solutions that promise to solve this problem for you. It will build more of an intuition for your data and the tradeoffs required.
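
One cheap way to use those initial queries in a notebook is to label each with the page it should surface and track how often a retriever finds it. This is only a sketch: `hit_rate_at_k`, the word-overlap `overlap_retrieve`, and the toy pages are hypothetical stand-ins, and any retriever with the same `(query, k)` signature can be dropped in.

```python
def hit_rate_at_k(retrieve, labeled_queries, k=3):
    """Fraction of queries whose expected page id appears in the top k results."""
    hits = 0
    for query, expected_id in labeled_queries:
        if expected_id in retrieve(query, k):
            hits += 1
    return hits / len(labeled_queries)

# Toy corpus: page id -> page text (stand-in for your wiki).
pages = {
    "vpn": "how to reset your vpn password",
    "expenses": "expense report deadlines and policy",
    "laptop": "setup guide for new laptops",
}

def overlap_retrieve(query, k):
    """Baseline retriever: rank pages by how many words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        pages,
        key=lambda pid: len(query_words & set(pages[pid].lower().split())),
        reverse=True,
    )
    return ranked[:k]

labeled_queries = [
    ("reset vpn password", "vpn"),
    ("expense policy", "expenses"),
    ("new laptop setup", "laptop"),
]
score = hit_rate_at_k(overlap_retrieve, labeled_queries, k=1)
```

Running the same labeled set against a TF-IDF or embedding retriever gives a like-for-like number, which makes the tradeoffs easier to see than eyeballing individual answers.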