by maxbaines
953 days ago
Training a bespoke LLM takes a lot of effort and compute, so you are perhaps better off using Retrieval Augmented Generation (RAG). Here's some information from LangChain: https://js.langchain.com/docs/modules/data_connection/
https://python.langchain.com/docs/modules/data_connection/ Also, OpenAI released Assistants last week, which is an easy way to achieve RAG without needing new tools such as vector DBs, although 5000 docs is perhaps too large for Assistants. The first decision is whether to use an open model such as Llama 2 and host it yourself, or a hosted model such as GPT-4 from OpenAI or Claude 2 from Anthropic, etc.
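
To make the RAG idea concrete, here is a rough sketch of the retrieve-then-prompt loop using only the Python standard library. It is not how LangChain or Assistants do it: the word-count cosine similarity is a toy stand-in for real embeddings and a vector store, and the names `retrieve` and `build_prompt` are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two word-count vectors (Counters).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, docs, k=2):
    # Rank documents by similarity to the question and keep the top k.
    # Assumption: a real setup would compare embedding vectors instead.
    q = Counter(tokenize(question))
    ranked = sorted(
        docs,
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, docs, k=2):
    # Put the retrieved text in front of the question; this string is
    # what you would send to whichever model you end up choosing.
    context = "\n\n".join(retrieve(question, docs, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The model never sees the whole corpus, only the few passages that score highest for the question, which is why no training is needed.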
|
Do you have any good resources on cleaning up / structuring data? The 5000 articles I have span multiple years, which means contextual information may be "spread out". The data includes the date each article was written, and I'm pondering how to ensure the LLM doesn't talk about a fact from 2015 as if it's still true in the present day.
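
One possible way to use those dates, sketched here as an assumption rather than a tested recipe: keep the publication date attached to each article, label every excerpt with it when building the prompt, list the newest first, and tell the model what today's date is. The `Article` class and `format_context` function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    published: date  # date the article was written
    text: str        # article body or a retrieved excerpt


def format_context(articles, today):
    # Newest first, so recent information appears before older material.
    ordered = sorted(articles, key=lambda a: a.published, reverse=True)
    lines = [
        f"[published {a.published.isoformat()}] {a.text}" for a in ordered
    ]
    header = (
        f"Today's date is {today.isoformat()}. "
        "Each excerpt is labelled with its publication date. "
        "Treat older excerpts as possibly outdated and prefer the most "
        "recent one when they conflict."
    )
    return header + "\n\n" + "\n".join(lines)
```

Whether the model actually respects the labels depends on the model and the prompt, so this would need checking against a few known "stale fact" questions from the archive.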