Hacker News
Ask HN: How complex is it to train your own (tiny) LLM?
2 points by dxsh 953 days ago
I'll preface this by saying I am just getting up to speed on AI, so forgive me if I use any of the terms wrong.

My company is celebrating 10 years next year, and I would love to create a fun little LLM that's trained on our data. I would then like to create a front-end for it, much like ChatGPT. I've got around 5000 articles (not sure how many words) and around 30k images to work with. How would I go about training a small AI on this dataset? Where do I even get started? It's so overwhelming!

Many thanks.

3 comments

Training a bespoke LLM takes a lot of effort and compute; you are probably better off using Retrieval Augmented Generation (RAG). Here's some information from LangChain:

https://js.langchain.com/docs/modules/data_connection/ https://python.langchain.com/docs/modules/data_connection/

Also, OpenAI released Assistants last week, which is an easy way to achieve RAG without needing new tools such as vector DBs. However, 5000 docs is perhaps too large for Assistants.

The first decision is whether to use an open model such as Llama 2 and host it yourself, or a model such as GPT-4 from OpenAI, Claude 2 from Anthropic, etc.
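
To make the RAG suggestion concrete, here is a minimal sketch of the retrieval half in plain Python. It is not LangChain's API: the function names (`build_index`, `retrieve`, `build_prompt`) are made up for illustration, and TF-IDF scoring stands in for the embedding model and vector store a real setup would likely use. The resulting prompt would then be sent to whichever model you pick.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return one TF-IDF vector (a dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n = len(docs)
    # Smoothed IDF so that no term gets a zero or negative weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, docs, vectors, idf, k=2):
    """Return the k documents most similar to the query."""
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    ranked = sorted(
        range(len(docs)), key=lambda i: cosine(q, vectors[i]), reverse=True
    )
    return [docs[i] for i in ranked[:k]]


def build_prompt(question, passages):
    """Assemble the text that would be sent to the LLM (the model call is omitted)."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical stand-ins for the company articles.
docs = [
    "The company was founded in 2014 in Oslo.",
    "Our office dog is named Biscuit.",
    "We shipped the first mobile app in 2016.",
]
vectors, idf = build_index(docs)
question = "What is the name of the office dog?"
prompt = build_prompt(question, retrieve(question, docs, vectors, idf, k=1))
print(prompt)
```

The point is that nothing gets trained: the articles are indexed once, and each question only pulls in the few passages that look relevant.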

Thank you! I checked out LangChain last night and wow, I am super impressed by how accessible it is.

Do you have any good resources on cleaning up / structuring data? The 5000 articles I have span multiple years, which means contextual information may be "spread out". The data I have contains the date each article was written. I'm pondering how to ensure the LLM doesn't talk about a fact from 2015 like it's still true in the present day.
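
One possible approach to the date problem, sketched under assumptions: keep the publication date attached to every retrieved article and put it in the prompt, newest first, together with today's date and an instruction to treat older statements as possibly outdated. The `Article` class, `format_context`, and `build_dated_prompt` are hypothetical names, the sample articles are invented, and whether a given model follows the instruction is something you would have to test.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    title: str
    published: date
    text: str


def format_context(articles, today):
    """Render the articles newest first, each prefixed with its date and rough age."""
    ordered = sorted(articles, key=lambda a: a.published, reverse=True)
    lines = []
    for a in ordered:
        age_years = (today - a.published).days // 365
        lines.append(
            f"[{a.published.isoformat()}, about {age_years} years old] "
            f"{a.title}: {a.text}"
        )
    return "\n".join(lines)


def build_dated_prompt(question, articles, today):
    """Build a prompt that tells the model how old each passage is."""
    return (
        f"Today's date is {today.isoformat()}.\n"
        "Each passage below is prefixed with its publication date. "
        "Treat a statement as true only as of that date, prefer newer "
        "passages when they conflict, and say so when information may "
        "be outdated.\n\n"
        f"{format_context(articles, today)}\n\n"
        f"Question: {question}"
    )


# Invented sample data standing in for retrieved articles.
articles = [
    Article("Team update", date(2015, 3, 1), "We have 12 employees."),
    Article("Team update", date(2023, 6, 15), "We have 85 employees."),
]
dated_prompt = build_dated_prompt(
    "How many employees do we have?", articles, date(2023, 11, 13)
)
print(dated_prompt)
```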

I agree with this. The effort and time required to train a model are usually not worth it; using RAG with a model of 7B parameters or more is usually more than sufficient.
Sam Altman has just shown this weekend how this can super easily be done by "creating your own GPT" from the OpenAI page: https://twitter.com/AlphaSignalAI/status/1722321017446731927
Thanks a lot.

Now the tiny George Carlin in my head won't stop saying "jumbo shrimp, tiny LLM."

Good luck.