by PeterisP 1057 days ago
The idea of decentralized hierarchical LLMs is interesting, but your chosen example is not a good illustration: all three of these data sources are small and insufficient, and a model trained solely on any of them will not be good for anything. Other things being equal, data quality and domain matter a lot, but a hundredfold increase in data quantity makes an even larger difference.

Datasets like those can be used for fine-tuning a pretrained LLM towards a specific domain. But for decent results (not even state of the art, just anything usable) you need a dataset large enough to learn English and general world knowledge, and for that the preferable size is "almost everything you can get your hands on": the quantity you'd want to train on is larger than the quantity of good data you can realistically get. For example, the 800 GiB of text at https://pile.eleuther.ai/ is a good start, but if you could get ten times more data (as some of the big companies probably do, since they have access to lots of user-generated non-public text), you should definitely use that.

If you want targeted LLMs, then IMHO the proper mindset for data choice is "take everything that you can out of what humanity has ever written and then pick out of that the most suitable 20% for your needs", and that would give much better results than any single dataset that's only Wikipedia-sized.
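
To make the "pick the most suitable 20%" idea concrete, here is a toy sketch. Everything in it is an assumption for illustration: the function names `relevance` and `select_top_fraction`, the keyword-overlap score, and the tiny corpus are all hypothetical. A real pipeline would more likely use a trained quality or domain classifier over a corpus far too large to sort in memory.

```python
def relevance(doc: str, domain_terms: set[str]) -> float:
    """Toy score: fraction of a document's tokens that are domain terms."""
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in domain_terms) / len(tokens)


def select_top_fraction(
    corpus: list[str], domain_terms: set[str], fraction: float = 0.2
) -> list[str]:
    """Keep the highest-scoring `fraction` of the corpus (at least one document)."""
    if not corpus:
        return []
    keep = max(1, int(len(corpus) * fraction))
    # sorted() is stable, so ties keep their original corpus order.
    ranked = sorted(corpus, key=lambda d: relevance(d, domain_terms), reverse=True)
    return ranked[:keep]


# Hypothetical five-document corpus and domain vocabulary.
corpus = [
    "the cat sat",
    "gradient descent updates weights",
    "weights and gradient",
    "hello world",
    "lunch was good",
]
terms = {"gradient", "descent", "weights"}
selected = select_top_fraction(corpus, terms, fraction=0.2)
```

The point of the sketch is only the shape of the process: score everything you can collect, then keep the best slice for your domain, rather than starting from one small dataset.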

2 comments

Have you seen the recent work on TinyStories? https://arxiv.org/abs/2305.07759

It got some nice attention here: https://github.com/karpathy/llama2.c

I think there may be some applications in this limited space that are worth looking into. You won’t replicate GPT-anything, but it may be possible to solve some nice problems much more efficiently than one would expect at first.

That is not so certain. Microsoft's "Textbooks are all you need" is a case in point. https://news.ycombinator.com/item?id=36413768
That paper kind of does the same thing that my comment above proposed: it starts with as large a dataset as they can get and then filters it to extract a much smaller dataset focused on a specific task, one that is still larger than all of English Wikipedia.