by satvikpendem 959 days ago
The vast knowledge trove of Google can't be overstated, even if sometimes the model isn't as competent at certain tasks as OpenAI's GPT models.

If there's one thing that's becoming clear in the open source LLM world, it's that the dataset really is the 'secret sauce' for LLMs. There are endless combinations of various datasets plus foundation model plus training approach, and by far the key determinant of end model performance seems to be the dataset used.
> it's that the dataset really is the 'secret sauce'

alwayshasbeen.jpg

There have been articles about how "data is the new oil" for a couple of decades now, with the first reference I could find being from British mathematician Clive Humby in 2006 [0]. That it rings even more true in the age of LLMs is just another transformation of the same fundamental data underneath.

[0] https://en.wikipedia.org/wiki/Clive_Humby#cite_ref-10

> There have been articles about how "data is the new oil" for a couple of decades now, with the first reference I could find being from British mathematician Clive Humby in 2006

I am specifically referring to the phrase I quoted, not some more abstract sentiment.

The best answer to this is https://www.youtube.com/watch?v=ab6GyR_5N6c :)
Wasn't there a comment on HN just today saying Google had an institutional reluctance to use certain data sets like libgen? I honestly don't think Google used everything they had to train their LLM.

https://news.ycombinator.com/item?id=38194107