| HN Mirror



	by satvikpendem 959 days ago
	The vast knowledge trove of Google can't be overstated, even if sometimes the model isn't as competent at certain tasks as OpenAI's GPT models.

2 comments

taneq 959 days ago

If there's one thing that's becoming clear in the open source LLM world, it's that the dataset really is the 'secret sauce' for LLMs. There are endless combinations of various datasets plus foundation model plus training approach, and by far the key determinant of end model performance seems to be the dataset used.


satvikpendem 959 days ago

> it's that the dataset really is the 'secret sauce'

alwayshasbeen.jpg

There have been articles about how "data is the new oil" for a couple of decades now, with the first reference I could find being from British mathematician Clive Humby in 2006 [0]. The fact that it rings even more true in the age of LLMs is just another transformation of the fundamental data underneath.

[0] https://en.wikipedia.org/wiki/Clive_Humby#cite_ref-10


still_grokking 958 days ago

2006?

https://en.wikipedia.org/wiki/Scientia_potentia_est


satvikpendem 957 days ago

> There have been articles about how "data is the new oil" for a couple of decades now, with the first reference I could find being from British mathematician Clive Humby in 2006

I am specifically referring to the phrase I quoted, not some more abstract sentiment.


sharkoz 958 days ago

The best answer to this is https://www.youtube.com/watch?v=ab6GyR_5N6c :)


kccqzy 959 days ago

Wasn't there a comment on HN just today saying Google had an institutional reluctance to use certain data sets like libgen? I honestly don't think Google used everything they had to train their LLM.

https://news.ycombinator.com/item?id=38194107
