| HN Mirror


by simonw 729 days ago

Because most companies genuinely don't value training on user data in that way.

It just isn't that valuable, even without the huge amount of negative publicity attached to doing that.

The cutting-edge AI labs are leaning much more into high-quality data (licensed from the Associated Press, for example) and synthetic data, which it turns out is a huge part of the training data for Claude and Microsoft's Phi series.

Andrej Karpathy said: "The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all." - https://twitter.com/karpathy/status/1797313173449764933

1 comment

altdataseller 729 days ago

But conversations in Slack aren’t your average webpage. Minus the channels used for automated messages and memes, a lot of in-depth, quality conversations happen on Slack on a large variety of topics.


simonw 729 days ago

And it's all full of potentially private details.

Can you imagine the storm of bad publicity that would emerge the first time some company has details of an internal strategy leaked because some chatbot ended up parroting those details back to a competitor?
