by ck2 1247 days ago
Datasets.

The one with the largest, most personal, most obtrusive, invasive dataset will probably win.

The one that has absorbed every podcast, every YouTube video, every closed-caption text in existence, will have the most "complete" answers.

2 comments

Hidden datasets can be replaced with model predictions collected from a public API, so they can effectively be "exfiltrated" from the trained model. And we have already maxed out the accessible online text and the good-quality sources.
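To make the "collect predictions from a public API" idea concrete, here is a toy sketch. Everything in it is an assumption for illustration: `teacher` stands in for a model behind an API whose training data we never see, and `distill` is a hypothetical helper that fits a tiny linear "student" purely on the teacher's answers.

```python
def teacher(x):
    # Hypothetical black-box "API": callers see outputs only,
    # never the hidden data the model was trained on.
    return 3.0 * x + 1.0


def distill(query, inputs):
    """Fit a linear student y = w*x + b using only the teacher's answers."""
    ys = [query(x) for x in inputs]  # predictions collected from the "API"
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in inputs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, ys))
    w = cov_xy / var_x           # ordinary least-squares slope
    b = mean_y - w * mean_x      # ordinary least-squares intercept
    return w, b


if __name__ == "__main__":
    w, b = distill(teacher, [0.0, 1.0, 2.0, 3.0])
    print(f"student: y = {w:.2f} * x + {b:.2f}")
```

A real LLM is obviously not a line, but the shape is the same: the student never touches the hidden dataset, only the (input, prediction) pairs.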

What is going to make a difference is running models to generate more text for training, because relying on humans alone doesn't scale. For example, we could use LLMs to do brute-force problem solving and then fine-tune on the solutions that check out.
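Roughly what I mean by brute-force solving and then fine-tuning, as a toy sketch. The names are all made up for illustration: `propose` is a random guesser standing in for sampling an LLM, `verify` is a cheap automatic checker, and the "problems" are just integers to factor. Only verified attempts end up in the training set.

```python
import random


def propose(n, rng):
    # Stand-in for sampling a model: guess a candidate factor pair of n.
    a = rng.randint(2, n - 1)
    return a, n // a


def verify(n, candidate):
    # Cheap automatic check; this is what lets the loop run without humans.
    a, b = candidate
    return a > 1 and b > 1 and a * b == n


def build_dataset(problems, attempts=200, seed=0):
    """Sample many attempts per problem and keep only verified solutions."""
    rng = random.Random(seed)
    dataset = []
    for n in problems:
        for _ in range(attempts):
            candidate = propose(n, rng)
            if verify(n, candidate):
                dataset.append({
                    "prompt": f"Factor {n}",
                    "completion": f"{candidate[0]} * {candidate[1]}",
                })
                break  # one verified solution per problem is enough here
    return dataset


if __name__ == "__main__":
    for row in build_dataset([15, 21, 35, 77]):
        print(row)
```

The resulting rows are what you would fine-tune on. The whole thing depends on having a verifier; for problems where nothing can check the answer automatically, this loop doesn't help.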

AlphaZero is the shining example of a model trained on its own generated data and surpassing us at our own game. The self-generated data approach has the potential to reach superhuman levels of performance.

How about illegal datasets like all the phone calls the NSA has been collecting domestically? Someone is going to train a private ChatGPT with that for queries.
Only legally gathered, absolutely "white" datasets could win, because gray/black methods of gathering lack feedback.

You have no way to verify whether gray/black sources really gathered the data or just faked it.