by pixelmonkey 1157 days ago
The last gen of popular LLMs focuses on publicly accessible web text. But we have lots of other sources of "latent" or "hidden" text. For example, OpenAI's Whisper model can turn audio to text reliably. If you point Whisper at the world's podcasts, that's a whole new source of conversational text. If you point Whisper at YouTube, that's a whole new source of all sorts of text. And then there are all sorts of private sources of text, like UpToDate for doctors, LexisNexis for lawyers, and so forth. I suspect "running out" isn't a within-a-decade concern, especially since text or text-equivalent data grows exponentially in the present internet environment. I think the bigger challenge will be distinguishing human-generated from AI-generated data after 2023.
2 comments

> If you point Whisper at YouTube, that's a whole new source of all sorts of text.

A lot of YT videos already have autogenerated English subtitles, which are actually available as a VTT download, so you don't even need to run Whisper on a video to obtain them!
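For anyone who wants to turn such a subtitle file into plain training text, here is a rough sketch using only the Python standard library. It assumes you already have the contents of a WebVTT file as a string; how you download it is out of scope. The function name `vtt_to_text` is made up for illustration, and dropping consecutive duplicate lines is my own assumption, since rolling captions can repeat a line across neighbouring cues.

```python
import re

# Matches inline cue markup such as <c>, </c>, or <00:00:01.500>.
TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt: str) -> str:
    """Flatten the cue text of a WebVTT document into one string.

    Hypothetical helper: keeps only cue payload lines, strips inline
    tags, and drops a line if it repeats the previously kept line.
    """
    kept = []
    normalized = vtt.replace("\r\n", "\n").strip()
    # WebVTT blocks (header, NOTE, STYLE, cues) are separated by blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        # A cue is the only block type that has a timing line with "-->".
        timing_index = next(
            (i for i, line in enumerate(lines) if "-->" in line), None
        )
        if timing_index is None:
            continue  # header, NOTE, or STYLE block: no cue text here
        for line in lines[timing_index + 1:]:
            text = TAG.sub("", line).strip()
            if text and (not kept or kept[-1] != text):
                kept.append(text)
    return " ".join(kept)


if __name__ == "__main__":
    sample = (
        "WEBVTT\n\n"
        "00:00:00.000 --> 00:00:02.000\n"
        "hello <c>world</c>\n\n"
        "00:00:02.000 --> 00:00:04.000\n"
        "hello world\n"
        "second line\n"
    )
    print(vtt_to_text(sample))
```

This throws away timestamps and speaker information entirely, which is fine for bulk text but not if you want to align the text with the audio later.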

But how much more data is required to make a big difference? Is doubling the dataset considered a dramatic improvement? Or is increasing the dataset by 10x needed?
Also, quality is likely important: will the models get better if we train them on YouTube comments?