HN Mirror



	by pixelmonkey 1157 days ago
	The last gen of popular LLMs focuses on publicly accessible web text. But we have lots of other sources of "latent" or "hidden" text. For example, OpenAI's Whisper model can turn audio to text reliably. If you point Whisper at the world's podcasts, that's a whole new source of conversational text. If you point Whisper at YouTube, that's a whole new source of all sorts of text. And then there are all sorts of private sources of text, like UpToDate for doctors, LexisNexis for lawyers, and so forth. I suspect "running out" isn't a within-a-decade concern, especially since text or text-equivalent data grows exponentially in the present internet environment. I think the bigger challenge will be distinguishing human-generated from AI-generated data after 2023.

2 comments

chii 1157 days ago

> If you point Whisper at YouTube, that's a whole new source of all sorts of text.

A lot of YT videos already have autogenerated English subtitles, which are actually available as a VTT download, so you don't even need to run Whisper on a video to obtain the text!
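
For anyone who wants to try this, here is a rough sketch of turning a WebVTT caption file into plain text. It assumes the file has already been downloaded by some other means and follows the usual WebVTT layout (blank-line-separated cues, each with a `start --> end` timing line). The function name `vtt_to_text` is made up for illustration. Dropping consecutive duplicate lines is my own assumption about how rolling auto-captions tend to repeat text; it is not something the format guarantees.

```python
import re

# A cue timing line looks like "00:00:01.000 --> 00:00:03.500" (hours optional).
TIMING = re.compile(
    r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
)
# Inline markup such as <c>, </c>, <v Speaker>, or <00:00:01.500>.
TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt: str) -> str:
    """Extract the spoken text from a WebVTT document as one string.

    Hypothetical helper: skips the header, NOTE/STYLE blocks, and cue
    identifiers, strips inline tags, and drops consecutive duplicate lines.
    """
    normalized = vtt.replace("\r\n", "\n").replace("\r", "\n")
    out = []
    for block in re.split(r"\n\s*\n", normalized):
        block_lines = block.strip().split("\n")
        # Find the timing line; blocks without one are not cues.
        for i, line in enumerate(block_lines):
            if TIMING.match(line.strip()):
                break
        else:
            continue
        # Everything after the timing line is cue text.
        for raw in block_lines[i + 1:]:
            text = TAG.sub("", raw).strip()
            if text and (not out or out[-1] != text):
                out.append(text)
    return " ".join(out)
```

The result is a single whitespace-joined string with no timestamps, which is roughly the form you would want before any further cleaning or filtering.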


Salgat 1157 days ago

But how much more data is required to make a big difference? Is doubling the dataset considered a dramatic improvement? Or is increasing the dataset by 10x needed?


ospray 1157 days ago

Also, quality is likely important: will the models get better if we train them on YouTube comments?
