by pixelmonkey
1157 days ago
The last gen of popular LLMs focused on publicly accessible web text. But we have lots of other sources of "latent" or "hidden" text. For example, OpenAI's Whisper model can turn audio into text reliably. If you point Whisper at the world's podcasts, that's a whole new source of conversational text. If you point Whisper at YouTube, that's a whole new source of all sorts of text. And then there are all sorts of private sources of text, like UpToDate for doctors, LexisNexis for lawyers, and so forth. I suspect "running out" isn't a within-a-decade concern, especially since text or text-equivalent data grows exponentially in the present internet environment. I think the bigger challenge will be distinguishing human-generated from AI-generated data after 2023.
|
A lot of YT videos already have autogenerated English subtitles, which are actually available as a VTT download, so you don't even need to run Whisper on a video to get the text!
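
For anyone who wants to try this, here is a rough sketch of turning an already-downloaded VTT subtitle file into plain text using only the Python standard library. The function name `vtt_to_text` and the sample captions are made up for illustration, and how you fetch the file is left out. It assumes the usual WebVTT layout (blank-line-separated blocks, a `-->` timing line before each cue's text) and that autogenerated captions repeat lines across consecutive cues, so it drops back-to-back duplicates.

```python
import html
import re

# Matches inline cue markup such as <c>, </c>, or <00:00:00.500> word timestamps.
TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt: str) -> str:
    """Flatten WebVTT caption data into a single plain-text string."""
    out = []
    # WebVTT blocks (header, notes, cues) are separated by blank lines.
    blocks = re.split(r"\n\s*\n", vtt.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        # Skip the file header and non-cue blocks.
        if lines[0].startswith(("WEBVTT", "NOTE", "STYLE", "REGION")):
            continue
        # The cue text is everything after the timing line.
        for i, line in enumerate(lines):
            if "-->" in line:
                payload = lines[i + 1:]
                break
        else:
            continue  # no timing line, so not a cue
        for line in payload:
            text = html.unescape(TAG.sub("", line)).strip()
            # Assumption: rolling captions repeat the previous line, so
            # drop consecutive duplicates.
            if text and (not out or out[-1] != text):
                out.append(text)
    return " ".join(out)


# Hypothetical sample in the style of autogenerated captions.
SAMPLE = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:02.000 align:start position:0%
hello<00:00:00.500><c> world</c>

00:00:02.000 --> 00:00:04.000
hello world
this is &amp; a test
"""

print(vtt_to_text(SAMPLE))  # prints: hello world this is & a test
```

This only strips markup and timing, so the result has whatever punctuation and casing the captions had, which for autogenerated ones is often not much.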