| HN Mirror

by metadat 830 days ago

This is a brilliant and useful application of LLM technology, I'm impressed.

One question: on the backend, is it downloading each video's CC (closed-caption) transcript and feeding that into a tuned prompt? What happens for videos where this is missing? Asking because I've noticed CC is occasionally unavailable for some YouTube videos.

If you cared to have a fallback, a potentially interesting experiment / solution for such cases is to download the video, extract the audio to a WAV file, then run the audio through Whisper [1] to generate the transcript. On CPUs, it will still be incredibly intensive and slow, generally not much faster than real-time (e.g. a 5 minute clip will take on the order of ~5 minutes to transcribe). However, with Whisper running on a fancy GPU it is insanely faster, roughly 100-200x, meaning even for long videos, generating the transcripts will complete in only a few seconds.
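A minimal sketch of the fallback described above, assuming the video has already been downloaded to a local file, that `ffmpeg` is on the PATH, and that the `openai-whisper` package is installed. The function names (`build_ffmpeg_command`, `extract_audio`, `transcribe`) and the choice of the `base` model are illustrative assumptions, not anything from the stepify.tech code.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(video_path: str, wav_path: str) -> List[str]:
    """Build an ffmpeg command that extracts mono 16 kHz WAV audio."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", video_path,
        "-vn",           # drop the video stream, keep audio only
        "-ac", "1",      # downmix to mono
        "-ar", "16000",  # resample to 16 kHz
        wav_path,
    ]


def extract_audio(video_path: str) -> str:
    """Run ffmpeg and return the path of the extracted WAV file."""
    wav_path = str(Path(video_path).with_suffix(".wav"))
    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)
    return wav_path


def transcribe(video_path: str, model_name: str = "base") -> str:
    """Extract the audio track and transcribe it with Whisper."""
    import whisper  # imported lazily; install with: pip install openai-whisper

    wav_path = extract_audio(video_path)
    model = whisper.load_model(model_name)  # uses a GPU if PyTorch finds one
    result = model.transcribe(wav_path)
    return result["text"]
```

Calling `transcribe("talk.mp4")` (a hypothetical file name) would return the transcript as a single string, which could then be fed to the same prompt as a caption transcript.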

Great job @aka_sh!

[1] https://github.com/openai/whisper

p.s. Is there any chance you'd open source your code? Or do you plan to turn this into a business? The code itself is exactly a huge moat, and it'd be cool to see how you did this. Cheers.

p.p.s. The stepify.tech app is currently crashing out to a Heroku error page when I try to submit a YT link.

5 comments

aka_sh 830 days ago

Thank you! I'm getting the transcript through an API and feeding it to GPT. For now, the fallback when there are no captions is just to make something out of the video's description. I really appreciate the suggestion; I'll experiment with Whisper. Regarding open source or business, I don't really know yet. Maybe I'll lean towards the business side to cover the costs and see where this goes. And sorry for the downtime! API credits ran out. It should be fixed by now.
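The fallback order discussed here (captions first, then a Whisper transcript, then the description) could be sketched roughly as follows. This is only an illustration: `pick_source_text` and its callable parameters are hypothetical names, and the real app's transcript API is not shown anywhere in this thread.

```python
from typing import Callable, Optional, Tuple


def pick_source_text(
    fetch_captions: Callable[[], Optional[str]],
    transcribe_audio: Callable[[], Optional[str]],
    description: str,
) -> Tuple[str, str]:
    """Return (source_name, text) from the first source that yields text.

    Both callables are hypothetical stand-ins: one for the caption API,
    one for a Whisper-based transcription step.
    """
    sources = (("captions", fetch_captions), ("whisper", transcribe_audio))
    for name, fetch in sources:
        try:
            text = fetch()
        except Exception:
            text = None  # treat any failure as "this source is unavailable"
        if text and text.strip():
            return name, text
    # Last resort: fall back to the video description.
    return "description", description
```

Returning the source name alongside the text makes it easy to log how often each fallback is actually hit.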

metadat 830 days ago

Eek, so many typos in my comment - but the most egregious was where I meant to convey the code itself is not a huge moat. Even still, no worries if you don't want to give it away, I totally understand.

Keep up the good execution.

cchance 829 days ago

Definitely try out Whisper after splitting out the audio as a fallback, and don't forget there are other models like WhisperFast that might be slightly less accurate but less resource-intensive, and since you're not publishing the captions themselves you don't need it to get literally every word perfect.
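As one possible illustration of a lighter-weight variant, here is a sketch using the `faster-whisper` package. Whether that is the project meant by "WhisperFast" is an assumption; the `small` model size and `int8` compute type are likewise just example settings that trade some accuracy for lower resource use.

```python
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Concatenate the text of transcription segments into one string."""
    return " ".join(seg.text.strip() for seg in segments)


def transcribe_fast(wav_path: str, model_size: str = "small") -> str:
    """Transcribe a WAV file with faster-whisper on CPU, int8 quantized."""
    # Imported lazily; install with: pip install faster-whisper
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus metadata.
    segments, _info = model.transcribe(wav_path)
    return join_segments(segments)
```

Since the transcript only feeds a summarization prompt, a smaller model may be good enough here.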

ravenstine 830 days ago

It's epic how well that works. Even with Whisper locally, most of what I throw at it becomes readable.

Yannael 829 days ago

Here is an example implementation you may find interesting (it also includes snapshots and links back to the original video): https://github.com/Yannael/video2blogpost

redbell 829 days ago

Here is another resource on the same topic: https://news.ycombinator.com/item?id=39367264

j45 830 days ago

Comparing the YT transcript to Whisper transcripts could be interesting, to see whether Whisper picks up on something extra.
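A rough way to try that comparison, assuming both transcripts are available as plain strings, is a word-level diff with Python's standard `difflib`. The helper names `extra_words` and `similarity` are made up for this sketch, and lowercasing plus whitespace splitting is a deliberately crude normalization.

```python
import difflib
from typing import List


def _words(text: str) -> List[str]:
    """Crude normalization: lowercase and split on whitespace."""
    return text.lower().split()


def extra_words(yt_text: str, whisper_text: str) -> List[str]:
    """Words present in the Whisper transcript but not aligned in the YT one."""
    yt, wh = _words(yt_text), _words(whisper_text)
    matcher = difflib.SequenceMatcher(a=yt, b=wh, autojunk=False)
    extras: List[str] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            extras.extend(wh[j1:j2])
    return extras


def similarity(yt_text: str, whisper_text: str) -> float:
    """Word-level similarity ratio between 0.0 and 1.0."""
    matcher = difflib.SequenceMatcher(
        a=_words(yt_text), b=_words(whisper_text), autojunk=False
    )
    return matcher.ratio()
```

For example, `extra_words("add two cups of flour", "add two level cups of flour")` returns `["level"]`.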

There is little need to reinvent the wheel on audio processing when there are other problems to solve.

alvah 830 days ago

The suggestion was to use Whisper as a fallback where no YT transcript exists.

cchance 829 days ago

I mean if CC is missing you just run it through whisper/whisperfast and you've got CC.
