by pushfoo 783 days ago
> The problem is for videos that have no transcript.

Whisper or other models can help with that too, but remember to preprocess to cut silence. The dataset tends to include ads in the captions, which results in ad text being hallucinated from silence.
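Rough sketch of what I mean by cutting silence, assuming you already have mono audio decoded to floats in [-1, 1]. The function names, the RMS threshold, and the 500 ms minimum gap are all made up for illustration and would need tuning; a real pipeline would more likely use a proper VAD or something like ffmpeg's silenceremove filter.

```python
import math
from typing import List, Sequence, Tuple


def non_silent_spans(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 30,           # assumed frame size
    rms_threshold: float = 0.01,  # assumed loudness cutoff, tune per source
    min_silence_ms: int = 500,    # gaps shorter than this are kept as pauses
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of the regions worth transcribing."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_silent_frames = max(1, min_silence_ms // frame_ms)
    n_frames = (len(samples) + frame_len - 1) // frame_len

    spans = []
    start = None     # start index of the span currently being built
    silent_run = 0   # consecutive quiet frames seen inside that span

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= rms_threshold:
            if start is None:
                start = i * frame_len
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                # Close the span where the quiet stretch began.
                spans.append((start, (i - silent_run + 1) * frame_len))
                start = None
                silent_run = 0

    if start is not None:
        # Audio ended mid-span; drop any trailing quiet frames.
        end = min(len(samples), (n_frames - silent_run) * frame_len)
        spans.append((start, end))
    return spans


def cut_silence(samples: Sequence[float], sample_rate: int, **kwargs) -> List[float]:
    """Concatenate only the non-silent regions."""
    out: List[float] = []
    for s, e in non_silent_spans(samples, sample_rate, **kwargs):
        out.extend(samples[s:e])
    return out
```

You'd feed the output of cut_silence (or each span separately, if you want to keep timestamps) to the transcription model instead of the raw audio. Short pauses stay in on purpose so speech doesn't get chopped mid-sentence.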

You could also add a transcript-evaluation step which checks whether this actually looks like a step-by-step video, but I'd consider skipping it for cost and efficiency. Trying to be helpful by evaluating whether the video is instructions or not is added complexity where bugs and strange behavior can creep in.