by AlexanderTheGr8 1268 days ago
I love to talk about this stuff as well. My stack is: video -> extract audio -> Whisper for the transcript -> break it into segments -> create embeddings for each segment -> get a query from the user -> embed the query -> compare and show the best results.

There are a couple of other tricks I use, such as an overlap window for segments and a little post-processing to get better results from the comparisons, but overall this is the gist of it.
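
For anyone who wants to see the compare step concretely, here is a minimal sketch of the last three stages (embed segments, embed query, rank by similarity). It is not my actual code: `toy_embed` is a hypothetical bag-of-words stand-in so the example runs without any model, and the `start`/`text` fields and the cosine-similarity ranking are assumptions for illustration. In a real setup you would swap `toy_embed` for a call to whatever embedding model you use.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(segments):
    """Embed every transcript segment once, up front."""
    return [(seg, toy_embed(seg["text"])) for seg in segments]


def search(query, index, top_k=3):
    """Embed the query, score it against each segment, return the best matches."""
    q = toy_embed(query)
    scored = [(cosine(q, emb), seg) for seg, emb in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up transcript segments; "start" is the offset in seconds.
segments = [
    {"start": 0.0, "text": "Today we talk about training dogs to sit."},
    {"start": 60.0, "text": "Next, the budget for the quarter."},
    {"start": 120.0, "text": "Finally, a recipe for sourdough bread."},
]
index = build_index(segments)
best_score, best_segment = search("how do I bake bread", index, top_k=1)[0]
print(best_segment["start"], round(best_score, 3))
```

Embedding the segments once in `build_index` matters in practice, since only the query needs to be embedded at search time.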

One issue with this approach is that question answering doesn't work as well as I'd like, because a question and its answer don't necessarily have similar embeddings ("What's the dog doing?" and "He's sleeping" can look like two completely independent sentences). So I would love to look into Flan-T5 for this.

I am "AlexanderTheGreat#9743" on Discord if you want to discuss more.

2 comments

The biggest question I have after building something similar is: what's the best way to break up transcripts into segments? You want the segments to be long enough to extract useful semantic info, but you don't want them to be too long either.
60-second segments with a 30-second overlap window seem to work quite well for me, but YMMV.
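
A rough sketch of that windowing, in case it helps. It assumes the transcript arrives as timestamped pieces with `start`, `end`, and `text` fields (the shape Whisper-style output usually has); `window_segments` and the rule that a piece belongs to the window its start time falls in are my own illustrative choices, not a fixed recipe.

```python
def window_segments(pieces, window=60.0, overlap=30.0):
    """Group timestamped transcript pieces into fixed-length, overlapping windows."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    if not pieces:
        return []
    step = window - overlap  # how far each new window advances
    end_of_audio = max(p["end"] for p in pieces)
    windows = []
    start = 0.0
    while start < end_of_audio:
        stop = start + window
        # Assumption: a piece belongs to a window if it starts inside it.
        texts = [p["text"].strip() for p in pieces if start <= p["start"] < stop]
        if texts:
            windows.append({
                "start": start,
                "end": min(stop, end_of_audio),
                "text": " ".join(texts),
            })
        start += step
    return windows


# Made-up transcript: five 20-second pieces.
pieces = [
    {"start": 0.0, "end": 20.0, "text": "a"},
    {"start": 20.0, "end": 40.0, "text": "b"},
    {"start": 40.0, "end": 60.0, "text": "c"},
    {"start": 60.0, "end": 80.0, "text": "d"},
    {"start": 80.0, "end": 100.0, "text": "e"},
]
windows = window_segments(pieces)
for w in windows:
    print(w["start"], w["end"], w["text"])
```

With a 30-second step, every piece lands in up to two windows, so a sentence cut at one window boundary still appears whole in the neighboring window.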
I’ll shoot you a line. As it happens I just came across Haystack (6.5k stars on GitHub) which looks like an awesome stack for this class of work.

https://haystack.deepset.ai/