by AlexanderTheGr8
1268 days ago
I love to talk about this stuff as well. My stack is: video -> extract audio -> Whisper for transcript -> break it into segments -> create embeddings for each segment -> get query from user -> get query embeddings -> compare and show the best results. There are a couple of other tricks I use, such as an overlap window for segments and a little post-processing for better results from the comparisons, but overall this is the gist of it. An issue with this approach is that question answering doesn't work as well as I'd like, because a question and its answer don't necessarily have similar embeddings ("What's the dog doing?" and "He's sleeping" can be two completely independent sentences). So I would love to look more into Flan-T5 for this. I am "AlexanderTheGreat#9743" on Discord if you want to discuss more.
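For anyone who wants to play with the segment -> embed -> compare part, here is a rough sketch of how it could look. It is not the actual code behind the stack above. The function names, the window sizes, and the sample transcript are all made up for illustration, and `toy_embed` is a bag-of-words stand-in so the snippet runs without a model; a real setup would plug a sentence-embedding model in through the `embed` parameter.

```python
import math
import re
from collections import Counter


def split_with_overlap(words, size=8, overlap=2):
    """Cut a word list into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    segments = []
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the transcript
    return segments


def toy_embed(text):
    """ASSUMPTION: placeholder for a real embedding model (sparse word counts)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query, segments, embed=toy_embed, top_k=3):
    """Return the `top_k` (score, segment) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(s)), s) for s in segments]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Hypothetical transcript text, as it might come out of a speech-to-text step.
transcript = "the dog is sleeping on the couch while the kids play a board game in the kitchen"
segments = split_with_overlap(transcript.split(), size=8, overlap=2)
results = search("where is the dog sleeping", segments, top_k=1)
print(results[0][1])
```

The overlap means a sentence that straddles a segment boundary still appears whole in at least one window, which is presumably the point of the overlap trick mentioned above. The question/answer mismatch is not solved by any of this: the comparison only rewards segments that look like the query.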