
by xatalytic 1268 days ago

Curious if you can share more about the stack. From another comment it sounds like you're using Whisper to generate the text from audio.

My default, out-of-the-box approach would be something straightforward: use a BERT-like encoder to embed each target sentence into a FAISS index or similar (hell, podcasts aren't long, so brute-force lookup would probably do), and run the same encoder on the queries.
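To make the brute-force variant concrete, here is a rough sketch. The `embed` function is a toy bag-of-words stand-in of my own, not a real BERT-style encoder, and the sentences are made up; swapping in a proper sentence encoder would leave the lookup logic unchanged.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder (assumption): word-count vector."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, sentences, top_k=3):
    """Brute-force lookup: score every sentence against the query."""
    q = embed(query)
    scored = [(cosine(q, embed(s)), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up sentences standing in for transcript lines.
sentences = [
    "Dopamine drives motivation and pursuit.",
    "Morning sunlight helps set your circadian rhythm.",
    "Caffeine blocks adenosine receptors.",
]
results = search("how does dopamine affect motivation", sentences, top_k=1)
```

At podcast scale the linear scan is cheap; an index like FAISS only starts to matter once the number of embedded sentences gets large.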

Something I've been playing with is Flan-T5 (https://huggingface.co/docs/transformers/model_doc/flan-t5), which has really strong out-of-the-box question-answering capabilities. I could see chunking the transcript into larger blocks, then using each block as a context passage and the query as a question-oriented prompt. I've run some fine-tuning experiments with this setup for text generation (e.g. "write me a summary of Huberman's key takes on dopamine") and find that the Flan-T5 model forgets a lot of its other capabilities when subjected to fine-tuning.
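Roughly what I mean by block-as-context, query-as-question, as a sketch. The prompt wording, the `build_prompt` and `answer` names, and the choice of the `google/flan-t5-base` checkpoint are all assumptions for illustration; `answer` needs the `transformers` and `torch` packages plus a model download, so only the prompt builder runs out of the box.

```python
def build_prompt(context, question):
    """Format a context block and a user query as a QA-style prompt.

    The exact wording is an assumption; Flan-T5 accepts many phrasings.
    """
    return (
        "Answer the question using the context.\n\n"
        f"Context: {context.strip()}\n\n"
        f"Question: {question.strip()}"
    )


def answer(context, question, model_name="google/flan-t5-base"):
    """Generate an answer with Flan-T5 (downloads the model on first call)."""
    # Imported lazily so the prompt builder works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(
        build_prompt(context, question), return_tensors="pt", truncation=True
    )
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


prompt = build_prompt(
    "The transcript is split into segments before embedding.",
    "What happens before embedding?",
)
```

In practice you would load the tokenizer and model once and reuse them across queries rather than reloading on every call.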

In any event, understand if you're not inclined to share, but love talking shop on this stuff.

AlexanderTheGr8 1268 days ago

I love to talk about this stuff as well. My stack is: video -> extract audio -> Whisper for transcript -> break it into segments -> create embeddings for each segment -> get query from user -> get query embeddings -> compare and show the best results.

There are a couple other tricks I use such as an overlap window for segments and a little post-processing for better results from the comparisons; but overall, this is the gist of it.
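To give a flavor of the kind of post-processing that helps with overlapping segments, here is one plausible example, not necessarily what I actually run: neighboring windows share text, so they tend to match the same query, and a greedy non-maximum suppression over time ranges keeps only the best hit per region. The `start`/`end`/`score` fields and the numbers are hypothetical.

```python
def merge_overlapping_hits(hits):
    """Greedy non-maximum suppression over time ranges.

    Walk hits from best to worst score and drop any hit whose time range
    overlaps a hit that has already been kept.
    """
    kept = []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        overlaps = any(
            hit["start"] < k["end"] and k["start"] < hit["end"] for k in kept
        )
        if not overlaps:
            kept.append(hit)
    return kept


# Hypothetical search hits: times in seconds, made-up similarity scores.
hits = [
    {"start": 0.0, "end": 60.0, "score": 0.71},
    {"start": 30.0, "end": 90.0, "score": 0.83},
    {"start": 60.0, "end": 120.0, "score": 0.40},
    {"start": 300.0, "end": 360.0, "score": 0.65},
]
best = merge_overlapping_hits(hits)
```

Here the 30-90 window wins its neighborhood and the 300-360 window survives because it overlaps nothing that was kept.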

An issue with this approach is that question answering doesn't work as well as I'd like, because a question and its answer don't necessarily have similar embeddings ("What's the dog doing?" and "He's sleeping" can look like two completely unrelated sentences). So I'd love to look more into Flan-T5 for this.

I am "AlexanderTheGreat#9743" on Discord if you want to discuss more.

transitivebs 1268 days ago

The biggest question I have after building something similar is: what's the best way to break up transcripts into segments? You want the segments to be long enough to extract useful semantic info, but you don't want them to be too long either.

AlexanderTheGr8 1268 days ago

60-second segments with a 30-second overlap window seem to work quite well for me, but YMMV.
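A minimal sketch of that windowing, assuming the transcript arrives as a list of dicts with `start`, `end`, and `text` keys (the shape Whisper's `segments` output uses). The `window_segments` name and the rule of assigning each piece to a window by its start time are illustrative choices.

```python
def window_segments(segments, window=60.0, overlap=30.0):
    """Group timestamped transcript pieces into overlapping time windows."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    if not segments:
        return []
    step = window - overlap
    total = max(s["end"] for s in segments)
    windows = []
    start = 0.0
    while True:
        end = start + window
        # A piece belongs to a window if it starts inside it (assumption).
        texts = [s["text"].strip() for s in segments if start <= s["start"] < end]
        if texts:
            windows.append(
                {"start": start, "end": min(end, total), "text": " ".join(texts)}
            )
        if end >= total:
            break
        start += step
    return windows


# Made-up transcript: six 20-second pieces covering two minutes.
words = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]
segments = [
    {"start": 20.0 * i, "end": 20.0 * (i + 1), "text": w}
    for i, w in enumerate(words)
]
windows = window_segments(segments)
```

With these defaults each stretch of audio lands in up to two windows, so a sentence cut off at one window boundary appears whole in the next.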

xatalytic 1268 days ago

I’ll shoot you a line. As it happens I just came across Haystack (6.5k stars on GitHub) which looks like an awesome stack for this class of work.

https://haystack.deepset.ai/
