by xatalytic
1268 days ago
Curious if you can share more about the stack. From another comment it sounds like you're using Whisper to generate the text from audio. My default out-of-the-box approach would be something straightforward: a BERT-like encoder to embed each target sentence into a FAISS index or similar (hell, podcasts aren't long, so it could be a brute-force lookup, I suppose), with the same encoder running on the queries. Something I've been playing with is Flan-T5 (https://huggingface.co/docs/transformers/model_doc/flan-t5), which has really strong out-of-the-box question answering capabilities. I could see chunking the transcript into larger blocks, then using each block as a context passage and the query as a question-oriented prompt. I've run some fine-tuning experiments with this setup for text generation (e.g. "write me a summary of Huberman's key takes on dopamine") and find that the Flan-T5 model forgets a lot of its other capabilities when subjected to fine-tuning. In any event, I understand if you're not inclined to share, but I love talking shop on this stuff.
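For anyone following along, here is a minimal sketch of the brute-force lookup idea described above. It is not the stack used by either poster. The `embed` function is a stand-in bag-of-words vector so the example runs with only the standard library; in practice you would swap in a BERT-style sentence encoder, and the function names here are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a real sentence encoder (assumption, not the real stack).

    Returns a sparse bag-of-words count vector so the sketch has no
    dependencies; a BERT-style encoder would return a dense vector instead.
    """
    words = (w.strip(".,!?\"'").lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, sentences, top_k=3):
    """Brute-force lookup: score every sentence against the query."""
    q = embed(query)
    scored = [(cosine(q, embed(s)), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

With a few thousand transcript sentences per episode, scoring every sentence like this is cheap; an approximate index such as FAISS only starts to matter once the corpus gets much larger.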
There are a couple of other tricks I use, such as an overlap window for segments and a little post-processing to get better results from the comparisons, but overall this is the gist of it.
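The overlap window could look roughly like the sketch below. The window size, the overlap, and the function name are assumptions for illustration, not the values actually used here. Each segment shares its last sentence(s) with the start of the next one, so a thought that straddles a boundary still appears whole in at least one segment.

```python
def overlapping_segments(sentences, size=4, overlap=1):
    """Group sentences into windows of `size` that share `overlap` sentences.

    `size` and `overlap` are illustrative defaults, not tuned values.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap
    segments = []
    for start in range(0, len(sentences), step):
        segments.append(" ".join(sentences[start:start + size]))
        # Stop once a window has reached the end of the transcript, so we
        # do not emit a trailing segment made only of repeated sentences.
        if start + size >= len(sentences):
            break
    return segments
```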
An issue with this approach is that question answering doesn't work as well as I'd like, because a question and its answer don't necessarily have similar embeddings ("What's the dog doing?" and "He's sleeping" can look like two completely independent sentences). So I would love to look further into Flan-T5 for this.
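A rough sketch of the "block as context, query as question" idea with Flan-T5 follows. The prompt wording, the `google/flan-t5-base` checkpoint, and the generation settings are all assumptions and untuned; the helper names are hypothetical. The `transformers` import is done inside `answer` so the prompt builder can be tried without the library installed.

```python
def build_prompt(context, question):
    """Format a transcript block and a user query as a QA-style prompt.

    The wording is an assumption; Flan-T5 accepts many instruction phrasings.
    """
    return (
        "Answer the question using the context.\n\n"
        f"Context: {context.strip()}\n\n"
        f"Question: {question.strip()}"
    )


def answer(context, question, model_name="google/flan-t5-base"):
    """Generate an answer with Flan-T5 (downloads the model on first use)."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Truncate long blocks to the tokenizer's maximum input length.
    inputs = tokenizer(
        build_prompt(context, question), return_tensors="pt", truncation=True
    )
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Because the model reads the question and the passage together, it sidesteps the problem of a question and its answer landing far apart in embedding space. The embedding search would then only need to pick candidate blocks to feed in as context.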
I am "AlexanderTheGreat#9743" on Discord if you want to discuss more.