by mkauffman23
326 days ago
In this blog post we detail the API design and technical decisions we made when adding audio and video support to Ragie's RAG service. We explore some of the approaches we tried and the rationale behind what we landed on. Worth a read if you're building similar systems. Here's a TL;DR:
- Built a full pipeline that processes audio/video → transcription + vision descriptions → chunking → indexing
- Audio: faster-whisper with large-v3-turbo (4x faster than vanilla Whisper)
- Video: Chose Vision LLM descriptions over native multimodal embeddings (2x faster, 6x cheaper, better results)
- 15-second video chunks hit the sweet spot for detail vs context
- Source attribution with direct links to exact timestamps

Happy to answer any further questions folks might have!
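For anyone who wants something concrete, here is a rough sketch of the 15-second chunking and timestamp attribution ideas from the TL;DR. It is not Ragie's actual code: the `Segment` and `Chunk` types, the `chunk_segments` and `timestamp_link` helpers, and the example URL are all hypothetical, and the sketch assumes the transcription step yields segments with start and end times in seconds. The link uses the W3C Media Fragments `#t=start,end` form, which may or may not match how a given player handles deep links.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A transcribed span of speech (hypothetical shape), times in seconds."""
    start: float
    end: float
    text: str


@dataclass
class Chunk:
    """A fixed-length window of the media with the text that falls in it."""
    start: float
    end: float
    text: str


def chunk_segments(segments, window=15.0):
    """Group segments into fixed windows, keyed by each segment's start time.

    A segment that straddles a boundary is assigned to the window it starts in;
    windows with no speech are skipped.
    """
    buckets = {}
    for seg in segments:
        index = int(seg.start // window)
        buckets.setdefault(index, []).append(seg.text.strip())
    return [
        Chunk(start=i * window, end=(i + 1) * window, text=" ".join(texts))
        for i, texts in sorted(buckets.items())
    ]


def timestamp_link(base_url, chunk):
    """Build a deep link to the chunk using a Media Fragments time range."""
    return f"{base_url}#t={chunk.start:g},{chunk.end:g}"


segments = [
    Segment(0.0, 4.0, "hello"),
    Segment(14.0, 17.0, "world"),
    Segment(31.0, 33.0, "again"),
]
chunks = chunk_segments(segments)
links = [timestamp_link("https://example.com/video.mp4", c) for c in chunks]
```

Each chunk keeps its own start and end, so whatever gets indexed can always be traced back to the exact spot in the source media.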