|
|
|
|
|
by carpo
32 days ago
|
|
This is great. I wish I had enough ram for a local model. I just spent the last few weeks writing something very similar, but I made it a local Electron app with Whisper, ffmpeg and I added semantic search and embeddings for chatting with the videos. It talks to Claude for the vision analysis, tagging and video chat. Do you only send one image for yours? I used a customised scene detection algorithm to find multiple different images per video and then send them all in one request to Claude (along with the subtitles). It's definitely the most expensive part. Using Sonnet 4.6 for the analysis and Haiku for the tagging costs about $1 for an hour of footage, I can imagine it would be slow locally. |
|
There are other options that are good too. Gemini 3.1 Flash Lite is great for this kind of thing (NOT Gemini 3.5 Flash though - the pricing for that is bad).
https://openrouter.ai/google/gemma-4-31b-it