Hacker News
by carpo 32 days ago
This is great. I wish I had enough RAM for a local model. I just spent the last few weeks writing something very similar, but I made it a local Electron app with Whisper and ffmpeg, and I added semantic search and embeddings for chatting with the videos. It talks to Claude for the vision analysis, tagging and video chat. Do you only send one image for yours? I used a customised scene detection algorithm to find multiple different images per video and then send them all in one request to Claude (along with the subtitles). It's definitely the most expensive part: using Sonnet 4.6 for the analysis and Haiku for the tagging costs about $1 per hour of footage. I can imagine it would be slow locally.

Try some of the models on OpenRouter if you are looking to save money. Gemma 4 31B is $0.12/M input, $0.37/M output vs $1/M input, $5/M output for Haiku.

There are other options that are good too. Gemini 3.1 Flash Lite is great for this kind of thing (NOT Gemini 3.5 Flash though - the pricing for that is bad).

https://openrouter.ai/google/gemma-4-31b-it
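
To put those per-million-token prices in per-request terms, here is a rough sketch. The prices are the ones quoted above; the model keys and the token counts in the example are hypothetical and only there to show the arithmetic.

```python
# Prices in USD per million tokens, as quoted in the comment above.
# The dictionary keys are illustrative labels, not official model IDs.
PRICES = {
    "gemma-4-31b": {"input": 0.12, "output": 0.37},
    "haiku": {"input": 1.00, "output": 5.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the quoted per-million rates."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 1M input tokens and 100k output tokens.
for model in PRICES:
    print(model, round(request_cost(model, 1_000_000, 100_000), 4))
```

Image-heavy requests are dominated by input tokens, so the input price matters most for this kind of workload.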

Cheers, I'll give it a try. How are those models at returning structured results? When I was writing the prompts for the analysis step and testing with older Claude models, it would have trouble structuring the XML consistently. Sonnet 4.6 handles it really well.
Use function calling/tool use, not XML output. The models are all trained for that now.

I.e., instead of telling it to generate

  <name>Name</name>
  <age>19</age>
  <address>whatever</address>
give it a function

  details(name: string, age: int, address: string)
Under the hood that is a JSON schema, and the models handle it very well. Here are the Claude docs, but the other providers are all similar: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
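
As a minimal sketch of what that looks like in practice, the `details` function above can be written as a tool definition with a JSON Schema under `input_schema`, which is the shape Claude's tool use expects (OpenAI-style APIs carry the same schema under a `parameters` key). The description string and the `check_tool_input` helper are hypothetical additions for illustration; the helper only handles flat string/integer fields and is not a full JSON Schema validator.

```python
# Tool definition: a name, a description, and a JSON Schema for the arguments.
# The description text is made up for this example.
DETAILS_TOOL = {
    "name": "details",
    "description": "Record a person's details.",
    "input_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "address": {"type": "string"},
        },
        "required": ["name", "age", "address"],
    },
}

# Only the two JSON types used above are supported by this sketch.
_JSON_TYPES = {"string": str, "integer": int}


def check_tool_input(tool: dict, tool_input: dict) -> list:
    """Return a list of problems; an empty list means the input fits the schema."""
    schema = tool["input_schema"]
    problems = []
    for field in schema["required"]:
        if field not in tool_input:
            problems.append(f"missing field: {field}")
    for field, value in tool_input.items():
        spec = schema["properties"].get(field)
        if spec is None:
            problems.append(f"unexpected field: {field}")
            continue
        expected = _JSON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {field}: expected {spec['type']}")
    return problems


print(check_tool_input(DETAILS_TOOL, {"name": "Name", "age": 19, "address": "whatever"}))
```

Checking the returned arguments like this is optional, but it is a cheap guard when switching between models.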
Hey, just want to thank you for this suggestion. Spent this morning swapping to OpenRouter and changing all my prompts to use tools instead of XML. Not only are Gemma and Gemini much cheaper, the tool calls also use far fewer output tokens. The cost to analyse one 20-minute video with 10 snapshots went from $0.21 to $0.009, and I'm even sending full HD snapshots instead of the 960x540 ones I was sending before (to save costs). The results so far are pretty good. It looks like the larger images are giving the model more context, so in some cases the cheaper models' results are better than the expensive models'. I'm going to run this over a few hundred videos today and see how it goes in bulk!
Ha! So glad it helped you!

Very interested in the full run details.

Yeah, it's been awesome! I'm so excited about tool calls and function use, the possibilities are huge. I ran it over 1494 videos that range in length from a few seconds to over 3 hours. Total duration was 260 hours and total size 3795 GB. I don't know exactly how long it took to run, as I found some bugs I needed to fix when processing mkv files, but it was probably around 24 hours in total. That wasn't all LLM requests; it also included the local Whisper transcription and frame extraction / analysis. I used gemini-3.1-flash-lite-preview for the content analysis and tagging. Analysis cost $9.22 and tagging cost $2.72, and the results seem great (for comparison, I did 885 videos a few weeks ago with Sonnet and it cost $130 in total). Gemini seems much less verbose than Sonnet, even with the same prompt, so the descriptions are much shorter, but they seem very good. The tagging is great. Another bonus is that with the larger screenshots being sent, the LLM can now read much more of the text it sees on screen. Some of my videos are top-down showing me drawing and writing, and now it picks that up, so it's all indexed and searchable. I tested a few models with the RAG Chat feature, and the best one so far is GPT4.1-Mini. Before, asking questions about the library or a video cost around 4 cents per query; now it's averaging about half a cent.
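
For a rough sense of scale, the totals quoted above can be turned into per-video and per-hour figures. This uses only the numbers from the comment; the two runs covered different sets of videos (and different snapshot sizes), so the ratio is indicative rather than a controlled comparison.

```python
# Figures quoted in the comment above (USD).
gemini_total = 9.22 + 2.72   # analysis + tagging
gemini_videos = 1494
gemini_hours = 260
sonnet_total = 130.0         # earlier run with Sonnet
sonnet_videos = 885

gemini_per_video = gemini_total / gemini_videos
gemini_per_hour = gemini_total / gemini_hours
sonnet_per_video = sonnet_total / sonnet_videos

print(f"Gemini run: ${gemini_total:.2f} total, "
      f"${gemini_per_video:.4f} per video, ${gemini_per_hour:.3f} per hour of footage")
print(f"Sonnet run: ${sonnet_per_video:.4f} per video")
print(f"Per-video ratio: {sonnet_per_video / gemini_per_video:.1f}x")
```

That works out to a bit under one cent per video for the Gemini run versus roughly 15 cents per video for the Sonnet run.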
Very interesting. Thank you!
Not one image - 5 frames per clip, sent in a single request with a transcript snippet. So the multi-frame + subtitles in one call part is the same as yours.

But yeah, how it picks the frames is the weak point here. Scene detection would definitely help - this is #1 on the roadmap.

Could you share how your scene-detection picks the frames?

---

For the vector search, I made the trade-off of leaving it out and keeping things simple with plain Markdown files, for more portability. The knowledge travels with the files when an SSD moves, there's no index to keep in sync, and plain text outlives the tool. But the other path you mentioned is interesting to explore as well.

I originally limited mine to 10 frames spread evenly throughout the video, but it missed a fair bit of context at the analysis step, and didn't scale with length. So now when a video is loaded the app extracts a bunch of frames across the entire video, then calculates an image histogram for each frame and compares its similarity to the previous frame's. There's some configuration so it doesn't send too many to the LLM, but still gets a good cross-section of frames to send.
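
A minimal sketch of that histogram idea, not the actual ClipScape code. It assumes frames have already been extracted and flattened to 8-bit grayscale pixel lists; the bin count, the histogram-intersection measure, the `threshold` and the `max_frames` cap are all illustrative choices.

```python
from typing import List, Sequence

Frame = Sequence[int]  # flattened, non-empty list of 8-bit grayscale pixels (0-255)


def histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Normalised intensity histogram, so frames of any size are comparable."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // 256] += 1
    total = len(frame)
    return [count / total for count in counts]


def similarity(a: List[float], b: List[float]) -> float:
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(x, y) for x, y in zip(a, b))


def pick_frames(frames: Sequence[Frame], threshold: float = 0.7,
                max_frames: int = 10) -> List[int]:
    """Return indices of frames that differ enough from the frame before them."""
    if not frames:
        return []
    picked = [0]  # always keep the first frame
    previous = histogram(frames[0])
    for index in range(1, len(frames)):
        current = histogram(frames[index])
        if similarity(previous, current) < threshold:
            picked.append(index)
        previous = current
    if len(picked) > max_frames:
        # Thin evenly so the selection still spans the whole video.
        step = len(picked) / max_frames
        picked = [picked[int(i * step)] for i in range(max_frames)]
    return picked


dark, bright = [10] * 100, [240] * 100
print(pick_frames([dark, dark, bright, bright, dark]))
```

Comparing against the previous frame catches hard cuts; slow fades may need a comparison against the last picked frame instead.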

You could also just use FFmpeg as it can do scene detection too. I tested both but liked the results from the histogram analyzer more.
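
For reference, a sketch of the FFmpeg route using its `select` filter and the built-in `scene` score. The file names, the 0.4 threshold and the `scene_frames_command` helper are assumptions for illustration; newer FFmpeg builds may prefer `-fps_mode vfr` over `-vsync vfr`.

```python
import shlex
from typing import List


def scene_frames_command(video: str, out_pattern: str,
                         threshold: float = 0.4) -> List[str]:
    """Build an ffmpeg argv that writes one image per detected scene change."""
    return [
        "ffmpeg",
        "-i", video,
        # `scene` is ffmpeg's per-frame scene-change score between 0 and 1;
        # select keeps only frames whose score exceeds the threshold.
        "-vf", f"select='gt(scene,{threshold})'",
        # Variable frame rate output, so skipped frames are not duplicated.
        "-vsync", "vfr",
        out_pattern,
    ]


# Print the command; pass the list to subprocess.run(cmd, check=True) to execute it.
cmd = scene_frames_command("input.mp4", "scene_%04d.jpg")
print(shlex.join(cmd))
```

Lower thresholds yield more frames, so the value usually needs tuning per library.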

Yeah, Markdown works well if you're going to search through it with Claude Code or something like that. I built ClipScape as an Electron app with a local SQLite database, as I wanted an interface I could search and chat in and see the relevant thumbnails.