|
I've almost finished the first version of a desktop video library app I've been writing for myself. I had the idea last year, but sending images to an LLM was too expensive to run over about 1500 videos; now the cost is fairly reasonable. In the app you pick a folder of videos and it stores each video's path and metadata, extracts frames as images, and uses a local Whisper model to transcribe the audio into subtitles. It then sends a selection of the snapshots and the subtitles to an LLM to be summarised. The LLM sends back an XML document with a bunch of details about the video, including a title, a detailed summary, and information on objects, text, people, animals, locations, distinct moments etc. Some of these are also timestamped and most have relationships (e.g. this object belongs to this location, this text was on this object). I store all that in a local SQLite database, then make another LLM call with the summary asking for categories and tags, and store those in the DB against each video. The app UI is essentially tags you can click to narrow down the returned videos. I plan on adding natural language search (maybe RAG -- need to look into the latest best way), have half added Projects so I can group videos after finding the ones I want, and have a bunch of other ideas too. I've been programming this with some early help from Aider and Claude Sonnet. It's getting a bit complex now, so I make the majority of code changes myself, though the AI has done a fair bit. It's been heaps of fun, and I'm using it now in "production" (haha - on my PC). |
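
A rough sketch of the store-and-filter part of that pipeline, for anyone curious how it could fit together. This is not the app's actual code: the XML shape, the table and column names, and the helper functions (`create_schema`, `store_summary`, `add_tags`, `videos_with_all_tags`) are all made up for illustration, and it only covers objects, not people, text, moments and so on. It parses an LLM-style XML summary, stores it in SQLite, attaches tags, and returns the videos that carry every selected tag, which is roughly what clicking tags to narrow down results amounts to.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML shape; the real document returned by the LLM is not shown in the post.
SAMPLE_XML = """
<video>
  <title>Beach trip</title>
  <summary>A family walks along the beach with their dog.</summary>
  <objects>
    <object name="surfboard" location="beach" timestamp="00:01:12"/>
    <object name="umbrella" location="beach" timestamp="00:02:40"/>
  </objects>
</video>
"""


def create_schema(conn):
    """Create illustrative tables for videos, detected objects, and tags."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS videos (
            id INTEGER PRIMARY KEY,
            path TEXT UNIQUE NOT NULL,
            title TEXT,
            summary TEXT
        );
        CREATE TABLE IF NOT EXISTS objects (
            id INTEGER PRIMARY KEY,
            video_id INTEGER NOT NULL REFERENCES videos(id),
            name TEXT NOT NULL,
            location TEXT,
            timestamp TEXT
        );
        CREATE TABLE IF NOT EXISTS tags (
            id INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS video_tags (
            video_id INTEGER NOT NULL REFERENCES videos(id),
            tag_id INTEGER NOT NULL REFERENCES tags(id),
            PRIMARY KEY (video_id, tag_id)
        );
    """)


def store_summary(conn, path, xml_text):
    """Parse the XML summary and insert the video plus its objects; return the video id."""
    root = ET.fromstring(xml_text.strip())
    cur = conn.execute(
        "INSERT INTO videos (path, title, summary) VALUES (?, ?, ?)",
        (path, root.findtext("title"), root.findtext("summary")),
    )
    video_id = cur.lastrowid
    for obj in root.iterfind("objects/object"):
        conn.execute(
            "INSERT INTO objects (video_id, name, location, timestamp) "
            "VALUES (?, ?, ?, ?)",
            (video_id, obj.get("name"), obj.get("location"), obj.get("timestamp")),
        )
    return video_id


def add_tags(conn, video_id, tag_names):
    """Attach tags to a video, creating any tag rows that do not exist yet."""
    for name in tag_names:
        conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
        tag_id = conn.execute(
            "SELECT id FROM tags WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO video_tags (video_id, tag_id) VALUES (?, ?)",
            (video_id, tag_id),
        )


def videos_with_all_tags(conn, tag_names):
    """Return titles of videos that have every one of the selected tags."""
    selected = sorted(set(tag_names))
    if not selected:
        # No tags selected: nothing to narrow by, so return every video.
        rows = conn.execute("SELECT title FROM videos ORDER BY title").fetchall()
        return [row[0] for row in rows]
    placeholders = ",".join("?" for _ in selected)
    rows = conn.execute(
        f"""
        SELECT v.title
        FROM videos v
        JOIN video_tags vt ON vt.video_id = v.id
        JOIN tags t ON t.id = vt.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY v.id
        HAVING COUNT(DISTINCT t.name) = ?
        ORDER BY v.title
        """,
        (*selected, len(selected)),
    ).fetchall()
    return [row[0] for row in rows]
```

Each extra tag tightens the `HAVING` count, so the result set can only shrink as more tags are selected. Keeping tags in their own table with a join table, rather than as a comma-separated column, is one way to make that query simple.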