Hacker News
by danishSuri1994 218 days ago
Really interesting direction. The node-based canvas feels like a more scalable abstraction for video automation than the usual chat-only interface. I’m curious how you’re handling long-form content where temporal context matters (e.g., emotional shifts, pacing, narrative cues).

Multimodal models are good at frame-level recognition, but editing requires understanding relationships between scenes. Have you found any methods that work reliably there?

2 comments

Side note, just for context, since there seem to be primarily video hobbyists responding to the OP:

Node-based workflows are typical in NLE software. See the Fusion and Color pages in DaVinci Resolve, standalone Fusion (compositing), etc. Industry folks will take to this node-based canvas with ease.

Great question @danishSuri1994

hey, thanks for the comment!

we've actually found that multimodal models are surprisingly good at maintaining temporal context as well

that being said, we also do a bunch of additional processing using more traditional CV / audio analysis to extract this information (both frame-level and temporal) as part of video understanding

for example, with mean-motion analysis you can see how subjects move over a period of time, which helps determine where important things are happening in the video and ultimately leads to better edit placements.
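
to make the mean-motion idea concrete, here's a rough sketch of one common way to approximate it: average the absolute pixel difference between consecutive grayscale frames, then flag the stretches where that score stays above a threshold. this is an illustration only, not necessarily how the product computes it. the function names (`mean_motion`, `high_motion_spans`), the frame representation (nested lists of 0-255 values), and the threshold are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# Assumed representation: a grayscale frame is rows of 0-255 pixel values.
Frame = Sequence[Sequence[int]]


def mean_motion(frames: Sequence[Frame]) -> List[float]:
    """Mean absolute pixel difference for each pair of consecutive frames.

    scores[i] describes the change between frames[i] and frames[i + 1].
    """
    scores: List[float] = []
    for prev, curr in zip(frames, frames[1:]):
        total = 0
        count = 0
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                total += abs(a - b)
                count += 1
        scores.append(total / count if count else 0.0)
    return scores


def high_motion_spans(scores: Sequence[float], threshold: float) -> List[Tuple[int, int]]:
    """Inclusive (start, end) index ranges where the score is >= threshold."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a high-motion stretch begins here
        elif start is not None:
            spans.append((start, i - 1))  # stretch ended on the previous index
            start = None
    if start is not None:
        spans.append((start, len(scores) - 1))  # stretch runs to the end
    return spans


# Tiny synthetic clip: two still frames, a sudden change, then still again.
still_dark = [[0, 0], [0, 0]]
still_bright = [[100, 100], [100, 100]]
clip = [still_dark, still_dark, still_bright, still_bright]

scores = mean_motion(clip)
spans = high_motion_spans(scores, threshold=50.0)
```

spans found this way are only candidates for where "important things" happen. a real pipeline would likely smooth the scores over a time window and combine them with audio and model-based signals before placing an edit.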