by karimf
66 days ago
Nothing unique, it just takes a snapshot when it processes the input. Even processing a single image increases the TTFT by ~0.5s on my machine, so for now it seems impossible to feed it live video and expect a real-time response. As for the video capability, I haven't tested it myself, but here's a benchmark/comparison from Google [0].

[0] https://huggingface.co/blog/gemma4#video-understanding
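
To put the ~0.5s-per-image figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the added latency scales linearly with the number of frames, which is an assumption for illustration and not something measured above; the function names are hypothetical.

```python
def added_ttft_seconds(num_frames: int, per_image_seconds: float = 0.5) -> float:
    """Estimated extra time-to-first-token for a batch of frames.

    Assumes (hypothetically) that the cost grows linearly with frame count.
    """
    return num_frames * per_image_seconds


def max_sustainable_fps(per_image_seconds: float = 0.5) -> float:
    """Highest capture rate at which processing could keep up with the feed,
    under the same linear-cost assumption."""
    return 1.0 / per_image_seconds


if __name__ == "__main__":
    print(added_ttft_seconds(4))    # extra latency for 4 frames
    print(max_sustainable_fps())    # frames per second the machine could absorb
```

Under that assumption the ceiling is a couple of frames per second before any text is generated, which is why a live feed looks out of reach at this per-image cost.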
|
Sure, maybe it's still frame-by-frame, but so fast and so often that the model retains a rolling context of what's going on and can cleanly answer temporal questions: "how many packages were delivered over the last hour", etc.
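
A minimal sketch of what such a rolling context could look like, assuming frames are sampled at a fixed minimum interval and anything older than a time window is dropped. `Frame`, `RollingContext`, and the per-frame `description` field are hypothetical names for illustration, not part of any model's API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Frame:
    timestamp: float   # capture time in seconds
    description: str   # hypothetical per-frame summary (e.g. model output)


class RollingContext:
    """Keeps only recent frames, sampled no more often than min_interval."""

    def __init__(self, window_seconds: float, min_interval: float):
        self.window_seconds = window_seconds
        self.min_interval = min_interval
        self._frames = deque()

    def add(self, frame: Frame) -> bool:
        """Store the frame unless it arrives too soon after the last kept one.

        Returns True if the frame was kept, False if it was skipped.
        """
        if self._frames and frame.timestamp - self._frames[-1].timestamp < self.min_interval:
            return False
        self._frames.append(frame)
        # Evict frames that have fallen out of the time window.
        cutoff = frame.timestamp - self.window_seconds
        while self._frames and self._frames[0].timestamp < cutoff:
            self._frames.popleft()
        return True

    def snapshot(self) -> list:
        """Frames currently in the window, oldest first."""
        return list(self._frames)
```

For the "last hour" question, something like `RollingContext(window_seconds=3600, min_interval=2.0)` would hold the sampled frames (or their summaries) that get passed to the model alongside the question; the interval would have to be chosen to fit the per-frame processing cost.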