Hacker News
by keefle 846 days ago
How would the results compare to:

1. Video frames are sampled (based on frame clarity)

2. The images are fed to OCR, with their content output as:

Frame X: <content of the frame>

3. The accumulated text is given to an average LLM (Mistral) along with the same request the author made (creating a JSON file containing book information)

Wouldn't we get something similar, perhaps with a more sophisticated model? If so, Gemini Pro's monopoly on video processing (specifically for handling text that appears in the video) isn't really a sustainable advantage. Or am I missing something? Is this more than a fancy OCR hooked into an LLM, in that the model can tell that the text is on a book, for instance?
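To make the three steps above concrete, here is a minimal sketch of the sampling and prompt-building parts. It assumes OCR has already run, so each frame carries its recognized text; the `Frame` class, the `clarity` score, and the function names are all hypothetical, and no particular OCR engine or LLM API is implied.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    index: int      # position of the frame in the video
    clarity: float  # hypothetical sharpness score, higher means clearer
    text: str       # stand-in for the OCR output of this frame


def sample_clearest(frames, window):
    """Step 1: keep the clearest frame from each consecutive window."""
    picked = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        picked.append(max(chunk, key=lambda f: f.clarity))
    return picked


def build_prompt(frames, request):
    """Steps 2 and 3: emit 'Frame X: <content>' lines, then the request."""
    lines = [f"Frame {f.index}: {f.text}" for f in frames if f.text.strip()]
    return "\n".join(lines) + "\n\n" + request


# Illustrative data only, not real OCR results.
frames = [
    Frame(0, 0.2, "blurry"),
    Frame(1, 0.9, "Dune"),
    Frame(2, 0.5, ""),
    Frame(3, 0.4, "Emma"),
]
picked = sample_clearest(frames, window=2)
prompt = build_prompt(picked, "Return the books as JSON.")
```

The resulting `prompt` string is what would be handed to the LLM. Note that the text-only prompt drops everything OCR cannot see, such as whether the text sits on a book spine, which is the gap the question is asking about.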

1 comments

Sure, you can slice a video up into images and process them separately - that's apparently how Gemini Pro works: it uses one frame from every second of video.

But you still need a REALLY long context length to work with that information - the magic combination here is 1,000,000 tokens combined with good multi-modal image inputs.
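A rough back-of-the-envelope sketch of why context length matters here. The one-frame-per-second rate and the 1,000,000-token window come from the comment above; the tokens-per-frame figure is a made-up placeholder, since the real cost depends on the model, and the function names are hypothetical.

```python
def video_token_estimate(duration_seconds, tokens_per_frame, frames_per_second=1.0):
    """Estimate the tokens needed to hold every sampled frame of a video."""
    frames = int(duration_seconds * frames_per_second)
    return frames * tokens_per_frame


def fits_in_context(duration_seconds, tokens_per_frame, context_tokens=1_000_000):
    """Check whether the sampled frames fit inside the context window."""
    return video_token_estimate(duration_seconds, tokens_per_frame) <= context_tokens


# 250 tokens per frame is an assumed, illustrative value.
one_minute = video_token_estimate(60, tokens_per_frame=250)
one_hour_fits = fits_in_context(3600, tokens_per_frame=250)
two_hours_fit = fits_in_context(7200, tokens_per_frame=250)
```

Under that assumed per-frame cost, an hour of video fits in the window but two hours does not, and a model with a much smaller context would run out after only a few minutes of frames.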

I see, but I was wondering about the partial transferability of this feature to other LLMs

But fair enough, context length is key in this scenario