by keefle
846 days ago
How would the results compare to this pipeline?

1. Video frames are sampled (based on frame clarity).
2. The images are fed to OCR, with their content output as: Frame X: <content of the frame>
3. The accumulated text is given to an average LLM (Mistral) along with the same request the author made (creating a JSON file containing book information).

Wouldn't we get something similar, maybe if a more sophisticated AI is used? So Gemini Pro's monopoly on video processing (specifically when it comes to handling text present inside the video) is not really a sustainable advantage? Or am I missing something, i.e. is this something beyond just a fancy OCR hooked into an LLM, since the model would be able to tell that this text is on a book, for instance?
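The three steps above can be sketched roughly as follows. This is only a minimal illustration, not anyone's actual implementation: frames are assumed to be small grayscale grids, `sharpness` is a crude stand-in for a real clarity metric, and `ocr` is a hypothetical callable you would replace with a real OCR engine. The final prompt would then be sent to whichever LLM you choose; that call is left out.

```python
from typing import Callable, List, Sequence

# Assumption: a frame is a grayscale image given as rows of 0-255 pixel values.
Frame = Sequence[Sequence[int]]


def sharpness(frame: Frame) -> float:
    """Crude clarity score: mean squared difference between horizontal neighbours."""
    total, count = 0, 0
    for row in frame:
        for left, right in zip(row, row[1:]):
            total += (left - right) ** 2
            count += 1
    return total / count if count else 0.0


def sample_clear_frames(frames: List[Frame], keep: int) -> List[int]:
    """Step 1: return indices of the `keep` sharpest frames, in original order."""
    ranked = sorted(range(len(frames)), key=lambda i: sharpness(frames[i]), reverse=True)
    return sorted(ranked[:keep])


def build_prompt(
    frames: List[Frame],
    keep: int,
    ocr: Callable[[Frame], str],  # hypothetical OCR function, supplied by the caller
    request: str,
) -> str:
    """Steps 2 and 3: OCR the chosen frames and assemble the text for the LLM."""
    lines = [f"Frame {i}: {ocr(frames[i])}" for i in sample_clear_frames(frames, keep)]
    return request + "\n\n" + "\n".join(lines)
```

Note that the prompt only carries the text that OCR recovered, so anything the model would infer from the picture itself (for example, that the text is printed on a book spine) is lost at step 2.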
|
But you still need a REALLY long context length to work with that information - the magic combination here is 1,000,000 tokens combined with good multi-modal image inputs.
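For a back-of-the-envelope check on the context-length point, here is a small sketch that estimates whether a pile of accumulated text fits in a given window. The ratio of 4 characters per token is an assumed rule of thumb, not a property of any particular tokenizer, and the function names are made up for illustration.

```python
import math
from typing import Iterable


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is an assumed heuristic."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    texts: Iterable[str],
    context_window: int,
    chars_per_token: float = 4.0,
) -> bool:
    """True if the combined estimate for all texts is within the window."""
    total = sum(estimate_tokens(t, chars_per_token) for t in texts)
    return total <= context_window
```

For example, `fits_in_context(per_frame_texts, 1_000_000)` would check the combined text against the 1,000,000-token window mentioned above. Image inputs are counted differently by each model, so this only covers the text side.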