by fzysingularity 213 days ago

We ran a small visual benchmark [1] of GPT, Gemini, Claude, and our new visual agent Orion [2] on a handful of visual tasks: object detection, segmentation, OCR, image/video generation, and multi-step visual reasoning.

The surprising part: models that ace benchmarks often fail on seemingly trivial visual tasks, while others succeed in unexpected places. We show concrete examples, side-by-side outputs, and how each model breaks when chaining multiple visual steps.

We go into more detail in our technical whitepaper [3]. Play around with Orion for free here [4].

[1] Showdown: https://chat.vlm.run/showdown

[2] Learn about Orion: https://vlm.run/orion

[3] Technical whitepaper: https://vlm.run/orion/whitepaper

[4] Chat with Orion: https://chat.vlm.run/

Happy to answer questions or dig into specific cases in the comments.