by sandropuppo 511 days ago
ByteDance has open-sourced UI-TARS, reporting a 33% improvement on the OSWorld benchmark. The technical report also includes benchmark evaluations across web, desktop, and mobile platforms.

UI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components—perception, reasoning, grounding, and memory—within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.

In computer-use agent (CUA) terms, that means the four components usually handled by separate modules (perception, reasoning, grounding, and memory) all live in one unified VLM.
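
To make the "single model, no predefined workflow" idea concrete, here is a rough sketch of what an end-to-end agent loop could look like. This is not UI-TARS's actual API: `Step`, `run_agent`, `stub_model`, and the `capture`/`execute` callbacks are hypothetical names made up for illustration, and the stub stands in for the real VLM. The point is only that one call takes the screenshot, the instruction, and the history, and returns the thought plus a grounded action.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    thought: str                      # reasoning text produced by the model
    action: str                       # e.g. "click", "type", "done"
    target: Tuple[int, int] = (0, 0)  # grounded screen coordinates


# Hypothetical interface: (screenshot, instruction, history) -> next Step.
Model = Callable[[bytes, str, List[Step]], Step]


def run_agent(
    model: Model,
    instruction: str,
    capture: Callable[[], bytes],
    execute: Callable[[Step], None],
    max_steps: int = 10,
) -> List[Step]:
    """Loop until the model says it is done or the step budget runs out."""
    history: List[Step] = []  # plays the role of memory
    for _ in range(max_steps):
        screenshot = capture()                         # input for perception
        step = model(screenshot, instruction, history) # one call: reason + ground
        history.append(step)
        if step.action == "done":
            break
        execute(step)                                  # apply the action to the GUI
    return history


def stub_model(screenshot: bytes, instruction: str, history: List[Step]) -> Step:
    """Stand-in for the VLM: clicks twice, then reports completion."""
    if len(history) < 2:
        return Step("button visible", "click", (10 * (len(history) + 1), 20))
    return Step("task complete", "done")
```

In a modular framework, the body of the loop would instead call a separate detector, planner, and grounding module in sequence, each with its own hand-written glue.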