by sandropuppo
511 days ago
ByteDance has open-sourced UI-TARS, reporting a 33% improvement on the OSWorld benchmark. The technical report also includes benchmark evaluations across web, desktop, and mobile platforms. UI-TARS is a native GUI agent model designed to interact with graphical user interfaces (GUIs) through human-like perception, reasoning, and action. Unlike traditional modular frameworks, it integrates the four key components of CUA agents (perception, reasoning, grounding, and memory) within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.
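
To make the "single model, end to end" idea concrete, here is a minimal sketch of what such an agent loop could look like. It is not UI-TARS code: `Step`, `run_agent`, `stub_model`, and the `capture`/`execute` callbacks are hypothetical names invented for illustration, and the stub stands in for a real VLM. The point is only that one model call takes the instruction, the current screenshot, and the history (memory), and returns both a thought (reasoning) and a grounded action, with no hand-written workflow in between.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One model output: reasoning text plus a grounded action (hypothetical format)."""
    thought: str
    action: str                       # e.g. "click", "type", "done"
    target: Tuple[int, int] = (0, 0)  # screen coordinates the action is grounded to


# Hypothetical model signature: (instruction, screenshot, history) -> Step
Model = Callable[[str, bytes, List[Step]], Step]


def run_agent(
    model: Model,
    instruction: str,
    capture: Callable[[], bytes],
    execute: Callable[[Step], None],
    max_steps: int = 10,
) -> List[Step]:
    """Drive the GUI with a single model; no task-specific rules live in this loop."""
    history: List[Step] = []          # memory: earlier thoughts and actions
    for _ in range(max_steps):
        screenshot = capture()        # perception input: raw pixels
        step = model(instruction, screenshot, history)
        history.append(step)
        if step.action == "done":     # the model decides when the task is finished
            break
        execute(step)                 # apply the grounded action to the GUI
    return history


def stub_model(instruction: str, screenshot: bytes, history: List[Step]) -> Step:
    """Stand-in for a VLM: clicks once, then reports completion."""
    if not history:
        return Step("The search box is at the top.", "click", (100, 20))
    return Step("Task finished.", "done")
```

With the stub, `run_agent(stub_model, "open search", lambda: b"", print)` performs one click and then stops. A modular framework would instead split this into separate perception, planning, and grounding stages wired together by hand.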