Hacker News
by
mountainriver
491 days ago
Perception is just 1-2 screenshots. A number of recent VLMs have a lot more GUI interaction data in their pretraining, which helps.
iiJDSii
491 days ago
Such as? Are they able to recognize arbitrary GUI elements from various desktop programs, web browsers, etc.?
mountainriver
491 days ago
Qwen2.5-VL seems to be the best right now in our tests.
UI-TARS by ByteDance also has a good amount of GUI pretraining.
Molmo is also very good at predicting coordinates.