The lack of benchmarks and the light demos have me skeptical... The methods seem interesting, and maybe they do unlock something novel, but it's odd to go into so much depth on the methods and leave so much wanting in the results.
It's probably much worse than VLMs on the computer use benchmarks out there. A lot of those benchmarks would be very hard to complete without the intelligence that arises from text pretraining.