by cch_ 2175 days ago

> To train the grounding model, we synthetically generate 295K single-step commands to UI actions, covering 178K different UI objects across 25K mobile UI screens from a public android UI corpus.

Sounds like a decent size training set.

> A Transformer with area attention obtains 85.56% accuracy for predicting span sequences that completely match the ground truth. The phrase extractor and grounding model together obtain 89.21% partial and 70.59% complete accuracy for matching ground-truth action sequences on the more challenging task of mapping language instructions to executable actions end-to-end.

85.56%, 89.21%, and 70.59% don't seem impressive to me. I may be oversimplifying, but why can't you just fine-tune a Transformer model to map sentences ("Now tap the top-right corner of the screen") to a fixed set of commands ("Tap(MAX_WIDTH, 0)")?

I've used Transformers before for classification and for other tasks, and they are quite powerful when you have "enough" data. 295K / 178K / 25K seems OK to me, but even if it's not, why not synthesize more?
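
For concreteness, here is a minimal sketch of what template-based synthesis of (sentence, command) pairs might look like. Everything in it is an assumption made for illustration: the templates, the `VERBS` and `REGIONS` tables, and the `Tap(...)` / `LongPress(...)` command strings are hypothetical and are not taken from the paper or its dataset.

```python
import itertools

# Hypothetical vocabulary for illustration only; not from the paper.
VERBS = {"tap": "Tap", "long-press": "LongPress"}
REGIONS = {
    "top-left": ("0", "0"),
    "top-right": ("MAX_WIDTH", "0"),
    "bottom-left": ("0", "MAX_HEIGHT"),
    "bottom-right": ("MAX_WIDTH", "MAX_HEIGHT"),
}
TEMPLATES = [
    "Now {verb} the {region} corner of the screen",
    "{verb} the {region} side of the screen",
]


def synthesize_pairs():
    """Return every (sentence, command) pair the templates can produce."""
    pairs = []
    for template, (verb, action), (region, (x, y)) in itertools.product(
        TEMPLATES, VERBS.items(), REGIONS.items()
    ):
        sentence = template.format(verb=verb, region=region)
        # Capitalize the first letter so both templates read as sentences.
        sentence = sentence[0].upper() + sentence[1:]
        command = f"{action}({x}, {y})"
        pairs.append((sentence, command))
    return pairs


pairs = synthesize_pairs()
```

Pairs like these could serve as training data for a sequence-to-sequence or classification model. The catch is that a fixed command set only covers positions and actions known in advance, whereas the quoted abstract describes grounding commands to specific UI objects on each screen.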