by cch_
2175 days ago
> To train the grounding model, we synthetically generate 295K single-step commands to UI actions, covering 178K different UI objects across 25K mobile UI screens from a public android UI corpus.

Sounds like a decent-sized training set.

> A Transformer with area attention obtains 85.56% accuracy for predicting span sequences that completely match the ground truth. The phrase extractor and grounding model together obtain 89.21% partial and 70.59% complete accuracy for matching ground-truth action sequences on the more challenging task of mapping language instructions to executable actions end-to-end.

85.56%, 89.21%, and 70.59% don't seem impressive to me. I may be oversimplifying, but why can't you just fine-tune a Transformer model to map sentences ("Now tap the right-top side of the screen") to a fixed set of commands ("Tap(MAX_WIDTH, 0)")? I've used Transformers before for classification and other tasks, and they are quite powerful when you have "enough" data. 295K / 178K / 25K seems OK to me, but even if it's not, why not synthesize more?
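To make the framing concrete, here is a toy sketch of "sentence in, one of a fixed set of commands out". It is not the paper's method and not an actual Transformer: a word-overlap scorer stands in for the fine-tuned classifier so the example runs with the standard library only. The command set, the label names, and the training sentences are all made up for illustration, and MAX_WIDTH / MAX_HEIGHT are just placeholder strings.

```python
from collections import Counter

# Hypothetical fixed command set (an assumption, not taken from the paper).
COMMANDS = {
    "tap_top_right": "Tap(MAX_WIDTH, 0)",
    "tap_top_left": "Tap(0, 0)",
    "tap_bottom_right": "Tap(MAX_WIDTH, MAX_HEIGHT)",
}

# Made-up (instruction, label) pairs standing in for a synthetic training set.
TRAIN = [
    ("now tap the right-top side of the screen", "tap_top_right"),
    ("press the top right corner", "tap_top_right"),
    ("tap the top left corner", "tap_top_left"),
    ("touch the upper left side of the screen", "tap_top_left"),
    ("tap the bottom right corner", "tap_bottom_right"),
    ("press the lower right side of the screen", "tap_bottom_right"),
]


def tokenize(text):
    """Lowercase, split hyphenated words, and split on whitespace."""
    return text.lower().replace("-", " ").split()


def train(pairs):
    """Collect per-label word counts (a stand-in for fine-tuning)."""
    counts = {}
    for text, label in pairs:
        counts.setdefault(label, Counter()).update(tokenize(text))
    return counts


def predict(counts, text):
    """Return the command whose label has the highest word-overlap score."""
    tokens = tokenize(text)

    def score(label):
        label_counts = counts[label]
        total = sum(label_counts.values())
        # Sum of relative word frequencies under this label.
        return sum(label_counts[token] / total for token in tokens)

    best_label = max(counts, key=score)
    return COMMANDS[best_label]


MODEL = train(TRAIN)
print(predict(MODEL, "Now tap the right-top side of the screen"))
```

With these toy sentences the last line prints `Tap(MAX_WIDTH, 0)`. The point is only the shape of the problem I had in mind: every instruction lands on one entry of a closed list of commands, which is what a classification head on a fine-tuned model would do.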