by shodai80 763 days ago
How do you know, for a specific webelement, what label it is associated with for a textbox or select?

For instance, I might want to tag as you did where elements are, but I still need an association with a label, quite often, to determine what the actual context of the textbox or select is.
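For reference, the markup-level association being asked about can be sketched as follows. This is only an illustration of the standard HTML mechanisms (a `for` attribute, or a `label` wrapping its control); the `FORM` fragment and the `label_map` helper are hypothetical and are not part of Tarsier.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed form fragment used only for illustration.
FORM = """
<form>
  <label for="user">Username</label>
  <input id="user" type="text"/>
  <label>Password <input id="pass" type="password"/></label>
  <input id="orphan" type="text"/>
</form>
"""

CONTROLS = ("input", "select", "textarea")

def label_map(root):
    """Return {control id: label text} for explicit (for=) and wrapping labels."""
    labels = {}
    for label in root.iter("label"):
        text = "".join(label.itertext()).strip()
        target = label.get("for")
        if target:
            # Explicit association: <label for="id">
            labels[target] = text
        else:
            # Implicit association: the control is nested inside the label.
            for control in label.iter():
                if control.tag in CONTROLS and control.get("id"):
                    labels[control.get("id")] = text
    return labels

labels = label_map(ET.fromstring(FORM))
print(labels)
```

The `orphan` input gets no entry, which is exactly the case the question is about: when the markup carries no association, only layout or visual proximity is left to go on.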

1 comments

Tarsier provides a mapping of element number (e.g. [23]) to XPath. So for any tagged item we're able to map it back to the actual element in the DOM, allowing for easy interaction with the elements on the page.
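A minimal sketch of what such a number-to-XPath mapping could look like, using only the Python standard library. The `PAGE` fragment and the `build_tag_map` and `resolve` helpers are assumptions made for illustration; this is not Tarsier's actual API or implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; kept well-formed so the stdlib XML parser accepts it.
PAGE = """
<html><body>
  <form>
    <input id="user" type="text"/>
    <input id="pass" type="password"/>
    <button id="go">Sign in</button>
  </form>
</body></html>
"""

INTERACTIVE = ("input", "select", "textarea", "button")

def build_tag_map(root):
    """Number interactive elements in document order and record an XPath for each."""
    tag_to_xpath = {}
    for elem in root.iter():
        if elem.tag in INTERACTIVE and elem.get("id"):
            tag_to_xpath[len(tag_to_xpath)] = f".//{elem.tag}[@id='{elem.get('id')}']"
    return tag_to_xpath

def resolve(root, tag_to_xpath, tag_id):
    """Map a tag number such as [1] back to the element it was assigned to."""
    return root.find(tag_to_xpath[tag_id])

root = ET.fromstring(PAGE)
tag_map = build_tag_map(root)
print(tag_map)
print(resolve(root, tag_map, 1).get("type"))
```

The sketch relies on `id` attributes to keep the XPath short; a real page would need a positional path for elements without one.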
I understand that. I assume you are tagging the node and building a basic XPath to the node/attribute from your tag ID. Understood. But how relevant is tagging a node when I have no idea what the node is actually for?

Example: given a simple login form, I may not know whether the label is above or below the username textbox. A password box would be below it. I have a hard time understanding the relevance of tagging without context.

Tagging is basically irrelevant to any automated task if we do not know the context. I am not trying to diminish your great work, don't get me wrong, but if you don't have context I don't see much relevance. You're doing something that is easily scripted with XPath templates, which I've done for over a decade.

This is where an LLM comes in. A typical pipeline would tag a page, transform it into a textual representation, and then pass it to an LLM, which would be able to reason about which field(s) you're looking for, much like a human.
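Roughly, that pipeline could be sketched like this. The element records, `render_page`, and `build_prompt` are hypothetical names chosen for illustration, and no model is actually called; the sketch only shows the shape of the text that would be handed to an LLM.

```python
def render_page(elements):
    """Render tagged elements as one line each, e.g. '[0] input: Username'."""
    return "\n".join(f"[{e['id']}] {e['kind']}: {e['text']}" for e in elements)

def build_prompt(page_text, goal):
    """Combine the textual page and the task into a single prompt string."""
    return (
        "Below is a web page rendered as text. "
        "Interactive elements are tagged with [number].\n\n"
        f"{page_text}\n\n"
        f"Task: {goal}\n"
        "Answer with the tag number only."
    )

# Assumed output of an earlier tagging step (hypothetical data).
elements = [
    {"id": 0, "kind": "input", "text": "Username"},
    {"id": 1, "kind": "input", "text": "Password"},
    {"id": 2, "kind": "button", "text": "Sign in"},
]

prompt = build_prompt(render_page(elements), "Which field takes the password?")
print(prompt)
```

Whatever number the model returns would then be looked up in the tag-to-XPath mapping to reach the real element.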
My point still stands. How do you augment data for an LLM when you know the context of a page? Do you go through every element and set up the data for an associated label? Do you use div scoping via offset parent through a script to generate the associated div (good approach, bad in real-life conditions though)? Do you convert the DOM to JSON or some other data structure? That means little because you still don't have context; you'd have to do it by hand every time the layout changes...and you would have to be very specific, which is a separate modeling problem as layouts are modified. What if the UI can be switched between different layout types, such as label above, label to the side, or label below, and this can be set dynamically?

What I am pointing out here is that even data modeling is mostly irrelevant unless you want to go through every page/permutation of a page...all the while hoping the layout isn't modified, or it's back to training all over again...which is downtime, and at some point you'll realize it's just better to store user-created XPaths, as it's quicker to update those than to retrain.

How do you reason with an LLM without going through any of the above? Automation cannot consistently have downtime for retraining; it's the antithesis of its purpose.

Let's not even get into shadow dom issues.

I am keying on your third bullet point on GitHub:

"How can you inform a text-only LLM about the page's visual structure?"

My questions suggest a gap in your awesome accomplishment.

We run OCR on the screenshot and convert it to whitespace-structured text, which is passed to the LLM. The images below might make it clearer for you:
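As a rough illustration of the idea (not Tarsier's actual implementation), OCR output can be treated as words with pixel coordinates and snapped onto a character grid, so that position on the screen survives as spaces and blank lines. The `layout_text` function, the cell sizes, and the sample words are all assumptions.

```python
def layout_text(words, cell_w=10, cell_h=20):
    """Place (x, y, text) OCR words on a character grid to preserve layout."""
    rows = {}
    for x, y, text in words:
        rows.setdefault(y // cell_h, []).append((x // cell_w, text))
    if not rows:
        return ""
    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            # Pad out to the target column; keep at least one space between words.
            pad = max(col - len(line), 1 if line else 0)
            line += " " * pad + text
        lines.append(line)
    return "\n".join(lines)

# Hypothetical OCR result: (x pixels, y pixels, recognized text).
words = [
    (0, 0, "Username"), (200, 0, "[0]"),
    (0, 40, "Password"), (200, 40, "[1]"),
]
print(layout_text(words))
```

In this sample each label ends up on the same text line as its tagged field, which is the kind of spatial cue a text-only model can pick up on.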

[1] https://github.com/reworkd/tarsier/blob/main/.github/assets/...

[2] https://github.com/reworkd/tarsier/blob/main/.github/assets/...

The screenshots provided do not show textboxes, selects, or other input nodes with labels. Show me text output where the labels associated with inputs are correct and I will be shocked.