Hacker News new | ask | show | jobs
by shodai80 763 days ago
My point still stands. How do you augment data for an LLM when you know the context of a page? Do you go through every element and setup the data for an associated label? Do you use div scoping via offset parent through a script to generate associated div (good approach, bad in real-life conditions though)? Do you convert the DOM to JSON or some data structure? That means little because you still don't have context, you'd have to do it by hand every time the layout changes...and you would have to be very specific, which is a separate problem for modeling as layouts are modified. What if the UI can be modified to have different layout types, such as label above, label to side, label below...where this can be dynamically set.

What I am pointing here is, even data modeling is mostly irrelevant unless you want to go through every page/permutation of a page...all the while hoping the layout isn't modified or back to training all over again...which is downtime, and at some point you'll realize its just better to store user created xpath's, as its quicker to update those than retrain.

How do you reason with an LLM without going through any of the above? Automation cannot consistently have downtime for retraining, it's the antithesis for its purpose.

Let's not even get into shadow dom issues.

I am keying on your third bullet point on Github:

"How can you inform a text-only LLM about the page's visual structure?"

My questions suggest a gap in your awesome accomplishment.

1 comments

We run OCR on the screenshot & convert it to whitespace-structured text, that is passed to the LLM. The images below might make it clearer for you:

[1] https://github.com/reworkd/tarsier/blob/main/.github/assets/...

[2] https://github.com/reworkd/tarsier/blob/main/.github/assets/...

Provided screenshots below do not show textboxes, selects, or other input nodes with labels. Show me text output with associated labels for inputs being correct and I will be shocked.
They do show textboxes with labels. From our readme:

"Keep in mind that Tarsier tags different types of elements differently to help your LLM identify what actions are performable on each element. Specifically:

[#ID]: text-insertable fields (e.g. textarea, input with textual type)

[@ID]: hyperlinks (<a> tags)

[$ID]: other interactable elements (e.g. button, select)

[ID]: plain text (if you pass tag_text_elements=True)"

Do you see the search boxes labeled [#4] and [#5] at the top? And before you say that the tag is on a different line from the placeholder text—yes, and our agent is smart enough to handle that minor idiosyncrasy. Are you shocked? :)

#4 and #5 are using placeholder attributes, and the text itself is contained within the node. Show me a simple form with labels external of an input node, then rearrange the labels to be some above and some below, and I will be shocked! No placeholders. Label must be its own 'text' node.

Edit: I do not intend to come off as negative or disparaging - I already discussed this with some OS projects I work on as well as internally at work. You guys did something great, and I am just trying to point out gaps that could take it from great to unbelievable.

This problem isn't that hard, screen readers had to handle this exact issues for years. Inaccessible websites where the labels aren't properly associated with their respective form fields do exist, but aren't that common.
Yes if they are associated with accessibility attributes (Aria). Many, many sites including massive B2B do not do this (a shame). So no, you are seriously minimizing the problem. This approach would also be architecturally poorly thought out - The solution needs to not depend upon aria, nor any other non-global approach (Which this solution does so far).

Everything shown to me so far has been a solvable problem by scripts/xpath template/creation logic. I've handled all of this for over 10 years with one script. When I see it finding everything and associating them with correct external labels, then they have something. Otherwise I am concluding it non-functional and a long since solved problem where ML is over-engineering.