by kevinstubbs 1050 days ago
> can they really look at a DOM tree and tell what it is/does

Yes, if you encode the DOM as a list of options for ChatGPT to choose from. In fact, I developed a proof of concept of this for a client (https://jarvys.ai/), although they seem to have since pivoted from automating just the browser to automating all software.

Well, if the DOM is all unstructured divs with no semantic information, can a human even tell what it means without applying the structural styling on the page?

A good example would be a misguided attempt at laying out a set of labels aligned with their values. Someone told this poor developer that <table> is bad, so they figure, hey, let's use CSS to lay it out. They build a dictionary of the key/value pairs, output all the keys into the first div as the first column, and then output all the values into the second div.

div - label 1 - label 2

div - value 1 - value 2

If there are 100 key/value pairs, it's going to be hard for a human to figure out which value belongs to the 76th label, and LLMs have proven to be very bad at indexing problems like that, so I wouldn't expect them to do any better.

(Not saying this wouldn't work in some cases, just couldn't be a general solution given the crap out there)

> if you encode the DOM as a list of options for ChatGPT to choose from

Not sure I understand this. Does it mean you have to pre-cook the DOM in a specific way? If yes, then isn’t the answer to my question “no”, as in “no, it can’t take any site and use it as is”?

You have to give GPT an objective, like "find an apartment in Florida" and then say something like "given the following options, which one would you interact with to get closer to your objective."

So if you assume that you start on google.com, then your options are like:
1.) Input with name "search", placeholder "search anything", value ""
2.) Button with label "I'm feeling lucky"
3.) Button with label "search"

Obviously, doing just one of these doesn't achieve the objective - GPT just needs to pick the option it thinks has the most "value" for completing the objective. If you repeat that enough times, it can actually accomplish the overall goal of the session.

I'm just giving a simplistic answer, and if you implemented only what I've written, then it's going to get stuck in a loop more often than not. But that's the gist of how you could encode the DOM into something that GPT can interpret and make decisions/take actions based on.
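
To make that concrete, here is a rough Python sketch of a single step: turn the interactive elements into a numbered list, build the prompt, and map the reply back to an element. The `Element` class, the helper names, and `ask_model` are all made up for illustration; `ask_model` stands in for whatever LLM call you use, and getting the elements out of a real DOM is left out entirely. It also does nothing about the getting-stuck-in-a-loop problem.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Element:
    """One interactive DOM node, reduced to what the model needs to see."""
    kind: str                                   # e.g. "Input" or "Button"
    attrs: Dict[str, str] = field(default_factory=dict)


def describe(el: Element) -> str:
    # Render attributes as: name "search", placeholder "search anything"
    parts = ", ".join(f'{key} "{value}"' for key, value in el.attrs.items())
    return f"{el.kind} with {parts}"


def build_prompt(objective: str, elements: List[Element]) -> str:
    # Number the options from 1 so the model can answer with a single number.
    options = "\n".join(
        f"{i}.) {describe(el)}" for i, el in enumerate(elements, start=1)
    )
    return (
        f"Objective: {objective}\n"
        "Given the following options, which one would you interact with "
        "to get closer to your objective? Answer with the option number.\n"
        f"{options}"
    )


def parse_choice(reply: str, n_options: int) -> Optional[int]:
    # Take the first number in the reply; reject anything out of range.
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    index = int(match.group()) - 1
    return index if 0 <= index < n_options else None


def choose(
    objective: str,
    elements: List[Element],
    ask_model: Callable[[str], str],            # hypothetical LLM call
) -> Optional[Element]:
    reply = ask_model(build_prompt(objective, elements))
    index = parse_choice(reply, len(elements))
    return None if index is None else elements[index]


# The google.com example from above, with a fake model that always says "1".
page = [
    Element("Input", {"name": "search", "placeholder": "search anything", "value": ""}),
    Element("Button", {"label": "I'm feeling lucky"}),
    Element("Button", {"label": "search"}),
]
picked = choose("find an apartment in Florida", page, lambda prompt: "1")
print(picked.kind)  # Input
```

A full session would perform the chosen action, re-read the page, and call `choose` again, ideally with some memory of previous steps so it can notice when it keeps picking the same thing.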

Remember HATEOAS? I have a feeling LLMs would excel at navigating proper REST (not "RESTful") APIs - HATEOAS is, in principle, just what you did here: providing a list of possible/useful next steps along with the response.

In fact, the problem of HATEOAS is exactly what LLMs seem to be good at - inferring the interface at runtime, from dynamically received metadata. This should even be easy to try in practice today - HATEOAS can be trivially mapped to the "function calling" feature of OpenAI's GPT-3.5/GPT-4 APIs.
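
A rough sketch of what that mapping could look like in Python. Assumptions: the API returns a HAL-style `_links` object (link relation mapped to `href` plus an optional `title`), and the listing URLs are invented. The output dicts are shaped like OpenAI function/tool definitions (name, description, JSON Schema parameters), but nothing here actually calls the API, so check the current API reference before relying on the exact shape.

```python
import re
from typing import Dict, List, Tuple


def links_to_tools(resource: dict) -> Tuple[List[dict], Dict[str, str]]:
    """Turn each hypermedia link into one tool definition.

    Returns the tool list plus a name -> href table, so a tool call from
    the model can be translated back into the request to make next.
    """
    tools: List[dict] = []
    routes: Dict[str, str] = {}
    for rel, link in resource.get("_links", {}).items():
        if rel == "self":
            continue                            # not a useful "next step"
        # Function names are limited to letters, digits, "_" and "-".
        name = re.sub(r"[^a-zA-Z0-9_-]", "_", rel)[:64]
        tools.append({
            "type": "function",
            "function": {
                "name": name,
                "description": link.get("title", f"Follow the '{rel}' link"),
                # Plain link-following takes no arguments in this sketch.
                "parameters": {"type": "object", "properties": {}},
            },
        })
        routes[name] = link["href"]
    return tools, routes


# Invented response from an imaginary apartment-listing API.
resource = {
    "_links": {
        "self": {"href": "/listings?state=FL"},
        "next": {"href": "/listings?state=FL&page=2", "title": "Next page of listings"},
        "listing:contact": {"href": "/listings/42/contact"},
    }
}
tools, routes = links_to_tools(resource)
print([t["function"]["name"] for t in tools])  # ['next', 'listing_contact']
print(routes["listing_contact"])               # /listings/42/contact
```

Each response regenerates the tool list, so the model only ever sees the actions the server says are valid from the current state, which is pretty much the point of HATEOAS.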

Got it, thanks!