| HN Mirror



	by shwaj 509 days ago
	Can you say more about using RL at inference time, ideally with a pointer to read more about it? This doesn’t fit into my mental model, in a couple of ways. The main way is right in the name: “learning” isn’t something that happens at inference time; inference is generating results from already-trained models. Perhaps you’re conflating RL with multistage (e.g. “chain of thought”) inference? Or maybe you’re talking about feeding the result of inference-time interactions with the user back into subsequent rounds of training? I’m curious to hear more.


stormfather 509 days ago

I wasn't clear. Model weights aren't changing at inference time. I meant that at inference time the model will output a sequence of thoughts and actions to perform tasks given to it by the user. For instance, to answer a question it will search the web, navigate through some sites, scroll, summarize, etc. You can model this as a game played by emitting a sequence of actions in a browser. RL is the technique you want for training this component. To scale this up you need a massive number of examples of sequences of actions taken in the browser, the outcome each led to, and a label for whether that outcome was desirable or not. I am saying that by recording users googling stuff and emailing each other for decades, Google has this massive dataset to train their RL-powered, browser-using agent. DeepSeek proving that simple RL can be cheaply applied to a frontier LLM and have reasoning organically emerge makes this approach more obviously viable.
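To make the "game played by emitting a sequence of actions" framing concrete, here is a minimal Python sketch of what one labeled example might look like, and how a single desirable/undesirable label could be spread back over the steps as a training signal. The `Step` and `Trajectory` types, the action names, the +1/-1 terminal reward, and the discount factor are all assumptions made for illustration, not anything Google or DeepSeek is known to use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    action: str    # hypothetical action name, e.g. "search", "click", "scroll"
    argument: str  # hypothetical payload, e.g. the query text or a link label


@dataclass
class Trajectory:
    task: str          # what the user asked for
    steps: List[Step]  # actions taken in the browser, in order
    success: bool      # label: was the outcome desirable?


def discounted_returns(traj: Trajectory, gamma: float = 0.9) -> List[float]:
    """Assign each step a return from a single terminal reward.

    Assumption: reward is +1 on success and -1 on failure, given only at
    the final step, so earlier steps receive a discounted share of it.
    """
    terminal = 1.0 if traj.success else -1.0
    n = len(traj.steps)
    return [terminal * gamma ** (n - 1 - i) for i in range(n)]


example = Trajectory(
    task="how much does a tire cost?",
    steps=[
        Step("search", "tire price"),
        Step("click", "first result"),
        Step("scroll", "down"),
    ],
    success=True,
)
returns = discounted_returns(example, gamma=0.5)
```

With `gamma=0.5` the final step gets the full reward and each earlier step gets half of the one after it. A real RL setup would feed per-step returns like these into a policy update; that part is left out here.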


shwaj 509 days ago

Makes sense, thanks. I wonder whether human web-browsing strategies are optimal for use in an LLM, e.g. given how much faster LLMs are at reading the webpages they find, compared to humans? Regardless, it does seem likely that Google’s dataset is good for something.


stormfather 502 days ago

Take this example:

A human googles "how much does a tire cost?"

They pick out a website from the search results, then navigate within it to the correct product page and maybe scroll until the price is visible on screen.

Google captures a lot of that data on third party sites. From Perplexity:

Google Analytics: If the website uses Google Analytics, Google can collect data about user behavior on that site, including page views, time on site, and user flow.

Google Ads: Websites using Google Ads may allow Google to track user interactions for ad targeting and conversion tracking.

Other Google Services: Sites implementing services like Google Tag Manager or using embedded YouTube videos may provide additional tracking opportunities.

So you can imagine that Google has a kajillion training examples that go: search query (which implies task) -> pick webpage -> actions within webpage -> user stops (success), or user backs off site/tries different query (failure)
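As a rough sketch of that labeling rule in Python: the session is a list of event strings, the first one is the query, and the last one decides success or failure. The event names (`query:`, `stop`, `back_to_results`, `new_query`) and the flat log format are hypothetical, made up for this example; real analytics logs would look different and the signal would be much noisier.

```python
from typing import List, Tuple

# Hypothetical terminal events. "stop" means the user stayed on the page;
# the failure events mean they backed off the site or tried a new query.
SUCCESS_END = {"stop"}
FAILURE_END = {"back_to_results", "new_query"}


def label_session(events: List[str]) -> Tuple[str, List[str], bool]:
    """Turn one recorded session into (query, actions, success).

    Assumes the first event is "query:<text>" and the last event is one
    of the terminal events above; anything else is rejected.
    """
    if len(events) < 2 or not events[0].startswith("query:"):
        raise ValueError("session must start with a query and have an ending")
    query = events[0][len("query:"):]
    actions, final = events[1:-1], events[-1]
    if final in SUCCESS_END:
        return query, actions, True
    if final in FAILURE_END:
        return query, actions, False
    raise ValueError(f"unknown terminal event: {final!r}")


session = [
    "query:how much does a tire cost?",
    "pick:example.com",
    "nav:product_page",
    "scroll",
    "stop",
]
query, actions, success = label_session(session)
```

Each labeled tuple is one training example of the kind described: the query implies the task, the actions are the trajectory, and the boolean is the outcome label.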

You can imagine that even if an AI agent is super efficient, it still needs to learn how to formulate queries, pick out a site to visit, and navigate through that site to perform tasks. Google's dataset is perfect for this: huge and unparalleled.
