| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by pbkhrv 133 days ago

> How parallelism changed the agent’s research strategy > With a single GPU, the agent is stuck doing greedy hill-climbing: try one thing, check the result, pick a direction, try the next thing. With 16 GPUs, the strategy shifts. ...skip... 12 experiments in a single 5-minute wave. This makes it much harder to get stuck in local optima and much easier to find interaction effects between parameters.

The agent can theoretically come up with a protocol to run those same 12 experiments one-by-one and only then decide which branch to explore next - which I think would lead to the same outcome?

But in this case, it just happened to have stumbled on this particular outcome only because it didn't get a chance to execute a greedy strategy after the first 1 or 2 results.

Worse experiment design + parallelism = better experiment design + serialized execution ?

2 comments

gwern 133 days ago

> The agent can theoretically come up with a protocol to run those same 12 experiments one-by-one and only then decide which branch to explore next - which I think would lead to the same outcome?

At least in theory, adaptiveness should save samples and in this case, compute. (As noted, you can always turn the parallel into serial and so the serial approach, which gets information 'from the future', should be able to meet or beat any parallel approach on sample-efficiency.)

So if the batch only matches the adaptive search, that suggests that the LLM is not reasoning well in the adaptive setting and is poorly exploiting the additional information. Maybe some sort of more explicit counterfactual reasoning/planning over a tree of possible outcomes?

link

rfw300 133 days ago

Yeah, assuming there's no active monitoring during the training runs, you can trivially give the agent an abstraction which turns "1 GPU" into "16 GPUs" that just so happens to take 16x the wall-clock time to run.

link

rfw300 133 days ago

In fact, looking at the blog post, the agent orchestrating 16 GPUs is half as efficient as the agent using 1 GPU in GPU-time. Since it uses 16 GPUs to reach the same result as 1 GPU in 1/8 of the time.

link