

by hevalon 87 days ago

Author here. The search algorithm was the easy part. The LLM already encodes domain knowledge from ML papers: it knows that learning rate warmup helps with transformers and that batch size and learning rate are coupled. It converged on the winning GRPO config by iteration 1. Grid search needed 8 iterations.

The hard part was per-iteration GPU isolation. A botched run that leaves stale optimizer state or corrupted weights in memory will poison the next iteration. Each iteration needs a fresh CUDA runtime, a fresh filesystem, and fresh memory. No state leaks. That's where most of the engineering went: ephemeral containers with TTL-based cleanup, one A100 per iteration, torn down after metrics are emitted.
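
To make the isolation idea concrete, here is a rough sketch. It is not the code from the repo, and it swaps containers for something much weaker: a fresh OS process in a fresh scratch directory with a hard timeout standing in for the TTL. The names run_isolated_iteration, WORKER, gpu_id and ttl_seconds are all made up for illustration, and the worker just echoes its config instead of training.

```python
import json
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for a training entry point. A real worker would
# train a model; this one only reports what it sees, as one JSON line.
WORKER = r"""
import json, os, sys
config = json.loads(sys.argv[1])
print(json.dumps({"lr": config["lr"],
                  "cwd_files": os.listdir("."),
                  "pid": os.getpid()}))
"""


def run_isolated_iteration(config, gpu_id="0", ttl_seconds=60.0,
                           worker_source=WORKER):
    """Run one iteration in a fresh process and a fresh scratch directory.

    A new process means no interpreter or driver state is inherited from the
    previous iteration. The scratch directory is deleted when the block
    exits, and the timeout plays the role of a TTL.
    """
    with tempfile.TemporaryDirectory(prefix="iter-") as scratch:
        # Pin the child to one device (assumption: one GPU per iteration).
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)
        try:
            proc = subprocess.run(
                [sys.executable, "-c", worker_source, json.dumps(config)],
                cwd=scratch,
                env=env,
                capture_output=True,
                text=True,
                timeout=ttl_seconds,  # child is killed if it overruns
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout"}

    lines = proc.stdout.strip().splitlines()
    if proc.returncode != 0 or not lines:
        return {"status": "failed", "stderr": proc.stderr[-500:]}
    # Convention assumed here: the last stdout line is the metrics JSON.
    return {"status": "ok", "metrics": json.loads(lines[-1])}


if __name__ == "__main__":
    print(run_isolated_iteration({"lr": 3e-4}))
```

A process boundary only covers part of what's described above. It gives each iteration its own address space and working directory, but it does not isolate the rest of the filesystem or guarantee anything about the GPU beyond what process exit frees, which is why containers are the stronger choice.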

Happy to answer questions. Code: https://github.com/one-covenant/autoresearch-rl