by hevalon
87 days ago
Author here. The search algorithm was the easy part. The LLM already encodes domain knowledge from ML papers: it knows learning rate warmup helps with transformers, and that batch size and learning rate are coupled. It converged on the winning GRPO config by iteration 1. Grid search needed 8 iterations.

The hard part was per-iteration GPU isolation. A botched run that leaves stale optimizer state or corrupted weights in memory will poison the next iteration. Each iteration needs a fresh CUDA runtime, fresh filesystem, fresh memory. No state leaks. That's where most of the engineering went: ephemeral containers with TTL-based cleanup, one A100 per iteration, torn down after metrics are emitted.

Happy to answer questions. Code: https://github.com/one-covenant/autoresearch-rl
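
For readers who want a feel for the isolation pattern, here is a minimal sketch. It is not the code from the linked repo. It assumes a fresh OS process per iteration as a stand-in for an ephemeral container, a throwaway scratch directory as a stand-in for a fresh filesystem, and a wall-clock timeout as a stand-in for the TTL. `run_isolated_iteration`, `TRIAL_SNIPPET`, and the metrics format are hypothetical names made up for illustration. A real setup would launch a container and a real training entry point instead.

```python
import json
import os
import shutil
import subprocess
import sys
import tempfile

# Hypothetical stand-in for a training entry point. A real one would train
# a model and print its metrics as a JSON line on stdout.
TRIAL_SNIPPET = (
    "import json, os, sys; "
    "cfg = json.loads(sys.argv[1]); "
    "print(json.dumps({'lr': cfg['lr'], "
    "'gpu': os.environ.get('CUDA_VISIBLE_DEVICES')}))"
)


def run_isolated_iteration(config, gpu_id="0", ttl_seconds=60):
    """Run one trial in a fresh process and scratch dir; return metrics or None."""
    # Fresh scratch directory per iteration (assumption: the trial only
    # writes inside its working directory).
    workdir = tempfile.mkdtemp(prefix="iter_")
    # Pin the child to a single GPU. A new process also starts with a new
    # CUDA context, so no optimizer state or weights survive in memory.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)
    try:
        proc = subprocess.run(
            [sys.executable, "-c", TRIAL_SNIPPET, json.dumps(config)],
            cwd=workdir,
            env=env,
            capture_output=True,
            text=True,
            timeout=ttl_seconds,  # TTL: the child is killed if it overruns
            check=True,           # a non-zero exit counts as a botched run
        )
        # Metrics are taken from the last line the trial printed.
        return json.loads(proc.stdout.strip().splitlines()[-1])
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # A failed or hung trial yields no metrics and leaves nothing behind.
        return None
    finally:
        # Tear down the scratch directory whether the trial succeeded or not.
        shutil.rmtree(workdir, ignore_errors=True)
```

Calling `run_isolated_iteration({"lr": 0.5}, gpu_id="0")` returns the metrics dict the trial printed, or `None` if the trial crashed or exceeded the TTL. A process boundary alone does not isolate the filesystem or driver state the way a container does, so treat this as an illustration of the lifecycle (create, run, emit metrics, tear down) rather than a replacement for containers.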