Hacker News new | ask | show | jobs
by wholehog 567 days ago
We're talking 16 GPUs for ~6 hrs for inference, and 48 hrs for pre-training. This is not an exorbitant amount of compute.

A GPU costs $1-2/hr on the cloud market. So, ~$100-200 for inference, and ~$800-1600 for pre-training, which amortizes across chips. Cloud prices are an upper bound -- most CS labs will have way more than this available on premises.

In an industry context, these costs are completely dwarfed by the rest of the chip design process. (For context, the licensing costs alone for most commercial EDA software are in the millions of dollars.)

2 comments

You are correct. For commercial use, the GPUs used for training and fine-tuning aren't a problem financially. However, if we wanted to rigorously benchmark AlphaChip against simulated annealing or other floorplanning algorithms, we have to afford the same compute and runtime budget to each algorithm. With 16 GPUs running for 6 hours, you could explore a huge placement space using any algorithm, and it isn't clear if RL will outperform the other ones. Furthermore, the runtime of AlphaChip as shown in the Nature paper and ISPD was still significantly greater than Cadence's concurrent macro placer (even after pre-training, RL requires several hours of fine-tuning on the target problem instance). Arguably, the runtime could go down with more GPUs, but at this point, it is unclear how much value is coming from the policy network / problem embedding vs the ability to explore many potential placements.
> However, if we wanted to rigorously benchmark AlphaChip against simulated annealing or other floorplanning algorithms, we have to afford the same compute and runtime budget to each algorithm.

The Nature authors already presented such a study in their appendix:

"To make comparisons fair, we ran 80 SA experiments sweeping different hyperparameters, including maximum temperature (10^−5, 3 × 10^−5, 5 × 10^−5, 7 × 10^-5, 10^−4, 2 × 10^−4, 5 × 10^−4, 10^−3), maximum SA episode length (5 × 10^4, 10^5) and seed (five different random seeds), and report the best results in terms of proxy wirelength and congestion costs in Extended Data Table 6"

Non-paywalled Nature article link: rdcu.be/cmedX

You're saying that if the other methods were given the equivalent amount of compute they might be able to perform as well as AlphaChip? Or at least that the comparison would be fairer?

Are the other methods scalable in that way?

Existing mixed-placement algorithms depend on hyperparameters, heuristics, and initial states / randomness. If afforded more compute resources, they can explore a much wider space and in theory come up with better solutions. Some algorithms like simulated annealing are easy to modify to exploit arbitrarily more compute resources. Indeed, I believe the comparison of AlphaChip to alternatives would be fairer if compute resources and allowed runtime were matched.

In fact, existing algorithms such as naive simulated annealing can be easily augmented with ML (e.g. using state embeddings to optimize hyperparameters for a given problem instance, or using a regression model to fine-tune proxy costs to better correlate with final QoR). Indeed, I strongly suspect commercial CAD software is already applying ML in many ways for mixed-placement and other CAD algorithms. The criticism against AlphaChip isn't about rejecting any application of ML to EDA CAD algorithms, but rather the particular formulation they used and objections to their reported results / comparisons.

That sounds like future work for simulated annealing fans to engage in, quite honestly, rather than something that needs to be addressed immediately in a paper proposing an alternative method. The proposed method accomplished what it set out to do, surpassing current methods; others are free to explore different hyperparameters to surpass the quality again... This is, ultimately, why we build benchmark tasks: if you want to prove you know how to do it better, one is free to just go do it better instead of whining about what the competition did or didn't try on one's behalf.
Yes, they are. The other approaches usually look like simulated annealing, which has several hyperparameters that control how much computing is used and improve results with more compute usage.
See my comment above - the Nature authors already did this, and tried a huge hyperparameter sweep for SA, and RL still won.

See appendix of the Nature article: rdcu.be/cmedX

I understand and have read the article. Running 80 experiments with a crude form of simulated annealing is at most 0.0000000001% of the effort that has been spent on making that kind of hill climb work well by traditional EDA vendors. That is also an in-sample comparison, where I would believe the Google thing pre-trained on Google chips would do well, while it might have a harder time with a chip designed by a third party (further from its pre-training).

The modern versions of that hill climb also use some RL (placing and routing chips is sort of like a game), but not in the way Jeff Dean wants it to be done.

> it might have a harder time with a chip designed by a third party (further from its pre-training).

Then they could pre-train on chips that are in-distribution for that task.

See also section 3.1 of their response paper, where they describe a comparison against commercial autoplacers: https://arxiv.org/pdf/2411.10053

The Google internal paper by Chatterjee and the Cheng et al paper from UCSD made such comparisons with Simulated Annealing. The annealer in the Nature paper was easy to improve. When given the same time budget, the improved annealer produced better solutions than AlphaChip. When you give both more time, SA remains ahead. Just read published papers.
The UCSD paper didn't run the Nature method correctly, so I don't see how you can draw this conclusion.

From Jeff's tweet:

"In particular the authors did no pre-training (despite pre-training being mentioned 37 times in our Nature article), robbing our learning-based method of its ability to learn from other chip designs, then used 20X less compute and did not train to convergence, preventing our method from fully learning even on the chip design being placed."

As for Chatterjee's paper, "We provided the committee with one-line scripts that generated significantly better RL results than those reported in Markov et al., outperforming their “stronger” simulated annealing baseline. We still do not know how Markov and his collaborators produced the numbers in their paper."

h100 GPU instances are multiple orders of magnitude more expensive.
Not true, H100s cost $2-3/GPU/hr on the open market.
Yes, they even do at $1/GPU/hr. However, 8xH100 cluster at full utilization is ~8kWh of electricity and costs almost ~0.5M$. 16xH100 cluster is probably 2x of that. How many years before you break-even at ~24$/GPU/day income?
Did you really not understand rethoric nature of my question and assumed that I can't do 1st grade primary school math?
Who cares? That's someone else's problem. I just pay 2-3$/hr and the H100s are usable
You should care about counterparty risks. If your business model depends on unsustainable 3rd party prices powered by VC largesse and unrealizable dreams of dominance, the very least you can do is plan for the impending reckoning, after which GPU proces will be determined by costs.
Look, I understand that some people are short-sighted and can hardly think out of the box and that is totally fine by me. I don't judge you for being that so I kindly ask you not to judge my question. Learn to give some benefit of the doubt.
H100 GPUs are more or less similar in price/performance. It is 2-3x more expensive per hour for 2-3x higher performance.