Hacker News new | ask | show | jobs
by vighneshiyer 571 days ago
I have published an addendum to an article I wrote about AlphaChip (https://vighneshiyer.com/misc/ml-for-placement/) at the very bottom that addresses this rebuttal from Google and the AlphaChip algorithm in general.

In short, I think the Nature authors have made some reasonable criticisms regarding the training methodology employed by the ISPD authors, but the extreme compute cost and runtime of AlphaChip still makes it non-competitive with commercial autofloorplanners and AutoDMP. Regardless, I think the ISPD authors owe the Nature authors an even more rigorous study that addresses all their criticisms. Even if they just try to evaluate the pre-trained checkpoint that Google published, that would be a useful piece of data to add to the debate.

2 comments

In the conclusion of the article, you said: "While I concede that there are things the ISPD authors could have done better, their conclusion is still sound. The Nature authors do not address the fact that CMP and AutoDMP outperform CT with far less runtime and compute requirements."

One key argument in the rebuttal against the ISPD article is that the resources used in their comparison were significantly smaller. To me, this point alone seems sufficient to question the validity of the ISPD work's conclusions. What are your thoughts on this?

Additionally, I noticed that the neutral tone of this comment is quite a departure from the strongly critical tone of your article toward the AlphaChip work (words like "arrogance", "disdain", "hyperbole", "belittling", "hostile" for AlphaChip authors, as opposed to "excellent" for a Synopsys VP.) Could you share where this difference in tone originates?

> One key argument in the rebuttal against the ISPD article is that the resources used in their comparison were significantly smaller. To me, this point alone seems sufficient to question the validity of the ISPD work's conclusions. What are your thoughts on this?

I believe this is a fair criticism, and it could be a reason why the ISPD Tensorboard shows divergence during training for some RTL designs. The ISPD authors provide their own justification for their substitution of training time for compute resources in page 11 of their paper (https://arxiv.org/pdf/2302.11014).

I do not think it changes the ISPD work's conclusions however since they demonstrate that CMP and AutoDMP outperform CT wrt QoR and runtime even though they use much fewer compute resources. If more compute resources are used and CT becomes competitive wrt QoR, then it will still lag behind in runtime. Furthermore, Google has not produced evidence that AlphaChip, with their substantial compute resources, outperforms commercial placers (or even AutoDMP). In the recent rebuttal from Google (https://arxiv.org/pdf/2411.10053), the only claim on page 8 says Google VLSI engineers preferred RL over humans and commercial placers on a blind study conducted in 2020. Commercial mixed placers, if configured correctly, have become very good over the past 4 years, so perhaps another blind study is warranted.

> Additionally, I noticed that the neutral tone of this comment is quite a departure from the strongly critical tone of your article

I will openly admit my bias is against the AlphaChip work. I referred to the Nature authors as 'arrogant' and 'disdainful' with respect to their statement that EDA CAD engineers are just being bitter ML-haters when they criticize the AlphaChip work. I referred to Jeff Dean as 'belittling' and 'hostile' and using 'hyperbole' with respect to his statements against Igor Markov, which I think is unbecoming of him. I referred to Shankar as 'excellent' with respect to his shrewd business acumen.

Thank you for your thoughtful response. Acknowledging potential biases openly in a public forum is never easy, and in my view, it adds credibility to your words compared to leaving such matters as implicit insinuations.

That said, on page 8, the paper says that 'standard licensing agreements with commercial vendors prohibit public comparison with their offerings.' Given this inherent limitation, what alternative approach could have been taken to enable a more meaningful comparison between CT and CMP?

So I'm not sure what Google is referring to here. As you can see in the ISPD paper (https://vlsicad.ucsd.edu/Publications/Conferences/396/c396.p...) on page 5, they openly compare Cadence CMP with AutoDMP and other algorithims quantitatively. The only obfuscation is with the proprietary GF12 technology, where they can't provide absolute numbers, but only relative ones. Comparison against commercial tools is actually a common practice in academic EDA CAD papers, although usually the exact tool vendor is obfuscated. CAD tool vendors have actually gotten more permissive about sharing tool data and scripts in public over the past few years. However, PDKs have always been under NDAs and are still very restrictive.

Perhaps the Cadence license agreement signed by a corporation is different than the one signed by a university. In such a case, they could partner with a university. But I doubt their license agreement prevents any public comparison. For example, see the AutoDMP paper from NVIDIA (https://d1qx31qr3h6wln.cloudfront.net/publications/AutoDMP.p...) where on page 7 they openly benchmark their tool against Cadence Innovus. My suspicion is they wish to keep details about the TPU blocks they evaluated under tight wraps.

The UCSD paper says "We thank ... colleagues at Cadence and Synopsys for policy changes that permit our methods and results to be reproducible and sharable in the open, toward advancement of research in the field." This suggests that there may have been policies restricting publication prior to this work. It would be intriguing to see if future research on AlphaChip could receive a similar endorsement or support from these EDA companies.
Cadence in particular has been quite receptive to allowing academics and researchers to benchmark new algorithms against their tools. They have also been quite permissive with letting people publish TCL scripts for their tools (https://github.com/TILOS-AI-Institute/MacroPlacement/tree/ma...) that in theory should enable precise reproduction of results. From my knowledge, Cadence has been very permissive from 2022 onwards, so while Google's objections to publishing data from CMP may have been valid when the Nature paper was published, they are no longer valid today.
We're talking 16 GPUs for ~6 hrs for inference, and 48 hrs for pre-training. This is not an exorbitant amount of compute.

A GPU costs $1-2/hr on the cloud market. So, ~$100-200 for inference, and ~$800-1600 for pre-training, which amortizes across chips. Cloud prices are an upper bound -- most CS labs will have way more than this available on premises.

In an industry context, these costs are completely dwarfed by the rest of the chip design process. (For context, the licensing costs alone for most commercial EDA software are in the millions of dollars.)

You are correct. For commercial use, the GPUs used for training and fine-tuning aren't a problem financially. However, if we wanted to rigorously benchmark AlphaChip against simulated annealing or other floorplanning algorithms, we have to afford the same compute and runtime budget to each algorithm. With 16 GPUs running for 6 hours, you could explore a huge placement space using any algorithm, and it isn't clear if RL will outperform the other ones. Furthermore, the runtime of AlphaChip as shown in the Nature paper and ISPD was still significantly greater than Cadence's concurrent macro placer (even after pre-training, RL requires several hours of fine-tuning on the target problem instance). Arguably, the runtime could go down with more GPUs, but at this point, it is unclear how much value is coming from the policy network / problem embedding vs the ability to explore many potential placements.
> However, if we wanted to rigorously benchmark AlphaChip against simulated annealing or other floorplanning algorithms, we have to afford the same compute and runtime budget to each algorithm.

The Nature authors already presented such a study in their appendix:

"To make comparisons fair, we ran 80 SA experiments sweeping different hyperparameters, including maximum temperature (10^−5, 3 × 10^−5, 5 × 10^−5, 7 × 10^-5, 10^−4, 2 × 10^−4, 5 × 10^−4, 10^−3), maximum SA episode length (5 × 10^4, 10^5) and seed (five different random seeds), and report the best results in terms of proxy wirelength and congestion costs in Extended Data Table 6"

Non-paywalled Nature article link: rdcu.be/cmedX

You're saying that if the other methods were given the equivalent amount of compute they might be able to perform as well as AlphaChip? Or at least that the comparison would be fairer?

Are the other methods scalable in that way?

Existing mixed-placement algorithms depend on hyperparameters, heuristics, and initial states / randomness. If afforded more compute resources, they can explore a much wider space and in theory come up with better solutions. Some algorithms like simulated annealing are easy to modify to exploit arbitrarily more compute resources. Indeed, I believe the comparison of AlphaChip to alternatives would be fairer if compute resources and allowed runtime were matched.

In fact, existing algorithms such as naive simulated annealing can be easily augmented with ML (e.g. using state embeddings to optimize hyperparameters for a given problem instance, or using a regression model to fine-tune proxy costs to better correlate with final QoR). Indeed, I strongly suspect commercial CAD software is already applying ML in many ways for mixed-placement and other CAD algorithms. The criticism against AlphaChip isn't about rejecting any application of ML to EDA CAD algorithms, but rather the particular formulation they used and objections to their reported results / comparisons.

That sounds like future work for simulated annealing fans to engage in, quite honestly, rather than something that needs to be addressed immediately in a paper proposing an alternative method. The proposed method accomplished what it set out to do, surpassing current methods; others are free to explore different hyperparameters to surpass the quality again... This is, ultimately, why we build benchmark tasks: if you want to prove you know how to do it better, one is free to just go do it better instead of whining about what the competition did or didn't try on one's behalf.
Yes, they are. The other approaches usually look like simulated annealing, which has several hyperparameters that control how much computing is used and improve results with more compute usage.
See my comment above - the Nature authors already did this, and tried a huge hyperparameter sweep for SA, and RL still won.

See appendix of the Nature article: rdcu.be/cmedX

I understand and have read the article. Running 80 experiments with a crude form of simulated annealing is at most 0.0000000001% of the effort that has been spent on making that kind of hill climb work well by traditional EDA vendors. That is also an in-sample comparison, where I would believe the Google thing pre-trained on Google chips would do well, while it might have a harder time with a chip designed by a third party (further from its pre-training).

The modern versions of that hill climb also use some RL (placing and routing chips is sort of like a game), but not in the way Jeff Dean wants it to be done.

The Google internal paper by Chatterjee and the Cheng et al paper from UCSD made such comparisons with Simulated Annealing. The annealer in the Nature paper was easy to improve. When given the same time budget, the improved annealer produced better solutions than AlphaChip. When you give both more time, SA remains ahead. Just read published papers.
The UCSD paper didn't run the Nature method correctly, so I don't see how you can draw this conclusion.

From Jeff's tweet:

"In particular the authors did no pre-training (despite pre-training being mentioned 37 times in our Nature article), robbing our learning-based method of its ability to learn from other chip designs, then used 20X less compute and did not train to convergence, preventing our method from fully learning even on the chip design being placed."

As for Chatterjee's paper, "We provided the committee with one-line scripts that generated significantly better RL results than those reported in Markov et al., outperforming their “stronger” simulated annealing baseline. We still do not know how Markov and his collaborators produced the numbers in their paper."

h100 GPU instances are multiple orders of magnitude more expensive.
Not true, H100s cost $2-3/GPU/hr on the open market.
Yes, they even do at $1/GPU/hr. However, 8xH100 cluster at full utilization is ~8kWh of electricity and costs almost ~0.5M$. 16xH100 cluster is probably 2x of that. How many years before you break-even at ~24$/GPU/day income?
Did you really not understand rethoric nature of my question and assumed that I can't do 1st grade primary school math?
Who cares? That's someone else's problem. I just pay 2-3$/hr and the H100s are usable
You should care about counterparty risks. If your business model depends on unsustainable 3rd party prices powered by VC largesse and unrealizable dreams of dominance, the very least you can do is plan for the impending reckoning, after which GPU proces will be determined by costs.
Look, I understand that some people are short-sighted and can hardly think out of the box and that is totally fine by me. I don't judge you for being that so I kindly ask you not to judge my question. Learn to give some benefit of the doubt.
H100 GPUs are more or less similar in price/performance. It is 2-3x more expensive per hour for 2-3x higher performance.