by denhaus 516 days ago
I share your viewpoint on this, that DFT is a poor proxy model for ML to approximate.

However, the problem with the alternative (experimental data, for example) is that differences in synthesis procedures, measurement parameters, sample impurities, and even experimental apparatus mean that training datasets of even modest size are insanely heterogeneous. So models are either trained to predict differences between materials that are really due to experimental discrepancies, trained on very small datasets, or must have a slew of post-hoc physics-based adjustments added to get reasonable numbers.

Higher order computational methods (including simply more intensive, non-high-throughput DFT) are accurate but expensive, as you know. Some of them have systematic error in the way DFT does, and are essentially based on user choice of (many!) parameters. Charged defect calculations are one example of this. Finding large (>10^4) training sets computed with similar parameters is difficult. “ML” for these kinds of calculations usually consists of, like, calculating a hundred (or 10) crystals within a narrow chemical system, doing a linear regression on one variable (e.g., valence of the cation on some site), and getting numbers within +/- 10% of a “true” number.
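To make that last point concrete, here is a minimal sketch of the kind of single-variable fit I mean. The valences and target values are made-up placeholder numbers, not real calculations, and the function names are just for illustration.

```python
# Sketch: ordinary least squares on ONE descriptor (e.g. cation valence)
# for a handful of crystals in a narrow chemical system.
# All numbers below are hypothetical placeholders, not real data.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def max_relative_error(xs, ys, slope, intercept):
    """Largest |prediction - target| / |target| over the dataset."""
    return max(abs((slope * x + intercept) - y) / abs(y) for x, y in zip(xs, ys))

# Hypothetical descriptor: valence of the cation on some site.
valence = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
# Hypothetical "true" property values from an expensive method (arbitrary units).
target = [1.05, 0.95, 2.1, 1.9, 3.2, 2.8, 4.1, 3.9, 5.2, 4.8]

slope, intercept = fit_line(valence, target)
worst = max_relative_error(valence, target, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} worst relative error={worst:.1%}")
```

With only ten points and one descriptor, a fit like this says nothing outside the chemical system it was fitted on, which is the limitation I'm getting at.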

GGA/meta-GGA DFT, on the other hand, can be applied at a sufficient fidelity to get real(ish) numbers in a homogeneous way across huge numbers of crystals. So you are correct: in many cases you are predicting an approximate number for a property. But if we know the approximate number is wrong due to systematic error (and we can, in some situations), we can apply corrections or higher order methods to get the right(ish) answer. Moreover, it’s highly dependent on which property you’re interested in. Some properties, like band gap, can be off by a lot. Others, like formation energy, can be calculated pretty accurately even with run-of-the-mill GGA DFT. Elastic moduli are generally ok.
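As a rough illustration of what applying a correction for systematic error can look like, here is a sketch that assumes the simplest possible case: the DFT value is off from a reference value by a roughly constant shift. Real corrections are property-specific and usually more involved; the numbers and function names here are hypothetical.

```python
# Sketch: estimate a constant systematic offset on a small calibration set,
# then apply it to new DFT values. Assumes the error is a plain shift,
# which is the simplest case and often not true in practice.
from statistics import mean

def fit_offset(dft_values, reference_values):
    """Mean signed difference (reference - DFT) over the calibration set."""
    return mean(r - d for d, r in zip(dft_values, reference_values))

def apply_offset(dft_values, offset):
    """Shift every DFT value by the fitted offset."""
    return [d + offset for d in dft_values]

def mean_abs_error(predicted, reference):
    return mean(abs(p - r) for p, r in zip(predicted, reference))

# Hypothetical calibration data (arbitrary units): cheap DFT vs. trusted reference.
calib_dft = [1.0, 2.0, 3.0, 4.0]
calib_ref = [1.5, 2.4, 3.6, 4.5]

offset = fit_offset(calib_dft, calib_ref)
mae_before = mean_abs_error(calib_dft, calib_ref)
mae_after = mean_abs_error(apply_offset(calib_dft, offset), calib_ref)

# Apply the same shift to materials that have no reference value yet.
corrected_new = apply_offset([2.5, 3.5], offset)
print(f"offset={offset:.2f} MAE before={mae_before:.2f} MAE after={mae_after:.2f}")
```

This only works because the underlying DFT numbers were computed in a homogeneous way; a shift fitted on one set of calculation parameters would not transfer to another.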

In summary, approximating DFT with ML is just the least messy way to get real-ish answers across a large number of materials. Of course, there’s a point at which low-fidelity DFT calculations are generally (1) so cheap and (2) so inaccurate that having an ML model approximate them is pointless. Most large DBs of materials now use good enough DFT that the numbers they calculate are not pointless for ML to learn from.

In the future, I think models trained on large numbers of DFT calculations will have to be adapted to narrow sets of higher fidelity calculations by tuning, much like you can fine-tune a generalized LLM to do specific things. That might be where ML can actually bring real value to materials design.

Also, it’s worth considering that synthesizing novel materials can be insanely difficult. So 1 in 4 is not bad in my opinion.