


	by boomanaiden154 1004 days ago
	One of the biggest things that seems to be holding back ML in compilers right now is dataset size. This model was trained on only a gigabyte of source code, more than 30% of it synthetic. Even on much simpler models, there have been massive performance gains from just throwing more data at them. Some experimentation with the original MLGO inlining model on a much bigger data corpus doubled the code-size wins. LLMs have also been shown to perform better the more data they are fed [1].

	[1] https://arxiv.org/abs/2203.15556

2 comments

isaacfung 1004 days ago

Quality matters just as much as quantity.

LIMA: Less Is More for Alignment https://arxiv.org/abs/2305.11206

AlpaGasus: Training A Better Alpaca with Fewer Data https://arxiv.org/abs/2307.08701

Textbooks Are All You Need II: phi-1.5 technical report https://arxiv.org/abs/2309.05463


uoaei 1004 days ago

What's holding them back is provable correctness.

It's possible, nay, mandatory to constrain the outputs of the model at each step of generation in order to guarantee that a given structure or grammar is adhered to. If you can fine-tune the model with these constraints in place you can offload a lot of the effort that the LLM otherwise has to perform in comprehending correctness so it has more capacity for generating good content. To be sure, quality and quantity of data are important, but it's all too easy to introduce subtle bugs that take years to tease out if you don't adhere to the right constraints.
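
To make the per-step constraint concrete, here is a minimal sketch, not taken from the paper or from any real library. It assumes a toy "grammar" (balanced parentheses) and uses random scores as a stand-in for model logits; `allowed_tokens`, `toy_scores`, and `constrained_generate` are hypothetical names. At each step the illegal tokens are masked out before the highest-scoring one is picked, so the output is well-formed regardless of what the scores say.

```python
import random

VOCAB = ["(", ")", "<eos>"]

def allowed_tokens(depth, remaining):
    """Tokens that keep the output completable to a balanced string.

    depth: number of currently unclosed "(".
    remaining: output slots left, including the one being filled now.
    """
    allowed = []
    # Opening is legal only if enough slots remain to close everything later.
    if remaining - 1 >= depth + 1:
        allowed.append("(")
    # Closing is legal only if something is open.
    if depth > 0:
        allowed.append(")")
    # Stopping is legal only when nothing is left open.
    if depth == 0:
        allowed.append("<eos>")
    return allowed

def toy_scores(rng):
    # Assumption: random numbers stand in for real model logits.
    return {tok: rng.random() for tok in VOCAB}

def constrained_generate(max_len=12, seed=0):
    rng = random.Random(seed)
    out, depth = [], 0
    while len(out) < max_len:
        scores = toy_scores(rng)
        legal = allowed_tokens(depth, max_len - len(out))
        # Mask: only legal tokens compete for the argmax.
        tok = max(legal, key=lambda t: scores[t])
        if tok == "<eos>":
            break
        out.append(tok)
        depth += 1 if tok == "(" else -1
    return "".join(out)

def is_balanced(s):
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0
```

Every string `constrained_generate` returns passes `is_balanced`, whatever the seed. Note that this only guarantees syntactic structure; it says nothing about whether generated code means the same thing as its input, which is the harder part of correctness.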


boomanaiden154 1004 days ago

Most of the work in this space is not focused on neural compilation (having an ML model perform the transformation or the entire compilation), but on replacing heuristics or phase ordering, where the issue of correctness falls back onto the compiler. For pretty much exactly the reasons you mentioned, neural compilation isn't really tractable.

This specific paper focuses on phase ordering, which should guarantee correctness, assuming the underlying transformations are correct. They do train the model to perform compilation, but as an auxiliary task.
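
A toy illustration of why phase ordering keeps correctness on the compiler's side (a sketch under stated assumptions, not the paper's setup): the "IR" below is a nested-tuple arithmetic expression, the three passes are hypothetical stand-ins for real optimization passes, and exhaustive search in `best_order` stands in for a learned policy. Whatever order is chosen, the result computes the same value; only the size changes.

```python
import itertools

def evaluate(e, x):
    """Evaluate an expression: an int, the variable "x", or (op, lhs, rhs)."""
    if e == "x":
        return x
    if isinstance(e, int):
        return e
    op, a, b = e
    a, b = evaluate(a, x), evaluate(b, x)
    return a + b if op == "add" else a * b

def size(e):
    """Node count, used here as a stand-in for code size."""
    return 1 if not isinstance(e, tuple) else 1 + size(e[1]) + size(e[2])

def rewrite(e, rule):
    """Apply a local rewrite rule once, bottom-up, over the whole tree."""
    if isinstance(e, tuple):
        e = (e[0], rewrite(e[1], rule), rewrite(e[2], rule))
    return rule(e)

# Each rule below preserves the value of the expression it rewrites.
def fold_constants(e):
    if isinstance(e, tuple) and isinstance(e[1], int) and isinstance(e[2], int):
        return evaluate(e, 0)  # no variables involved, so x is irrelevant
    return e

def drop_add_zero(e):
    if isinstance(e, tuple) and e[0] == "add":
        if e[1] == 0:
            return e[2]
        if e[2] == 0:
            return e[1]
    return e

def drop_mul_one(e):
    if isinstance(e, tuple) and e[0] == "mul":
        if e[1] == 1:
            return e[2]
        if e[2] == 1:
            return e[1]
    return e

PASSES = {
    "fold_constants": fold_constants,
    "drop_add_zero": drop_add_zero,
    "drop_mul_one": drop_mul_one,
}

def run_pipeline(e, order):
    for name in order:
        e = rewrite(e, PASSES[name])
    return e

def best_order(e):
    # Assumption: brute force stands in for the model that picks the ordering.
    return min(itertools.permutations(PASSES),
               key=lambda order: size(run_pipeline(e, order)))

# x * (1 + 0) + (2 + -2), which is just x
PROGRAM = ("add", ("mul", "x", ("add", 1, 0)), ("add", 2, -2))
```

Running `run_pipeline(PROGRAM, order)` for each permutation of the passes gives expressions of different sizes, but `evaluate` agrees with the original on every input. A bad ordering costs optimization quality, not correctness, provided each pass is itself correct.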
