by boomanaiden154
1004 days ago
One of the biggest things that seems to be holding back ML in compilers right now is dataset size. This model was trained on only a gigabyte of source code, over 30% of it synthetic. Even on much simpler models, there have been massive performance gains from just throwing more data at them. Some experimentation with the original MLGO inlining model on a much bigger data corpus doubled the code-size wins. LLMs have also been shown to perform better the more data they are fed [1].

[1] https://arxiv.org/abs/2203.15556

LIMA: Less Is More for Alignment https://arxiv.org/abs/2305.11206
AlpaGasus: Training A Better Alpaca with Fewer Data https://arxiv.org/abs/2307.08701
Textbooks Are All You Need II: phi-1.5 technical report https://arxiv.org/abs/2309.05463