by viscanti 665 days ago
Because of how trivial that step is, it's likely pretty easy to just take lots of code and minify it. Then you have the training data you need to learn to generate full code from minified code. If your goal is to generate additional useful training data for your LLM, it could make sense to actually do that.

I suspect, but definitely do not know, that all the coding aspects of LLMs work something like this. It's a fundamentally different problem from writing a paragraph, which should never be the same as any other paragraph. Seems to me that coding is a bit more like the game of Go, where an absolute score can be used to guide learning. Seed the system with lots and lots of real-world leetcode examples, then train it to write tests, and now you have a closed loop that can train itself.
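To make the "absolute score" idea concrete, here's a rough sketch of scoring candidate solutions by the fraction of test cases they pass. To be clear, this is only an illustration of the scoring step, not a claim about how any LLM is actually trained; `pass_rate`, `cases`, and the two candidate functions are names made up for the example.

```python
from typing import Callable

def pass_rate(candidate: Callable, cases: list[tuple[tuple, object]]) -> float:
    """Fraction of (args, expected) cases the candidate gets right.

    Hypothetical helper: a crashing candidate simply fails that case.
    """
    if not cases:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # Treat any runtime error as a failed case.
            pass
    return passed / len(cases)

# Made-up test cases for an "add two numbers" problem.
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0), ((2, 2), 4)]

def good_candidate(a, b):
    return a + b

def buggy_candidate(a, b):
    return a - b  # only right when b == 0

# The score gives an objective way to rank candidates.
best = max([buggy_candidate, good_candidate], key=lambda f: pass_rate(f, cases))
```

In a real loop the candidates and the tests would both come from a model, and the score would feed back into training somehow; that part is not shown here.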
If you're able to generate minified code from all the code you can find on the internet, you end up with a very large training set. Of course, in some scenarios you won't know what the original variable names were, but you would expect to be able to get something very usable out of it. Anywhere you can deterministically generate new and useful training data like this, you would expect it to be used.
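For what it's worth, generating the pairs could look roughly like the sketch below. It assumes JavaScript-like source and uses a deliberately naive regex "minifier" (`toy_minify`, a made-up helper); a real pipeline would run an actual minifier, and this toy version will mangle string literals that contain `//` or the renamed identifiers.

```python
import re

def toy_minify(source: str, rename: dict[str, str]) -> str:
    """Illustration only: strip // comments, shorten names, squeeze whitespace."""
    # Drop line comments (naive: also hits '//' inside string literals).
    text = re.sub(r"//[^\n]*", "", source)
    # Rename whole-word identifiers, applied one after another in dict order.
    for long_name, short_name in rename.items():
        text = re.sub(rf"\b{re.escape(long_name)}\b", short_name, text)
    # Collapse every run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove the spaces around common punctuation and operators.
    return re.sub(r"\s*([{}()=+\-*/;,<>])\s*", r"\1", text)

def make_pair(source: str, rename: dict[str, str]) -> dict[str, str]:
    """One training example: minified text as input, original code as target."""
    return {"input": toy_minify(source, rename), "target": source}

source = (
    "function addTotals(first, second) {\n"
    "  // sum them\n"
    "  return first + second;\n"
    "}\n"
)
pair = make_pair(source, {"addTotals": "a", "first": "b", "second": "c"})
print(pair["input"])  # function a(b,c){return b+c;}
```

Since the whole thing is deterministic, you can run it over as much code as you can collect and the original is always available as the target.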