Hacker News new | ask | show | jobs
by sfink 1080 days ago
> Throwing machine learning at the estimation problem is a possibility, but I fear that it would still be difficult to come up with an estimation function that is (a) comprehensible and (b) accurate across many different machines.

Nick: you know me, I love to overcomplicate things.

I don't know if you would consider this to be machine learning, but the size estimation seems like a good fit[1] for multiple linear regression. You have some independent variables (number of MIR nodes, number of basic blocks) and an error in a dependent measure you care about (time to compile), so it seems worth running through a linear regression model to get rough weights for your independent variables. (eg 7 × number of BBs + number of MIR nodes or whatever.) Sure, there's the risk of overfitting to your machine and number of cores / amount of RAM / whatever. But if the size estimation is really what's throwing off your other experiments, it seems worth spending some time to come up with other possible inputs (independent variables) and cutting that error down.

You might consider using something other than squared error for the regression, since at this point you know a fair bit about the effects of different errors.

If you happen to know that the compiler does stuff that is quadratic in number of basic blocks, you might even want to consider a nonlinear term like the square of the number of basic blocks. (So instead of `time ~ nodes + BBs`, you'd have `time ~ nodes + BBs**2` or `time ~ nodes + BBs + BBs**2` but still just do linear regression.) The square of an independent variable could also be seen as a way of modeling "this factor matters, but only when it gets big". You can play with other powers like square roots, which are more like "the value of this factor matters, but large values aren't that different from each other".

But the more of those games you play, the more you're going to overfit, so you probably ought to extract weights from a subset of your data and withhold the rest for testing. It'll still overfit (to your computer, for example), but at least it'll be better.

Last thought: you've broken down compilation time into its different phases before, right? I'm thinking of things like register allocation. Is there an easy-to-compute independent variable related to that, like maybe "number of MIR instructions in regions of the control flow graph that have more than 4 live variables" or something? Probably not, that's probably only visible later. Number of loops / back edges in the AST? I don't know enough about how this stuff works.

[1] Pun not intended. Until I made it.