Hacker News new | ask | show | jobs
by thaumaturgy 3078 days ago
MySQL is perfectly capable of calculating a linear regression for you, btw. In my case, I needed to be able to estimate trends from sparse time series data. Here's how you do that:

    SELECT
        @a_count := avg(count) as mean_count,
        @a_weeks := avg(`week`) as mean_weeks,
        @covariance := (sum(`week` * `count`) - sum(`week`) * sum(`count`) / count(`week`)) / count(`week`) as covariance,
        @stddev_count := stddev(`count`) as stddev_count,
        @stddev_week := stddev(`week`) as stddev_week,
        @r := @covariance / (@stddev_count * @stddev_week) as r,
        @slope := @r * @stddev_count / @stddev_week as slope,
        @y_int := @a_count - (@slope * @a_weeks) as y_int,
        @this_week_no := timestampdiff(WEEK, (select min(`date`) from dataset), curdate()) as this_week_no,
        @predicted := round(greatest(1, @y_int + (@slope * @this_week_no))) as predicted
    
    FROM (SELECT timestampdiff(WEEK, (select min(`date`) from dataset), `date`) as week, count(date) as count FROM dataset group by WEEK(date)) series;
   
I had to figure out how to translate the math into SQL, now you don't have to.

This performs well enough to be able to crunch tens of millions of rows of data in "reasonable time" on a wimpy VPS.

1 comments

Yes, but why? Right tool for the right job and all that...
Normally the reason you write anything beyond trivial SQL is because you only have a small amount of code to run and lots of data to run it over. Pushing the code to the data is more efficient than pulling the data to the code.

The latter might be conceptually cleaner (though it's debatable, relational is a fairly nice programming model and a lot more consistent and well-founded than object orientation, for one), but it's seldom optimal.

Three orders of magnitude or more speedups are not unexpected by pushing the code to the data.

You're basically spot on here. I have a bunch of rows that need to have trend data crunched and updated pretty frequently. Putting this into MySQL cut the server load quite a bit, and it isn't so much business logic that I feel bad about doing it.
I would guess access to the data is a decent reason to try. No need to pump the data elsewhere, if this is available.

I share your doubt that this is worth it, to be clear.

In this case the amount of data and the frequency with which it needs to be updated made handling this in MySQL more practical. I otherwise would've had to have some process querying and updating the records outside of MySQL and it turns out that that puts a whole lot more load on the server than if I just ask MySQL to do it.

And besides, what happened to the hacker ethos? "Because I can" should be justification enough. :-)

Is is cool to hear that it works! And just because one doubts something, doesn't mean one shouldn't necessarily try it. Especially if I'm the one doubting. :)