by beder 4985 days ago
A few comments on the stats themselves:

1. It looks like the total number of tickets in 2009 and 2010 is about 10% that of 2011. I'm guessing that there weren't actually ten times as many tickets given in 2011, so either the data is incomplete (as the author suggested), or there was a typo. If the data is incomplete, I'd suggest normalizing to the 2011 totals; otherwise, the 3-year average doesn't make much sense.

2. The scale of the "normalized" difference graphs (showing "Actual - Expected") looks off. The formula given is

(actual - expected) / total * 1000 = normalized number

If this is the case, then since the scale goes to about +/- 5, the differences are very small (less than 1% away from what you'd expect!). But from eyeballing the data, that doesn't seem right.
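To make that arithmetic concrete, here is a rough sketch that inverts the formula to see what a plotted value of 5 means. The function names and the yearly total of 100,000 are made up for illustration; they are not from the post.

```python
def normalized(actual, expected, total):
    """The formula as quoted: (actual - expected) / total * 1000."""
    return (actual - expected) / total * 1000


def difference_for(normalized_value, total):
    """Invert the formula to recover (actual - expected) from a plotted value."""
    return normalized_value * total / 1000


# Hypothetical yearly total, chosen only to keep the arithmetic easy.
total = 100_000

# A plotted value of 5 corresponds to this raw difference in tickets...
max_diff = difference_for(5, total)

# ...which is this fraction of the whole year's tickets.
share_of_total = max_diff / total

print(max_diff, share_of_total)
```

Under these assumed numbers, a value of 5 on the graph is a difference of 500 tickets, or 0.5% of the year's total.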

In any case, a better scale might be to assume the data is normally distributed and express the differences in numbers of standard deviations. (See, e.g., http://en.wikipedia.org/wiki/Normal_distribution#Standard_de...)
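A minimal sketch of that idea, assuming you have the per-day differences (actual - expected) in a list. The `z_scores` helper and the sample numbers are hypothetical, and using the population standard deviation is my choice here, not something the post specifies.

```python
from statistics import mean, pstdev


def z_scores(differences):
    """Express each difference as a number of standard deviations from the mean."""
    mu = mean(differences)
    sigma = pstdev(differences)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("all differences are identical; z-scores are undefined")
    return [(d - mu) / sigma for d in differences]


# Made-up per-day differences (actual - expected), for illustration only.
diffs = [12, -8, 3, -15, 20, -5, -7]

for d, z in zip(diffs, z_scores(diffs)):
    print(f"{d:+4d} tickets -> {z:+.2f} standard deviations")
```

The result no longer depends on the yearly total, so years with very different ticket counts can share one axis.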


1) Yes, there were far fewer tickets in the data for 2009 and 2010 than 2011.

2) The fact that the normalized numbers were so small was very unintuitive to me at first too, but the important thing to realize is that in that formula, you're dividing the difference, not the actual value for the given day, by the total number for the year. When I first ran those numbers I was so confused by the output. I was originally thinking that I'd normalize it by saying "X percent of the total for that year," but since I was working with the differences, and not the actual values, the numbers were too small a fraction.

Either that or I made some huge mistake in my logic...

WRT the use of standard deviations, like I said in the post, I'm not a statistician, so I wasn't really sure what the canonical way of normalizing data was. I pretty much just made one up. Thanks for pointing that out. I'll look into using standard deviation for the next one. :)

In that case, you should divide by "expected", so you get a percentage difference for each day. (Normalizing by the total for the year doesn't make sense: imagine that there were 300 days per month instead of 30. Your numbers would be divided by 10 again, but the data you want to visualize would stay the same.)
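Here is a quick sketch of that thought experiment, with made-up numbers: one day that is 20% above an expected 100 tickets, in a "year" of 360 days versus one of 3600 days. The function names are hypothetical.

```python
def by_total(actual, expected, total):
    """The post's normalization: difference over the yearly total, times 1000."""
    return (actual - expected) / total * 1000


def by_expected(actual, expected):
    """Percentage difference from the expected value for that day."""
    return (actual - expected) / expected * 100


# Made-up day: 120 tickets where 100 were expected.
actual, expected = 120, 100

# Same per-day data, but the year is 10x longer (300 days per month, not 30).
total_short = expected * 360
total_long = expected * 3600

short = by_total(actual, expected, total_short)
long_ = by_total(actual, expected, total_long)
pct = by_expected(actual, expected)

print(short, long_, pct)
```

With these numbers the total-based value shrinks by a factor of 10 when the year gets longer, while the expected-based value stays at 20% either way.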

I made an update using standard deviation and Z-score charts instead of my homebrew normalization function (see "Update"): http://robert.io/posts/4.html

Great point! Dividing by expected would have been a much nicer solution. I think the method I used still works though, just not as nicely. Am I wrong?

Z-Score is your friend for this kind of data normalization.

http://en.wikipedia.org/wiki/Standard_score

Thanks. It looks like I need to do a little self-study in statistics.