by christopheraden 4090 days ago
Try graphing y = -1 * log(x) and imposing a limit on the upper bound of x and you'll get close to what he has. Perhaps that's the angle he's coming from. He provided the fitted equation further down in the featured article, and the log term does have a negative coefficient, plus an intercept term.

The graph he plots looks like the data fits the Exponential Distribution: http://en.wikipedia.org/wiki/Exponential_distribution

3 comments

It screams exponential at me, especially given a potential underlying model where every sick person has a .x probability of getting every individual they work with sick. As the number of individuals goes up with no change in the rate of sickness from outside the office, the number of sick people should go up exponentially (as with any multiplicative process).

Edit: actually I think I completely misinterpreted the data. Now that I look more closely, I have no idea what the X axis is for. I assumed it was number of employees in a company whose sick time was somehow represented by bar height, but is it just a list of all employees sorted by how much sick time was taken?

If so, this is probably an example of a normal distribution with an exponential tale.

I'm pretty sure it's just a list of employees sorted by how much sick time is taken, so the X-axis is an "employee index number".

More interesting (and pertinent when trying to find a pattern in this data) would be a histogram for sick time taken. Trying to fit a curve to the graph as-is isn't useful, because the X-axis doesn't represent anything meaningful.

This is my thought as well. So you fit a curve to a sorted list of each employee's sick time. Does this give you any additional insight? So it follows a log function. Does that mean anything?

If you do a histogram and fit a function, you get something that could conceivably be interpreted as a probability distribution function, and you might be able to say something about predicting the sick time a given employee will take and the uncertainty of your prediction.
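A minimal sketch of the histogram idea, assuming nothing about the real data: the `sick_hours` list below is synthetic (drawn from an exponential with a made-up mean of 16 hours), and `histogram_density` and the 8-hour bin width are hypothetical choices for illustration. It bins the values and normalizes the bars so their areas sum to 1, which is what lets the result be read as a rough probability density.

```python
import random
from collections import Counter

def histogram_density(values, bin_width):
    """Bin non-negative values and normalize so the bar areas sum to 1."""
    counts = Counter(int(v // bin_width) for v in values)
    n = len(values)
    # Map each bin's left edge to its density (count / (n * bin_width)).
    return {b * bin_width: c / (n * bin_width) for b, c in sorted(counts.items())}

# Synthetic stand-in for per-employee sick time; NOT the data from the article.
rng = random.Random(1)
sick_hours = [rng.expovariate(1 / 16.0) for _ in range(2000)]

density = histogram_density(sick_hours, bin_width=8.0)
for left_edge, d in density.items():
    print(f"{left_edge:6.1f}  {d:.4f}")
```

With real data you would replace `sick_hours` with the actual per-employee totals; the X-axis then means "hours of sick time" rather than an employee index.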

But I honestly don't see what visualizing the data the way the post does, or fitting a function to it, contributes. Hope that doesn't violate the new no negativity policy of HN.

tale = tail. oy.

Since there is confusion in the sibling comments, I want to explain how y = -k ln(x) + m fits in with the exponential distribution. I am going to be a little sloppy with closed and open intervals and round a little.

x is the rank of the employee. Let N be the number of employees. We can generate a new observation from the model by generating an x' uniformly between 1 and N and inserting it into the formula for y. Then p' = x'/N is a number between 1/N and 1, or if we round, between 0 and 1.

The generated observation will be distributed according to (convince yourself by looking at the submission's graph)

P(y' > y) = p, where y = -k ln(Np) + m

or, solving for p, p = exp((m - y)/k) / N. So

P(y' < y) = 1 - exp((m - y)/k) / N

This is the exponential distribution, shifted to start at m - k ln(N) and with scale k.
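If you want to convince yourself numerically, here is a rough simulation sketch. It draws ranks x' uniformly between 1 and N, maps them through y = -k ln(x) + m, and compares the empirical P(y' < y) with 1 - exp((m - y)/k) / N, keeping the scale k explicit. The values of N, k and m are made up for illustration, not the fitted coefficients from the article, and the function names are hypothetical.

```python
import math
import random

def simulate_sick_time(n_employees, k, m, n_draws, seed=0):
    """Draw ranks uniformly on [1, N] and map them through y = -k*ln(x) + m."""
    rng = random.Random(seed)
    return [-k * math.log(rng.uniform(1, n_employees)) + m for _ in range(n_draws)]

def model_cdf(y, n_employees, k, m):
    """P(y' < y) = 1 - exp((m - y)/k) / N, a shifted exponential CDF."""
    return 1.0 - math.exp((m - y) / k) / n_employees

def empirical_cdf(samples, y):
    """Fraction of simulated observations strictly below y."""
    return sum(s < y for s in samples) / len(samples)

# Hypothetical parameters, chosen only so the numbers are easy to read.
N, K, M = 500, 10.0, 70.0
samples = simulate_sick_time(N, K, M, n_draws=100_000)

for y in (10.0, 20.0, 40.0):
    print(y, round(empirical_cdf(samples, y), 3), round(model_cdf(y, N, K, M), 3))
```

The two columns should agree up to sampling noise and the same sloppiness about 1/N versus 0 mentioned above.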

I also thought that it's exponential, and as the other comment says it makes more sense.

I don't have the data, but I superimposed an Excel graphic over the original graphic: http://imgur.com/L5f6CIa

The logarithmic fit looks better.