by breck 5469 days ago
Imagine you threw a single stone into the desert and asked your friend to go find it. It would be hard. Now imagine you threw 2 stones into the desert and asked your friend to go find them. It is twice as hard to find both stones as it is to find 1 stone. Imagine you threw 3 stones. It is 3 times as hard to find all 3 stones as it is to find 1 stone.

Now imagine that numbers are built out of stones. To "build" a 1, you only need 1 stone. But to "build" a 2, you need 2 stones. Thus, if you wanted to write a 3, you would have to go into the desert and find 3 stones. It's 3x as hard, and so you'd expect people to "build" 1/3 as many 3's as 1's, 1/5 as many 5's as 1's, and so on. Just as you'd expect there to be a lot more single-story buildings than skyscrapers. It's easier to build a single-story building.

Thus, the distribution is exactly what you'd expect. While it doesn't actually take stones to build numbers, we don't write the number 3 unless we have 3 of something. Unless you are lying. Which is why this is a great method of detecting fraud.

UPDATE: What do I mean when I say "3 times as hard"?

Imagine the desert is a rectangle of 10 squares. Kind of like a mancala board or a ladder on the ground. You start by stepping in square 1, and to get to square 10 you have to step through each square.

If there is only 1 rock, what are the odds that you'll have to walk all 10 steps to find it? This is the same thing as asking what are the odds that this rock is in square 10. The answer is 1/10 or 10%.

Now, if there are 3 rocks, what are the odds that you'll have to step into all 10 squares? Well, what are the odds that there's a rock in the last square? 26.1%, or approximately 3x as hard. It's interesting that it's not exactly 3x as hard, it's 2.61x as hard. Which makes the data in the OP seem even more logical since you'd expect 30.8% 1's given 11.8% 3's--the 32.62% actual number is not that far off.
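The arithmetic above can be checked with a minimal sketch. It assumes each rock lands in one of the 10 squares uniformly at random, and it tries two placement models: rocks may share a square, or may not. The function names and both models are illustrative assumptions, not something the comment specifies.

```python
from fractions import Fraction

def p_last_square_shared(squares, rocks):
    """Chance that at least one rock sits in the last square when rocks
    land independently and may share a square (assumed model)."""
    return 1 - Fraction(squares - 1, squares) ** rocks

def p_last_square_exclusive(squares, rocks):
    """Chance that a rock sits in the last square when no two rocks may
    share a square (assumed model)."""
    return Fraction(rocks, squares)

one_rock = p_last_square_shared(10, 1)
three_shared = p_last_square_shared(10, 3)
three_exclusive = p_last_square_exclusive(10, 3)

# Probabilities of having to walk all 10 squares
print(float(one_rock), float(three_shared), float(three_exclusive))
# How many times "harder" 3 rocks are than 1 rock under each model
print(float(three_shared / one_rock), float(three_exclusive / one_rock))
```

Under the independent, shared-square model the chance is 1 - (9/10)^3, or about 27%, which is close to but not the same as the 26.1% quoted above, so that figure may rest on a different placement assumption. The exclusive model gives exactly 3x.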

2 comments

Finding both stones when you threw 2 is less than twice as hard as finding the single stone when you threw only 1.

Suppose you are the guy looking for the stones. There are two stones in the desert. Everything being random but equal, you are twice as likely to run into a stone when there are two than when there is only one stone in the desert. Once you find the first stone, it is equally difficult to find the second stone as it is to find only one stone at the beginning (if you treat "finding a stone" as independent events where you don't learn about the location of subsequent stones).

So while the idea is interesting, the analogy is poor. I much prefer the Wikipedia explanation, which is similar to yours but much more logically rigorous: http://en.wikipedia.org/wiki/Benfords_law#Outcomes_of_expone...

Response to update: Now I feel that you are muddling your analogy. Can multiple stones occupy the same square? How is it appropriate to equate/compare "the number of squares you walk through in order to pick up all the stones" to "the number of times a digit should show up"? I apologize, but your illustration is now completely lost on me.

> Can multiple stones occupy the same square?

You're right, I should have clarified. If multiple stones could not occupy the same square, the odds would remain as I first explained them (3x, etc.). I think in my stones analogy and in real life, stones should be able to occupy the same square. In fact, there should be a positive correlation (i.e., given that there's a rock in this square, the odds of a second rock being there go up).

> How is it appropriate to equate/compare "the number of squares you walk through in order to pick up all the stones" to "the number of times a digit should show up"?

Coming across 3 units of a quantity is 3x as hard as coming across 1 unit. When we write numbers, we are either:

1) writing a truthful description of how many units we see/own/ate/taste/touch etc. (I ate 2 bagels, I earned $5, I ran 10 miles.)

2) lying.

By "lying", I'm including things like writing a novel. Maybe a better word is "imagining". With numbers, we are either writing down true observations or we are imagining them. It's just as easy to "imagine" $9 million in your bank account as it is to "imagine" $1 million, while truthfully finding $9 million in your bank account is a lot more difficult :). This is why Benford's law doesn't apply for "imagined" numbers. By using Benford's law, you can quickly classify a number set into either "real" or "imagined".

Ah. But there's the rub. What I find unintuitive about Benford's law is the non-random distribution of the most significant digit regardless of base or unit of measure. You propose that it's the "largeness" of a number that enforces Benford's law. While that may be in some ways true, it does not explain the transparency to base or unit of measure. You ate 2 bagels? I ate 4 half-bagels. You earned $5? I earned ¥600 motherfucker! You ran 10 miles? I ran 3bf3e6800 micrometers, in base-16!
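The unit and base conversions in this objection can be checked mechanically. The sketch below is illustrative only: the helper `leading_digit` and the constant name are assumptions introduced here, and the only outside fact used is that 1 mile is exactly 1609.344 m.

```python
def leading_digit(n, base=10):
    """Most significant digit of a positive integer n written in `base`."""
    while n >= base:
        n //= base
    return n

# 1 mile = 1609.344 m exactly, so 1,609,344,000 micrometers
MICROMETERS_PER_MILE = 1_609_344_000

ten_miles_um = 10 * MICROMETERS_PER_MILE

# The same distance written in base 16
print(format(ten_miles_um, "x"))

# Leading digit of "10 miles" in base 10 vs. the same run in micrometers, base 16
print(leading_digit(10), leading_digit(ten_miles_um, 16))
```

The same physical quantity starts with a different digit depending on the unit and base, which is why an explanation of Benford's law has to survive this kind of rescaling.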

Again, your train of thought is not necessarily wrong, but I still find the Wikipedia explanation much more robust and illustrative. I hope this is where we can agree to disagree.

The reason it only applies to the most significant digit is that I can say for certain that quantities in the 1_ family will appear ~2x as often as quantities in the 2_ family. However, I can't say whether numbers ending in 1 are more common than numbers ending in 6, because although 11 occurs more than any number higher than it, it makes up a minuscule proportion of the numbers ending in 1, and 16 occurs more than 21, 31, etc., so there's no clear way to predict which number will occur most in any digit but the most significant.

Thanks for offering your views. My analogy may be wrong or weak and maybe there is a better one to be found.

Every base is base 10. There's your answer.

I'm sorry, but your mathematical reasoning is very muddled, and the pattern you predict is wrong.

According to Benford's law the odds of a leading 1 are 1.709511291351... times the odds of a leading 2. This isn't the factor of 2 you thought it should be. The odds of a leading 1 are 2.409420839653... times the odds of a leading 3. This isn't the factor of 3 you thought it should be.
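These ratios follow from the standard statement of Benford's law, P(d) = log10(1 + 1/d) for a leading digit d. A minimal sketch to reproduce them (the function name `benford` is just an illustrative choice):

```python
import math

def benford(d):
    """Benford's law probability that the leading decimal digit is d (1..9)."""
    return math.log10(1 + 1 / d)

# Full first-digit distribution
for d in range(1, 10):
    print(d, round(benford(d), 4))

# Ratios of a leading 1 to a leading 2, and of a leading 1 to a leading 3
ratio_1_to_2 = benford(1) / benford(2)
ratio_1_to_3 = benford(1) / benford(3)
print(ratio_1_to_2, ratio_1_to_3)
```

The two ratios come out near 1.71 and 2.41, rather than the 2 and 3 that the stones analogy predicts.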

Yes, I know that it is fun to try to figure things out for yourself. But it is essential to learn when you're headed down the wrong path. That lets you correct your misconceptions before they cement and lead to severely wrong impressions of how to do things. Your whole desert/rock analogy? That's a wrong path.

Thank you, very interesting contributions to the conversation.

It looks to me, though, that my line of reasoning (note that I said 2.6x to be more precise; I used 3x initially to simplify) more closely matches the data than the numbers you provided.

I gave you exact numbers, not approximations from a small data set. Unless you match the numbers that I gave exactly, the numbers you give won't match what will be found in large datasets.

I understand that the law is a formula that can generate exact numbers. However, as experience shows, almost nothing generates these exact numbers. Almost nothing follows the formula exactly.

I think my explanation is about what causes the law to occur: not what Benford's law is, but why it occurs. I know about the log scale and the picture on Wikipedia. It's neat, but I don't think it reveals the underlying cause. I think the cause could be simply that it is about 2x easier to find 1 unit than 2 units, 10 units than 20 units, 1000 units than 2000 units.

What can I say? Your explanation is wrong. It generates wrong numbers. And gives little to no insight as to why Benford's law works.

Benford's law will hold approximately for any set of numbers with the property that they are distributed over many orders of magnitude, from a distribution which doesn't change much if you multiply by a random number in some range.

An example of such a set of numbers is the set of numbers that come up in intermediate calculations involving a lot of different numbers. (This explains the logarithm books where the phenomenon was first noticed.)

Another example is the numbers you see coming out of any sort of self-similar phenomenon. As fractals show, self-similar behavior is ubiquitous. As a result, numbers like the lengths of rivers, the heights of hills, and the sizes of cities all tend to follow Benford's law.
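The property described above (values spread over many orders of magnitude, with a distribution that barely changes under multiplication) can be illustrated with a small sketch. It assumes an idealized, deterministic sample that is evenly spread on a log scale across six decades, and rescales it by 3.28084 (meters to feet) as an arbitrary example factor; the sample, the factor, and the helper names are all assumptions made for illustration.

```python
import math

def leading_digit(x):
    """First significant decimal digit of a positive float."""
    return int(f"{x:.16e}"[0])

def digit_frequencies(values):
    """Fraction of values starting with each digit 1..9 (index 0 is digit 1)."""
    counts = [0] * 10
    for v in values:
        counts[leading_digit(v)] += 1
    return [c / len(values) for c in counts[1:]]

# Assumed sample: evenly spaced exponents, so values are log-uniform on [1, 10^6)
N = 60_000
sample = [10 ** (6 * (k + 0.5) / N) for k in range(N)]

# Same quantities in a different unit (example factor: meters to feet)
rescaled = [3.28084 * v for v in sample]

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
freq_original = digit_frequencies(sample)
freq_rescaled = digit_frequencies(rescaled)

for d in range(1, 10):
    print(d, round(benford[d - 1], 4),
          round(freq_original[d - 1], 4), round(freq_rescaled[d - 1], 4))
```

Both columns of observed frequencies track the Benford column closely. Real data is never this tidy, so this only shows why the pattern survives a change of units, not how well any given data set fits.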

For any particular source of numbers, the explanation for why they fall into a category that matches Benford's law will differ. Benford's law is a property that mathematical models tend to have, rather than being a rigorous mathematical theorem.

(FYI Benford's law is something that I've known about, and thought about off and on, for close to 20 years now.)

Really good comments here. I'll try and refine my position which I did a poor job of explaining and post something in the next few weeks or months.