by mattnewport 2712 days ago
What's unsatisfactory is the way these companies present these results which implies a level of accuracy and precision which is unwarranted.

"She also has French and German ancestry (2.6 per cent) that her sister doesn't share."

I assume from this that 23andMe is reporting the results to this level of accuracy (a tenth of a percent). For the sister who got 2.6% they are strongly implying a certainty that she has a small amount of French and German ancestry rather than what actually seems to be the case which is a small chance of some French and German ancestry.

Since it seems their results are quite lacking in both precision and accuracy they should do a better job of reporting them and advertising them in a way that makes that clear.


I can only speak for myself, but when I got one of these tests 6 years ago, I never got such an impression. They've updated my results several times since and I never feel hoodwinked as if I had been previously deceived; I'm always just curious to see what the latest estimates are.

> I assume from this that 23andMe is reporting the results to this level of accuracy (a tenth of a percent). For the sister who got 2.6% they are strongly implying a certainty that she has a small amount of French and German ancestry rather than what actually seems to be the case which is a small chance of some French and German ancestry.

You picked out the single biggest discrepancy they report in this article, and yet even that doesn't seem to indicate an actual inconsistency to me. I would tend to interpret those results to mean that for one twin, they felt sufficiently confident to call a portion of their DNA "French and German", while in the other it fell below that confidence threshold so it got thrown into the "Broadly European" share.

This process already pretty much precludes objective perfection just based on what they're reporting. What does it mean for 2.6% of your ancestry to come from France? During what time period? What if it's on the border in one of those regions that switched hands several times? These are subjective decisions they have to make, but that doesn't invalidate the whole test: occasionally you have enough evidence to call a portion of your DNA "French and German", and sometimes you have to fall back to just calling it "Broadly European".
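
To make the threshold idea concrete, here is a minimal sketch of how two nearly identical samples could end up with different labels. Everything in it is hypothetical: the `label_segment` helper, the 0.8 cutoff, and the probabilities are invented for illustration and are not 23andMe's actual method or numbers.

```python
def label_segment(posteriors, threshold=0.8, fallback="Broadly European"):
    """Return the most likely population if it clears the threshold, else the fallback.

    posteriors: dict mapping a population name to an estimated probability.
    threshold:  hypothetical confidence cutoff (an assumption, not a real setting).
    """
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] >= threshold else fallback


# Made-up probabilities for the same DNA segment in two twins.
twin_a = {"French and German": 0.82, "British and Irish": 0.18}
twin_b = {"French and German": 0.78, "British and Irish": 0.22}

print(label_segment(twin_a))
print(label_segment(twin_b))
```

A difference of a few points in estimated confidence is enough to flip the label, even though the underlying estimates barely disagree.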

> This process already pretty much precludes objective perfection just based on what they're reporting. What does it mean for 2.6% of your ancestry to come from France? During what time period? What if it's on the border in one of those regions that switched hands several times? These are subjective decisions

I think that’s the whole point of the complaint here. These companies pretend to have found overly precise results (which I’m sure are explained away to nothing in the fine print) and yet are subjective and lean heavily on guesswork.

23andMe told me that I was a quarter Italian. They seem to have mixed that up with what I actually am: 50% Middle Eastern. For a while I wondered if there was some funny business in the family, but then my sister got the "right" percentages. This seems like a pretty huge error to me.
I know it might be assumed from the context of the post and thread, but are you talking about your twin sister or not? Just pointing out that the chance of funny business (in your family, at your birth, or at these companies) is unknown but not necessarily zero.
Well, Italian and ME populations were basically the same thing for more than a thousand years (basically until the Islamic Conquest promoted piracy on the Mediterranean, making it fundamentally unsafe). So it might well be that you are carrying genes that most of the Italian population also carries, and vice versa.
Is it possible one of you has a father who's 100% Middle Eastern, and the other has a father who's 50% Italian and 50% Middle Eastern?
I would argue that this is why significant digits count.

You can't say something is 2.1374cm long when your error is +/-1cm. You can say it is 2cm long though.
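
A minimal sketch of that convention in Python, assuming you want to round a value to the decimal place of its error's leading digit. The `round_to_uncertainty` helper is a hypothetical name, not a standard library function.

```python
import math


def round_to_uncertainty(value, uncertainty):
    """Round value (and uncertainty) to the decimal place of the
    uncertainty's leading digit."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Decimal position of the leading digit: 1 -> 0, 0.05 -> -2, 30 -> 1.
    exponent = math.floor(math.log10(uncertainty))
    ndigits = -exponent
    return round(value, ndigits), round(uncertainty, ndigits)


value, error = round_to_uncertainty(2.1374, 1)
print(f"{value:g} +/- {error:g} cm")
```

With an error of +/-1cm the trailing .1374 is dropped; with an error of +/-0.05cm the same helper would keep two decimals.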

Actually, you can. It's just a custom to do differently.

But if you estimate your first moment is 2.1374 and your second moment is 1, I want to know just that, because those are the numbers I'll plug into my equations. I will decide to round if and when necessary.

You're implying that by rounding before handing off the numbers, valuable information is lost. Actually, the reverse is true - the rounding itself contains valuable information (how many digits are actually relevant) which is a hidden property of the process that obtained them. Unless you provide it through a side channel, failing to round hides this information, for no benefit at all (because the digits you provide are known to be meaningless).

The only time this wouldn't be true is if you mistrust the significant figure calculation of the source, and believe you have sufficient insight to do better.

Yes, I do. In general, I stick to simple things. Enough samples and everything is normal, god bless the law of large numbers.

If the precision of your estimation is not a direct function of the standard deviation, but is a "hidden property of the process that obtained" it, we have much bigger problems than losing "valuable information".

> If the precision of your estimation is not a direct function of the standard deviation, but is a "hidden property of the process that obtained" it

I think you're confusing different types of error. There is error between measurements and an inherent error to the device you use to measure. There's also a difference between precision and accuracy.

Standard deviation measures the spread across multiple measurements. For example, if you measure something 10 times and get 51mm every time, then your standard deviation is 0.

But that doesn't mean you have no error.

The "property of the process that obtained it" is not hidden. A simple case is a ruler. You have lines on the ruler that tell you certain intervals. If the smallest interval on your ruler is 1mm, then all your calculations can be made to +/- 1mm (that is, up to 30.5cm on a standard 12in ruler). There is nothing hidden about this. All that is being said here is that your measuring device is not perfect.

So using the two errors, we have a measurement of 51mm +/- 1mm (or frequently in a short hand you'd just say 51mm). It would in fact be deceptive to say that your measurement was 51.0mm, because that implies that you have more precision than you actually have (implying that you have on the order of +/- 0.1mm precision).
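
Here is a small sketch of the two errors side by side, assuming the +/- one-division convention used above and combining the two in quadrature (a common convention, but an assumption here). The `measurement_uncertainty` helper is hypothetical.

```python
import statistics


def measurement_uncertainty(readings, resolution):
    """Return (mean, scatter, total) for repeated readings.

    scatter: sample standard deviation of the readings (repeatability).
    total:   scatter combined in quadrature with the instrument resolution
             (assumed here to be +/- one ruler division).
    """
    mean = statistics.fmean(readings)
    scatter = statistics.stdev(readings) if len(readings) > 1 else 0.0
    total = (scatter ** 2 + resolution ** 2) ** 0.5
    return mean, scatter, total


# Ten identical readings of 51mm on a ruler with 1mm divisions.
readings_mm = [51] * 10
mean, scatter, total = measurement_uncertainty(readings_mm, resolution=1)
print(f"{mean:.0f}mm +/- {total:.0f}mm (scatter alone: {scatter})")
```

The scatter comes out as zero, yet the reported error is still the 1mm imposed by the ruler.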

Assuming we are talking about software, error between measurements is a direct function of the device you use to measure, which is itself close to perfect.

Even if we go with the example you give, the measurement should be done n times, each reporting the exact result found, like 51.0, 51.9, 51.95, etc. Even if the decimals are finer than the smallest interval of your ruler: take enough of them and you can get closer to the actual length, which may be 51.55345 and which you would never have been able to measure anyway without a caliper.

The best thing is you can even do that by resampling old measurements (a process called bootstrapping).

So yes, if you remove the tenth of millimeters, you lose information.

What's wrong is not the number, but that custom makes people think 51.0 means 51.0 +/- 0.01 or some other tolerance, when that was never stated.
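
For anyone unfamiliar with bootstrapping, a minimal sketch of a percentile bootstrap for the mean follows. The readings are made up for illustration, and `bootstrap_mean_interval` is a hypothetical helper, not a library function.

```python
import random
import statistics


def bootstrap_mean_interval(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of samples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical readings in mm, recorded with more digits than the ruler shows.
readings_mm = [51.0, 51.9, 51.95, 51.4, 51.7, 51.3, 51.6, 51.8, 51.5, 51.45]
low, high = bootstrap_mean_interval(readings_mm)
print(f"mean = {statistics.fmean(readings_mm):.2f}mm, 95% interval = [{low:.2f}, {high:.2f}]")
```

Note that this only quantifies the scatter of the mean across resamples; it says nothing about a systematic bias in the instrument itself.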

Back in high school physics we would lose points if we indicated too high a precision in the numbers we used for calculations. It was considered plain wrong to say 2.232cm if you were actually only able to measure that it's roughly 2cm.
And that's good, because you'll get the wrong answer if you use a number with too many digits (nitpicking: more decimal places do not mean higher precision).

In fact, this is part of why you'll see physicists do all their reductions with variables and plug in numbers at the very end. This ensures that you don't get (what we could call) floating point errors. You don't have extra digits hanging around (from real numbers like 1/9 or pi). There are also other benefits to doing this.
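
A toy sketch of the difference, using 1/9 from above and assuming two decimal places of rounding. Both helper names are hypothetical, and `Fraction` stands in for "keeping it symbolic".

```python
from fractions import Fraction


def scale_rounded_early(n, places=2):
    """Round 1/n to `places` decimals first, then multiply back by n."""
    return round(1 / n, places) * n


def scale_rounded_late(n, places=2):
    """Keep 1/n exact as a Fraction, multiply by n, and round only at the end."""
    return round(float(Fraction(1, n) * n), places)


print(scale_rounded_early(9))  # close to 0.99: the early rounding error is carried along
print(scale_rounded_late(9))   # 1.0
```

Rounding 1/9 to 0.11 before multiplying by 9 loses about 1% of the answer, while carrying the exact value through and rounding once at the end does not.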

I'm not sure why anyone thinks significant digits in any way informs error rate, and it's a little hyperbolic to suggest 4 significant digits with a whole number error rate, when they did 1 significant digit with a ~3% to ?% error rate.
>I'm not sure why anyone thinks significant digits in any way informs error rate

In the majority of technical fields the standard is to round the result so that the significant digits do correspond to error rate.

https://www.ruf.rice.edu/~bioslabs/tools/data_analysis/error...

I'd expect that in academia for the most part. But this is a consumer product, so they aren't going to follow technical fields, and my hunch is that's for marketing reasons (it looks more authentic).

But even then, rounding to the error rate is standard practice, but failing to do so certainly doesn't make the result wrong or even unethical; it's just pointless not to round. Unless, of course, your motivation is driven by marketing/sales.

> it looks more authentic

One might call that deception