by nicklecompte 909 days ago
The problem is that it's not just "presentation": being sloppy about significant digits (or precision generally) early in the computation leads to bad statistical reasoning much later in the problem.

If your variable is x +/- 0.05, then 1/(x +/- 0.05) != 1/x +/- 0.05. If you're not careful about this when doing computations, you'll end up with answers that aren't actually meaningful. The computational implementation of these equations is only concerned with machine epsilon, but each one of those 20 database columns has a real-world +/- delta which isn't being correctly considered.
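To make the reciprocal point concrete, here is a minimal sketch. The value x = 2.0 is an arbitrary assumption chosen for illustration, and `reciprocal_bounds` is a hypothetical helper, not part of any library:

```python
def reciprocal_bounds(x, delta):
    """Return (low, high) bounds of 1/x for x in [x - delta, x + delta].

    Assumes the whole interval is strictly positive, so 1/x is decreasing on it.
    """
    if x - delta <= 0:
        raise ValueError("interval must be strictly positive")
    return 1.0 / (x + delta), 1.0 / (x - delta)


x, delta = 2.0, 0.05  # illustrative values, not from any real dataset
low, high = reciprocal_bounds(x, delta)
half_width = (high - low) / 2

# First-order estimate: |d(1/x)/dx| * delta = delta / x**2
first_order = delta / x**2

print(f"1/x lies in [{low:.4f}, {high:.4f}]")
print(f"half-width {half_width:.4f}, first-order estimate {first_order:.4f}")
```

For these inputs the uncertainty on 1/x comes out near 0.0125, a quarter of the 0.05 on x itself, so carrying the input delta through unchanged would be wrong.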

But error propagation is not carried by significant figures. Writing x and y with the correct number of significant figures will not lead to f(x, y) being written with the correct number of significant figures.

Usually, the best approach is to propagate the uncertainty, for example by saving the uncertainty as another variable in the database and using it directly whenever the number is used. If you do that, there is no practical need to spend time formatting the numbers. Using significant figures seems like a "cheap trick" that risks misleading you more often than it helps.
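As a rough sketch of what "saving the uncertainty as another variable" could look like in code: the `Measured` type below is hypothetical, and it assumes the two inputs have independent errors that combine in quadrature (standard first-order propagation). Correlated inputs, such as multiplying a value by itself, would need different handling.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Measured:
    """A value stored together with its absolute uncertainty (hypothetical type)."""
    value: float
    sigma: float

    def __add__(self, other):
        # Independent absolute uncertainties add in quadrature for sums.
        return Measured(self.value + other.value,
                        math.hypot(self.sigma, other.sigma))

    def __mul__(self, other):
        # First-order propagation for a product of independent quantities:
        # sigma^2 = (b * sigma_a)^2 + (a * sigma_b)^2
        sigma = math.hypot(other.value * self.sigma, self.value * other.sigma)
        return Measured(self.value * other.value, sigma)


# Two illustrative columns, each stored with its own uncertainty.
width = Measured(12.03, 0.01)
height = Measured(1.00, 0.01)
area = width * height
print(f"{area.value:.4f} +/- {area.sigma:.4f}")
```

The uncertainty travels with the number through each operation, so nothing depends on how many digits were printed along the way.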

> But error propagation is not carried by significant figures. Writing x and y with the correct number of significant figures will not lead to f(x, y) being written with the correct number of significant figures.

Significant figures are not a convention for making your deliverable pretty. They have semantic meaning. Don't think about dumb rules from high school chemistry, think about the actual problem. There are two entwined sources of uncertainty I am referring to:

1) measurement uncertainty, due to a lack of precision in the instrument (or the quantity itself, e.g. many financial computations are not meaningful if they involve fractional cents)

2) computational uncertainty, which is exclusively due to algebraic propagation of measurement uncertainty

Far too many data scientists don't care about the first category of uncertainty because they don't care about where the data came from. And they don't even realize the second category is a problem.

Let's look at a specific example. Somebody tells us that they measured the side of a square as 1.00m. Their tape measure only went down to centimeters, so the uncertainty is +/- 0.01m. What is the area of this square? Let's look at it two ways:

1) The smallest possible side length is 1.00m - 0.01m = 0.99m, so the smallest possible area is 0.98m^2. The largest possible side length is 1.01m, so the largest possible area is 1.02m^2. Thus the area is 1.00m^2 +/- 0.02m^2.

2) The side length is (1.00m +/- 0.01m). So the area is

(1.00m +/- 0.01m)(1.00m +/- 0.01m) = 1.00m^2 +/- 0.02m^2 +/- 0.0001m^2 ~ 1.00m^2 +/- 0.02m^2

So the uncertainty is not +/- 0.01, it is +/- 0.02. This can add up quite dramatically. In general if you have x +/- delta, then f(x +/- delta) is not going to be f(x) +/- delta or f(x) +/- f(delta). It needs to be handled carefully.
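Both approaches to the square can be checked numerically. This is a small sketch with hypothetical helper names; the first-order rule used here, uncertainty ≈ |f'(x)| * delta, is the usual linearization and only holds when delta is small relative to x:

```python
def area_bounds(side, delta):
    """Worst-case bounds on side**2 for side in [side - delta, side + delta].

    Assumes side - delta >= 0, so squaring is increasing on the interval.
    """
    return (side - delta) ** 2, (side + delta) ** 2


def first_order_uncertainty(dfdx, x, delta):
    """Linearized estimate of the uncertainty in f(x): |f'(x)| * delta."""
    return abs(dfdx(x)) * delta


side, delta = 1.00, 0.01  # the measurement from the example above
low, high = area_bounds(side, delta)

# For f(s) = s**2 the derivative is 2*s.
linear = first_order_uncertainty(lambda s: 2 * s, side, delta)

print(f"area in [{low:.4f}, {high:.4f}] m^2")
print(f"first-order uncertainty: +/- {linear:.2f} m^2")
```

The min/max bounds are 0.9801 and 1.0201, and the linearized estimate is 0.02, so the two methods agree to the precision that matters here.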

You're talking like sig figs is error propagation, but it isn't. That's the point. The correct solution is error propagation (with appropriate estimates of the errors of the inputs), not arbitrarily rounding numbers at each step, which is a terrible waste of time and likely to introduce its own errors compared to error propagation.

> You're talking like sig figs is error propagation, but it isn't.

No, that is the exact opposite of what I said! For starters, "uncertainty" and "error" are not the same thing here. I am saying that the significant figures in a measurement encapsulate an inherent, measurement-specific uncertainty, and that this uncertainty must be considered when doing calculations with that measurement. Just like the person I responded to, I don't think you've thought about why significant figures actually exist in the first place.

> The correct solution is error propagation (with appropriate estimates of the errors of the inputs), not arbitrarily rounding numbers at each step

Nowhere in my comment did I arbitrarily round anything. I thoughtfully propagated the uncertainty, which is why it was +/- 0.01m in the measurement and +/- 0.02m^2 in the calculation.

The whole point of my argument is that uncertainties in calculated quantities can be rigorously determined from the uncertainty of the inputs, and measurement inputs have an uncertainty determined by their significant figures. On the other hand, ignoring significant figures in calculations means we're ignoring a potential source of uncertainty in downstream analysis. If you think significant figures are about "arbitrarily rounding something" then you are thoughtlessly applying high school chemistry rules. Please read this carefully:

If I measure something with a meterstick that is broken down into centimeters, that measurement has an inherent uncertainty of either +/- 1cm or +/- 0.5cm - which one you use is a problem-specific choice similar to p95 or p99 for statistical significance (if it was a physical meterstick I'd choose 1cm because human eyeballs aren't very good; if it was laser inference I'd choose 0.5cm).

So if I am a data scientist with a database of direct measurements from a meterstick, each one has an inherent uncertainty of +/- 0.01m that's implied by the data source even if it's not in the database. This is the entire point of representing the data as 12.03m, 1.00m, etc, instead of 1.234m. If you represented a measurement as 1.234m that would imply your meterstick could measure millimeters, but it probably can't. So 1.234m isn't merely against the rules, it's inaccurate.
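As an illustration of how the written form carries the implied uncertainty, here is a sketch using Python's standard `decimal` module. The helper `implied_uncertainty` is hypothetical, and the choice between one unit and half a unit of the last digit is the problem-specific choice described above:

```python
from decimal import Decimal


def implied_uncertainty(recorded, half_unit=False):
    """Uncertainty implied by the last written digit of a decimal string.

    `recorded` must be the value as written (e.g. "1.00"); a float would
    already have lost the trailing zeros that carry the information.
    """
    exponent = Decimal(recorded).as_tuple().exponent
    unit = Decimal(1).scaleb(exponent)  # one unit in the last written place
    return unit / 2 if half_unit else unit


for reading in ["12.03", "1.00", "1.234"]:
    print(reading, "+/-", implied_uncertainty(reading))
```

Note that this only works while the value is still a string or a `Decimal`. Once it is stored as a double, "1.00" and "1" are indistinguishable, which is exactly the information a database of doubles throws away.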

If you take a measured side length of 1.00m and say the calculated area is 1.00m^2, then naively someone might think the uncertainty in the area is +/- 0.01m^2 based on thoughtlessly applying dumb high school chemistry rules. But that's not true: the uncertainty in the calculated area is in reality +/- 0.02m^2. The measurement can be presented without an explicit +/- because the significant figures act as a "shorthand" and we don't need to do calculations to estimate the uncertainty. But the calculation must present a calculated uncertainty.

Programmers and data scientists are lazy about significant figures because they don't care where the data is coming from. To them it's all doubles in a database, and significant figures are just a matter of rounding things correctly at the end. The area-of-a-square argument proves that this is a mistake.

I still don't get it.

You are explaining error propagation, but my point is that _if you are doing error propagation (as you should do if you want to do things properly), significant figures ARE just for making the deliverable pretty_.

You are talking about measurement uncertainty. Measurement uncertainty is written x +- y, with y being the uncertainty.

If you don't do that and use significant digits instead, you lose information and precision: 10.0 +- 0.1 is 10.0, 10.0 +- 0.2 is 10.0, 10.0 +- 0.3 is 10.0, ...
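A tiny sketch of that information loss, using made-up readings that match the ones listed above:

```python
# Three readings with the same value but different uncertainties.
readings = [(10.0, 0.1), (10.0, 0.2), (10.0, 0.3)]

# Writing only the value to one decimal place gives the same string each time.
as_sig_figs = {f"{value:.1f}" for value, _ in readings}

# Writing the uncertainty explicitly keeps the three readings distinct.
explicit = {f"{value:.1f} +- {sigma:.1f}" for value, sigma in readings}

print(as_sig_figs)    # a single distinct string
print(len(explicit))  # three distinct strings
```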

This is why the other person was talking about "arbitrarily rounding".

You should _never_ say "well, it's a measurement of 10.0 with a precision of 0.2, so I can write 10.0"; you should _always_ write 10.0 +- 0.2 (in which case you can also write 10 +- 0.2 or 10.000 +- 0.2, since the significant digits have no impact on any future results). Writing 10.0 instead of 10.0 +- 0.2 is just a terrible practice without much justification; 10 +- 0.2 is always the better way. (And my point is that the problem you have with significant figures disappears if you teach people a non-clumsy notation.)

(And no, you should not make the distinction "it's a measurement, so it's written differently", because in practice a lot of "measurements" are in fact already transformations, and sometimes you cannot even know for sure yourself. For example, a temperature sensor measures an electrical resistance (with a measurement uncertainty) and then converts it into a temperature; according to you, the two should not be written the same way, for arbitrary reasons.)

I don't think you understand what a measurement is. There's a very good, very short book that explains in more detail what I am talking about, in the context of physics experiments: https://www.amazon.com/Practical-Guide-Data-Analysis/dp/0521...