by cauch 909 days ago
But error propagation is not carried by significant figures. Writing x and y with the correct number of significant figures does not mean f(x, y) will come out with the correct number of significant figures.

Usually, the best approach is to propagate the uncertainty, for example by saving the uncertainty as another variable in the database and using it directly whenever the number is used. If you do that, there is no practical need to spend time formatting the numbers. Using significant figures seems like a "cheap trick" that risks misleading you more often than it helps.
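To make "saving the uncertainty as another variable" concrete, here is a minimal sketch. The `Measured` type and its field names are hypothetical, and the arithmetic assumes the two inputs have independent errors (first-order propagation, combined in quadrature). It is an illustration, not a definitive implementation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Measured:
    """A value stored together with its absolute uncertainty (hypothetical type)."""
    value: float
    sigma: float

    def __add__(self, other: "Measured") -> "Measured":
        # Independent errors assumed: absolute uncertainties add in quadrature.
        return Measured(self.value + other.value,
                        math.hypot(self.sigma, other.sigma))

    def __mul__(self, other: "Measured") -> "Measured":
        # First-order propagation for a product of independent quantities:
        # sigma^2 = (b * sigma_a)^2 + (a * sigma_b)^2
        return Measured(self.value * other.value,
                        math.hypot(other.value * self.sigma,
                                   self.value * other.sigma))


x = Measured(10.0, 0.2)
y = Measured(5.0, 0.1)
total = x + y
product = x * y
print(total)
print(product)
```

Note that `x * x` would understate the uncertainty here, because a quantity is fully correlated with itself; a real implementation has to track that.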


> But error propagation is not carried by significant figures. Writing x and y with the correct number of significant figures does not mean f(x, y) will come out with the correct number of significant figures.

Significant figures are not a convention for making your deliverable pretty. They have semantic meaning. Don't think about dumb rules from high school chemistry, think about the actual problem. There are two entwined sources of uncertainty I am referring to:

1) measurement uncertainty, due to a lack of precision in the instrument (or the quantity itself, e.g. many financial computations are not meaningful if they involve fractional cents)

2) computational uncertainty, which is exclusively due to algebraic propagation of measurement uncertainty

Far too many data scientists don't care about the first category of uncertainty because they don't care about where the data came from. And they don't even realize the second category is a problem.

Let's look at a specific example. Somebody tells us that they measured the side of a square as 1.00m. Their tape measure only went down to centimeters, so the uncertainty is +/- 0.01m. What is the area of this square? Let's look at it two ways:

1) The smallest possible side length is 1.00m - 0.01m = 0.99m, so the smallest possible area is 0.98m^2. The largest possible side length is 1.01m, so the largest possible area is 1.02m^2. Thus the area is 1.00m^2 +/- 0.02m^2.

2) The side length is (1.00m +/- 0.01m). So the area is

(1.00m +/- 0.01m)(1.00m +/- 0.01m) = 1.00m^2 +/- 0.02m^2 + 0.0001m^2 ~ 1.00m^2 +/- 0.02m^2

So the uncertainty is not +/- 0.01, it is +/- 0.02. This can add up quite dramatically. In general if you have x +/- delta, then f(x +/- delta) is not going to be f(x) +/- delta or f(x) +/- f(delta). It needs to be handled carefully.
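The two approaches to the square above can be checked numerically. This is only a sketch: the function and variable names are mine, and the second method uses the usual first-order approximation delta_f ~ |f'(x)| * delta, which is an assumption that holds when delta is small relative to x.

```python
def area(side: float) -> float:
    """Area of a square with the given side length."""
    return side * side


side, delta = 1.00, 0.01  # metres, taken from the example above

# 1) Brute-force bounds: evaluate the area at both ends of the interval.
#    This works because area() is increasing for positive side lengths.
low, high = area(side - delta), area(side + delta)

# 2) First-order propagation: delta_area ~ |d(area)/d(side)| * delta,
#    where d(area)/d(side) = 2 * side.
delta_area = abs(2 * side) * delta

print(f"bounds: [{low:.4f}, {high:.4f}] m^2")
print(f"first order: {area(side):.2f} +/- {delta_area:.2f} m^2")
```

Both methods agree on roughly +/- 0.02m^2, twice the naive +/- 0.01.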

You're talking like sig figs is error propagation, but it isn't. That's the point. The correct solution is error propagation (with appropriate estimates of the errors of the inputs), not arbitrarily rounding numbers at each step, which is a terrible waste of time and likely to introduce its own errors compared to error propagation.

> You're talking like sig figs is error propagation, but it isn't.

No, that is the exact opposite of what I said! For starters, "uncertainty" and "error" are not the same thing here. I am saying that the significant figures in a measurement encapsulate an inherent, measurement-specific uncertainty, and that this uncertainty must be considered when doing calculations with that measurement. Just like the person I responded to, I don't think you've thought about why significant figures actually exist in the first place.

> The correct solution is error propagation (with appropriate estimates of the errors of the inputs), not arbitrarily rounding numbers at each step

Nowhere in my comment did I arbitrarily round anything. I thoughtfully propagated the uncertainty, which is why it was +/- 0.01m in the measurement and +/- 0.02m^2 in the calculation.

The whole point of my argument is that uncertainties in calculated quantities can be rigorously determined from the uncertainty of the inputs, and measurement inputs have uncertainty determined by the significant figures. On the other hand, ignoring significant figures in calculations means we're ignoring a potential source of uncertainty in downstream analysis. If you think significant figures are about "arbitrarily rounding something" then you are thoughtlessly applying high school chemistry rules. Please read this carefully:

If I measure something with a meterstick that is broken down into centimeters, that measurement has an inherent uncertainty of either +/- 1cm or +/- 0.5cm - which one you use is a problem-specific choice similar to p95 or p99 for statistical significance (if it was a physical meterstick I'd choose 1cm because human eyeballs aren't very good; if it was laser inference I'd choose 0.5cm).

So if I am a data scientist with a database of direct measurements from a meterstick, each one has an inherent uncertainty of +/- 0.01m that's implied by the data source even if it's not in the database. This is the entire point of representing the data as 12.03m, 1.00m, etc, instead of 1.234m. If you represented a measurement as 1.234m that would imply your meterstick could measure millimeters, but it probably can't. So 1.234m isn't merely against the rules, it's inaccurate.
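As a rough sketch of what "implied by the data source" means in code: if readings are kept as text, the last written decimal place can be turned back into an uncertainty. The helper name is hypothetical, and it assumes the +/- 1 unit convention rather than +/- 0.5.

```python
from decimal import Decimal


def implied_uncertainty(text: str) -> Decimal:
    """Return one unit in the last written decimal place of `text`.

    Hypothetical helper. Assumes the last written digit marks the
    instrument's resolution, with an uncertainty of +/- 1 unit.
    """
    exponent = Decimal(text).as_tuple().exponent
    return Decimal(1).scaleb(exponent)


for reading in ["12.03", "1.00", "1.234"]:
    print(reading, "+/-", implied_uncertainty(reading))
```

This only works while the value is still a string or a `Decimal`. Once "1.00" is stored as a double, the trailing zeros, and the uncertainty they implied, are gone.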

If you take a measured side length of 1.00m and say the calculated area is 1.00m^2, then naively someone might think the uncertainty in the area is +/- 0.01m^2 based on thoughtlessly applying dumb high school chemistry rules. But that's not true: the uncertainty in the calculated area is in reality +/- 0.02m^2. The measurement can be presented without an explicit +/- because the significant figures act as a "shorthand" and we don't need to do calculations to estimate the uncertainty. But the calculation must present a calculated uncertainty.

Programmers and data scientists are lazy about significant figures because they don't care where the data is coming from; to them it's all doubles in a database, and significant figures are just a matter of rounding things correctly at the end. The area-of-a-square argument shows that this is a mistake.

I still don't get it.

You are explaining error propagation, but my point is that _if you are doing error propagation (as you should if you want to do things properly), significant figures ARE just for making the deliverable pretty_.

You are talking about measurement uncertainty. Measurement uncertainty is written x +- y, with y being the uncertainty.

If you don't do that and use significant digits instead, you lose information and precision: 10.0 +- 0.1 is 10.0, 10.0 +- 0.2 is 10.0, 10.0 +- 0.3 is 10.0, ...
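A small sketch of that collapse, using the three readings from the line above (the variable names are just for illustration):

```python
readings = [(10.0, 0.1), (10.0, 0.2), (10.0, 0.3)]

# Writing each value to one decimal place discards the uncertainty,
# so three different measurements become the same string.
as_sig_figs = {f"{value:.1f}" for value, _ in readings}

# Keeping the explicit "+-" form preserves three distinct records.
as_explicit = {f"{value:.1f} +- {sigma:.1f}" for value, sigma in readings}

print(as_sig_figs)
print(len(as_explicit))
```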

This is why the other person was talking about "arbitrarily rounding".

You should _never_ say "well, it's a measurement of 10.0 with a 0.2 precision, so I can write 10.0"; you should _always_ write 10.0 +- 0.2 (in which case you can also write 10 +- 0.2 or 10.000 +- 0.2, since the significant digits have no impact on any future results). Writing 10.0 instead of 10.0 +- 0.2 is just a terrible practice that does not have much justification; 10 +- 0.2 is always a better way. (And my point is that the problem you have with significant figures disappears if you teach people to use a non-clumsy notation.)

(And, no, you should not make the distinction "it's a measurement, so it's written differently", because in practice a lot of "measurements" are in fact already a transformation, and sometimes you cannot even know for sure yourself. For example, a temperature sensor will measure an electrical resistance (with a measurement uncertainty) and then convert it into a temperature, and according to you, the two should not be written the same way, just for arbitrary reasons.)
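To illustrate the temperature-sensor case, here is a sketch in which the "measured" temperature is already a calculated value. Everything specific is an assumption for illustration: the linear RTD model R = R0 * (1 + ALPHA * T), the nominal PT100-style constants, and the example reading of 107.79 +- 0.04 ohms.

```python
# Assumed linear RTD model (PT100-style): R = R0 * (1 + ALPHA * T).
R0 = 100.0       # ohms at 0 degrees C (assumed nominal value)
ALPHA = 0.00385  # per degree C (assumed nominal coefficient)


def temperature_from_resistance(r: float, sigma_r: float) -> tuple[float, float]:
    """Convert a resistance reading and its uncertainty to degrees C."""
    t = (r / R0 - 1.0) / ALPHA
    # dT/dR = 1 / (R0 * ALPHA), so the uncertainty scales by that factor.
    sigma_t = sigma_r / (R0 * ALPHA)
    return t, sigma_t


# Made-up example reading: 107.79 ohms with an uncertainty of 0.04 ohms.
t, sigma_t = temperature_from_resistance(107.79, 0.04)
print(f"{t:.2f} +- {sigma_t:.2f} C")
```

The sensor reports a temperature, but its uncertainty comes from the resistance reading and the conversion, not from how many digits the display happens to show.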

I don't think you understand what a measurement is. There's a very good, very short book that explains in more detail what I am talking about, in the context of physics experiments: https://www.amazon.com/Practical-Guide-Data-Analysis/dp/0521...

I think your notions are just too basic. It's a bit like in school when the teacher says "you should write all your sentences as subject + verb + complement". That is good at school, to teach students the basics and to put boundaries on what is being studied (you don't want students using more complex notions by accident and then having to cover everything in lesson one), but as soon as you become a professional writer, you realise it is better to ignore this rule.

I know the notion of measurement that you are trying to explain; I studied it when I was an undergraduate student. Since then, I have moved beyond this notion and use something better. It's not a matter of "you don't understand", it's rather a matter of "you understand too well, see the limits of this notion, and know that it's not useful for you anymore".

The book you shared seems to confirm that: it is for undergraduates. Things get more complicated in real-world practice, and the basic rules used to build that understanding need to be left behind. Undergraduate students do basic lab experiments with a ruler and a chronometer, and the goal is just to practice, not to answer a genuinely unknown question. In real life, no one needs to measure things as trivial as what they are measuring. Once people work on real measurements, they realise that the distinction between calculated value and measured value is meaningless and not helpful at all.

Again, as I've said, you just use x +- y and you don't have any problem. What problem would using x +- y cause that you would not have otherwise (knowing, of course, that you are educated enough to understand very complex notions and therefore already fully understand things as trivial as significant digits)?