Hacker News new | ask | show | jobs
by bhouston 2490 days ago
In my experience there are few things slower that float to string and string to float. And it seems so unnecessary.

I always implemented round to a specific digit based on the built-in roundss/roundsd functions which are native x86-64 assembler instructions (i.e. https://www.felixcloutier.com/x86/roundsd).

I do not understand why this would not be preferable to the string method.

float round( float x, int digits, int base) { float factor = pow( base, digits ); return roundss( x * factor ) / factor; }

I guess this has the effect of not working for numbers near the edge of it's range.

One could check this and fall back to the string method. Or alternatively use higher precision doubles internally:

float round( float x, int digits, int base ) { double factor = pow( base, digits ); return (float)( roundsd( x * factor ) / factor ); }

But then what do you do if you have a double rounded and want to maintain all precision? I think there is likely some way to do that by somehow unpacking the double into a manual mantissa and exponent each of which are doubles and doing this manually - or maybe using some type of float128 library (https://www.boost.org/doc/libs/1_63_0/libs/multiprecision/do...)...

But changing this implementation now could cause slight differences and if someone was rounding then hashing this type of changes could be horrible if not behind some type of opt-in.

3 comments

float to string is incredibly fast now - look at Ulf Adams’ Ryu and Ryu Printf algorithms, which I’ve used to implement C++17 <charconv> in Visual Studio 2019 (16.2 has everything implemented except general precision; the upcoming 16.4 release adds that).

I don’t know of truly fast algorithms for string to float, although I improved upon our CRT’s performance by 40%.

Formatting is much faster than before, but still terribly slow compared to simply rounding numbers using math and floor and ceiling appropriately.

Ryu is more than 100x slower than something like

rval = floor(100*val+0.5)/100.0

(which is not quite right due to numerical issues, but close, and illustrates the idea).

Formatting, to get a rounded float, is terribly slow.

I don't think converting is slow by itself depending on what you need done. I have a function in my code I wrote to convert strings of floats/doubles to a rounded string of however many digits you want(for storing financial data that was muxed with multiple streams), and converting 1.59688139452f to a string, and then rounding that string to the 5th decimal place took 8.759 seconds for 10 million iterations (87.5 nanoseconds/iteration). Granted, it was written for my specific use case, so I don't need to handle the various edge cases/formats. But, little is inherently slow if you take the time to write a solution for your own need.
Using native x86_64 instructions isn't portable.
Why not have optimized versions that use native instructions when available, and then fall back to the portable version when they are not?
I'm unsure as to why they don't do that, I suspect it's because nobody using Python has found floating-point rounding to be a bottleneck yet.
I have encountered situations where irregular rounding became solvable but annoyingly problematic to detect / calculate, in the LANL Earthquake dataset on Kaggle, it had a column with samples and a column with (incorrectly incrementing) sample times that were rounded. In order to create a corrected column, I noticed quite a lot of irregularities in the python rounding (or the underlying mechanism)

I also consider simultaneously deterministic, fast, portable rounding or binary formats to be important for decentralized delegation of computation, say a TrueBit style platform:

https://people.cs.uchicago.edu/~teutsch/papers/truebit.pdf

It's because Python is slow everywhere, so it's hard to find bottlenecks like this. This method is easily 100x slower than need be. If everything else were well written, this would stand out. Since lots of Python is written like this, none stand out.

Death by 1000 papercuts....

More specifically, cpython generally biases toward simpler implementations. I'd be surprised if there's any custom x86 assembly anywhere in cpython.
I suspect different behaviors might be an issue in some tricky cases.