by lessbergstein 898 days ago
I don't understand; it should be pretty easy. A rolling average with BigDecimal would probably be sufficient, but a scientific library might be better for a rolling average of more than a hundred million numbers.

https://stackoverflow.com/questions/277309/java-floating-poi...
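
For illustration, a minimal sketch of the single-pass average described above, written in Python rather than Java. `decimal.Decimal` stands in for BigDecimal here, and `running_mean` and the sample readings are made-up names and data, not anything from the challenge itself.

```python
from decimal import Decimal


def running_mean(values):
    """Average an iterable of decimal strings in one pass.

    Only a running sum and a count are kept, so memory use stays
    constant no matter how many values are streamed through.
    """
    total = Decimal(0)
    count = 0
    for v in values:
        total += Decimal(v)  # decimal addition, no binary rounding of "0.1"
        count += 1
    if count == 0:
        raise ValueError("no values to average")
    return total / count


if __name__ == "__main__":
    readings = ["0.1"] * 10  # hypothetical sample data
    print("Decimal mean:", running_mean(readings))
    print("float mean:  ", sum(float(r) for r in readings) / len(readings))
```

The float version accumulates a small rounding error on every addition; whether that matters for a given data set is a separate question from how fast the loop runs.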


The difficulty is creating the fastest implementation. If you look at the results of the submissions so far you’ll see a big difference in duration, between 11 seconds and more than 4 minutes.

11 seconds seems pretty impressive for a 12 GB file. It would be interesting to know which programming language could do it faster. For a database comparison, you’d probably want to include the time to load the data into your database for a fair comparison.
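
To put those durations in perspective, here is a back-of-the-envelope sketch in Python. The `throughput` helper is hypothetical; the 12 GB, 11 seconds, 4 minutes and one billion lines figures are the ones quoted in this thread.

```python
def throughput(file_gb, seconds, lines=1_000_000_000):
    """Return (GB per second, nanoseconds available per line)."""
    gb_per_s = file_gb / seconds
    ns_per_line = seconds * 1e9 / lines
    return gb_per_s, ns_per_line


if __name__ == "__main__":
    for label, secs in (("fastest", 11), ("slowest", 4 * 60)):
        gb_s, ns = throughput(12, secs)
        print(f"{label}: {gb_s:.2f} GB/s, {ns:.0f} ns per line")
```

At 11 seconds that works out to roughly 1.09 GB/s and 11 ns per line; at 4 minutes it is 0.05 GB/s and 240 ns per line.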

Perl would do it quite fast, and it has the benefit of accessing POSIX primitives directly.
A naive Perl solution is really, really slow compared to even the reference Java implementation. (I know, I've tried.)
That's strange, you should be able to stream the file right into a tiny Perl executable at the same speed as the bottlenecking hardware. The kernel will take care of all the logistics. You're probably trying to do too much explicitly. Just use a pipe. Perl should be done before the JIT completes.
Using cat to redirect the file to /dev/null takes 18s on my machine (a low-end NUC). Just running a no-op on the file in Perl (i.e. feeding it into a `while (<>)` loop but not acting on the contents) takes ~2 minutes.

1B lines is a lot, and Java ain't a slouch.
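
For anyone who wants to try the same comparison on their own machine, here is a rough sketch in Python (not Perl, so the absolute numbers will differ). It times a raw block read against a line-by-line loop that ignores the contents. All function names and the generated sample file are made up for illustration; swap in the real measurements file to get meaningful timings.

```python
import os
import tempfile
import time


def raw_read(path, block_size=1 << 20):
    """Read the file in large binary blocks and discard them; return bytes read."""
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def line_loop(path):
    """Iterate line by line without looking at the contents; return line count."""
    lines = 0
    with open(path, "rb") as f:
        for _ in f:
            lines += 1
    return lines


def timed(fn, path):
    """Run fn(path) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(path)
    return result, time.perf_counter() - start


def make_sample(n_lines):
    """Write a small stand-in file of 'name;value' lines and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        for i in range(n_lines):
            f.write(b"Station%d;%d.%d\n" % (i % 100, i % 50, i % 10))
    return path


if __name__ == "__main__":
    sample = make_sample(200_000)  # tiny compared with 1B lines
    try:
        for fn in (raw_read, line_loop):
            result, secs = timed(fn, sample)
            print(f"{fn.__name__}: {result} in {secs:.3f}s")
    finally:
        os.remove(sample)
```

On a file this small the page cache dominates, so treat the output as a sanity check of the method rather than a benchmark.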

Why are you using cat at all? Use a pipe. This isn't hard stuff. Don't use <>; feed the file into a scalar or array. It should only take a few seconds to process a billion lines.

https://www.perl.com/pub/2003/11/21/slurp.html/#:~:text=Anot....

I profiled my attempt; actually reading each line is the bottleneck.
Perl is always going to be much faster than Java at tasks like this. Use stdin and chomp() instead of reading each line explicitly.

This is really a small, trivial task for a Perl script. Even with a billion lines this is nothing for a modern CPU and Perl.

It’s easy to solve, but even fizzbuzz becomes complicated if you want double-digit GB/s output.
It's really not. We're talking about gigahertz CPUs and likely solid-state storage that can stream many GB/s, running through a Perl script. There really isn't much that is faster than that.