Hacker News new | ask | show | jobs
by derefr 2524 days ago
I don't know what Java's BufferedReader is doing, but it's probably not the optimal thing in terms of IO throughput. I would blame the algorithm long before blaming anything inherent about the JVM.

Erlang is another language where "naive" IO is kind of slow. https://github.com/bbense/beatwc/ is a project someone did to test various methods of doing IO in Erlang/Elixir, and their performance for a line-counting task, relative to the Unix wc(1) command.

It's interesting to see which approaches are faster. Yes, parallelism gains you a bit, but a much larger win comes from avoiding the stutter-stop effect of cutting the read buffer off whenever you hit a newline. Instead, the read buffer should be the same size as your IO source's optimal read-chunk size (a disk block; a TCP huge packet), and you should grab a whole buffer-ful of lines at a time, do a pattern-matching binary scan to collect all the indices of the newlines, and then use those indices to part the buffer out as slice references.

This achieves quite a dramatic speedup, since most of the time you don't need movable copies of the lines, and can copy the line (or more likely just part of it) yourself when you need to hold onto it.

This approach is probably also already built in to Java's "better" IO libraries, like NIO.

2 comments

Java NIO does exactly what you describe. IO is chunked optimally, the entire buffer is pulled, the API exposes a select() mechanism just like any libc, and the user is expected to frame its own received data.

edit: Lemire is showing a lack of what Fowler describes as "mechanical sympathy".

https://martinfowler.com/articles/lmax.html#QueuesAndTheirLa...

A bit of nitpicking on that: Even Javas nio API isn't very efficient. select() introduces a ton of garbage due to dynamically allocating lists for selected keys, and keys have a lot of synchronization overhead. That's why frameworks like Netty gained a lot of performance by building their own selectors using JNI.
I'm reminded to add, in the vein of the author's complaint, that there is a similar ridiculousness in Erlang land, that cannot be circumnavigated so easily: reading/writing to zlib-compressed files using Erlang's file:open(..., [compressed]) option—or generating/parsing zlib-compressed ETF binaries using erlang:binary_to_term(..., [compressed]))—holds [the moral equivalent of†] a global lock. Only one process can be zlib-compressing or zlib-decompressing a chunk of data at once, no matter how many cores your system has.

This means that, even when your data set compresses so well that you'd theoretically gain a ton of speed by having the data streamed from disk compressed, and then decompressed during parsing—this doesn't apply in practice, since you're introducing an artificial bottleneck in your IO reads.

I'm not actually sure if this is a bug in Erlang, per se, or if it's just the intended behavior and compressed file IO was never intended to be used for performance, only for e.g. embedded devices with tiny ROMs.

(If people here think it's a bug, I'll probably go to the effort at some point to profile the performance impact and submit it as a bug on https://bugs.erlang.org.)

† What it's actually doing, is that all zlib compression/decompression passes get sent to a zlib "port driver" as messages. Port drivers can handle multiple requests in-flight at once (they're the in-process equivalent to sockets—each Erlang process's port against the port-driver is its own "connection") but the zlib port driver shim is coded to expose zlib as a single-threaded, blocking, request-response style of server, rather than one that accepts connections in parallel and instantiates a separate zlib context for each separate connection it receives.

That's probably not true in the recent versions - in OTP 20 the zlib integration was reworked and is now based on NIFs (similar to Java's JNI) instead of port drivers.