So what's the reason for this? Is it maybe because of some Unicode shenanigans? Java characters are 16-bit IIRC, and strings have some forty bytes of constant overhead.
- At least one heap allocation for every line. After it finds the EOL, it first uses 'new String' followed by '.toString()'.
- The C++ version will almost certainly be backing on to memchr() behind the scenes, which will use SIMD instructions where it makes sense (e.g. a large enough scan size, probably true in this case). The Java version is a manually coded byte-by-byte loop.
- The C++ version is reusing its output buffer, so there are no reallocations as long as each line is the same length or shorter.
No idea about encodings in Java; maybe that is playing a role too.
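To make the buffer-reuse point concrete, here is a minimal sketch, not the benchmark code from the thread. `ReadStats` and `read_lines_reusing_buffer` are hypothetical names, and the sketch assumes that std::getline keeps the string's storage between calls. Mainstream standard libraries behave that way, but the standard does not strictly promise it.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper type: how many lines were read and how often the
// reused buffer had to grow (each growth implies a heap reallocation).
struct ReadStats {
    std::size_t lines = 0;
    std::size_t capacity_growths = 0;
};

ReadStats read_lines_reusing_buffer(std::istream& in) {
    ReadStats stats;
    std::string line;  // one buffer, reused for every line
    std::size_t last_capacity = line.capacity();
    // Assumption: getline empties `line` but keeps its allocated storage,
    // so a line no longer than an earlier one needs no new allocation.
    while (std::getline(in, line)) {
        ++stats.lines;
        if (line.capacity() > last_capacity) {
            ++stats.capacity_growths;
            last_capacity = line.capacity();
        }
    }
    return stats;
}
```

Fed a stream of equal-length lines, the growth count should stay at a handful no matter how many lines are read, whereas a version that constructs a fresh string per line pays for at least one allocation each time (once lines exceed the small-string buffer).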
I haven't looked recently, but several years ago I was shocked to discover that GNU libstdc++ didn't use strchr or memchr. It used a hand-coded for loop because it was a template for various kinds of character. There was no specialization for 8-bit char, either.
As a result std::string was disgustingly slow compared to C code.
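For illustration, the two scanning styles being contrasted look roughly like this. It is a sketch with hypothetical function names, not the actual libstdc++ source, and whether memchr really uses SIMD depends on the C library and the platform.

```cpp
#include <cstddef>
#include <cstring>

// Hand-written scan: examines one byte per iteration, similar in spirit
// to a generic character-template loop.
const char* find_newline_loop(const char* data, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        if (data[i] == '\n') {
            return data + i;
        }
    }
    return nullptr;  // no newline in the first `len` bytes
}

// Library scan: delegates to memchr, which common C libraries implement
// with vectorized code for sufficiently large inputs (an assumption
// about the platform, not a guarantee).
const char* find_newline_memchr(const char* data, std::size_t len) {
    return static_cast<const char*>(std::memchr(data, '\n', len));
}
```

Both functions return the same pointer for the same input; any speed difference comes from how many bytes each one inspects per step, and an optimizing compiler may or may not vectorize the first loop on its own.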
Yes, the reuse of the buffer in C++ seems likely and would probably explain a large part of the difference, but I don't know enough about the std::string implementation to be sure.
There is a layer of classes, so presumably there are multiple buffers and extra copying. It's also converting the underlying encoding to UTF-16, allocating String objects, and copying the data into them.