| HN Mirror



	by adrianN 2524 days ago
	So what's the reason for this? Is it maybe because of some Unicode shenanigans? Java characters are 16-bit IIRC, and strings have some forty bytes of constant overhead.

2 comments

d2mw 2524 days ago

I'm no Java ninja, but a few things jump out of https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c... :

- At least one heap allocation for every line. After it finds the EOL, it first uses 'new String' followed by '.toString()'.

- The C++ version will almost certainly be backing onto memchr() behind the scenes, which will use SIMD instructions where it makes sense (e.g. a large enough scan size, probably true in this case). The Java version is a hand-coded bytewise loop.

- The C++ version is reusing its output buffer, so there are no reallocations as long as each string is the same length or shorter.

No idea about encodings in Java; maybe that is playing a role too.
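
To make the allocation and buffer-reuse points concrete, here is a rough sketch of what a Java reader could look like if it only needs to find line boundaries. It is not the code from the linked JDK source or from the benchmark under discussion; `ByteLineCounter`, `countNewlines`, and the 64 KiB buffer size are hypothetical names and values picked for illustration. It reads raw bytes into one reusable array and never creates a String per line. The scan is still a plain loop, so whether the JIT vectorizes it is an open question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class ByteLineCounter {
    // Counts '\n' bytes. The buffer is allocated once and reused for every read,
    // and no String (or any other per-line object) is created.
    static long countNewlines(InputStream in) {
        byte[] buf = new byte[64 * 1024]; // assumed size, not tuned
        long count = 0;
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        count++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "one\ntwo\nthree\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(countNewlines(new ByteArrayInputStream(data))); // prints 3
    }
}
```

This skips charset decoding entirely, so it is only a fair comparison if the program does not need the line contents as Strings.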


zlynx 2524 days ago

I haven't looked recently, but several years ago I was shocked to discover that GNU libstdc++ didn't use strchr or memchr. It used a hand-coded for loop because it was a template for various kinds of character. There was no specialization for 8-bit char, either.

As a result std::string was disgustingly slow compared to C code.


adrianN 2524 days ago

Yes, the reuse of the buffer in C++ seems likely and would probably explain a large part of the difference, but I don't know enough about the std::string implementation to be sure.


soup10 2524 days ago

The StringBuffer and String allocation will make it slower for sure. I'm curious how the other ways of reading lines in Java perform.
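
For anyone who wants to try the comparison, here is a minimal sketch of three standard-library ways to read lines in Java. The class and method names (`LineReadingWays`, `viaReadLine`, and so on) and the temp-file helper are made up for the example; only the `java.nio.file.Files` and `BufferedReader` calls are real API. It makes no claim about which is faster: that would need a proper measurement, for instance with a harness such as JMH.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical class, for illustration only.
final class LineReadingWays {
    // 1. Classic loop over BufferedReader.readLine(): one String per line.
    static long viaReadLine(Path file) {
        try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            long count = 0;
            while (r.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // 2. Files.lines(): a lazy Stream<String>; close it to release the file.
    static long viaFilesLines(Path file) {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // 3. Files.readAllLines(): holds every line in memory at once.
    static long viaReadAllLines(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the demo: writes the content to a fresh temp file.
    static Path writeTemp(String content) {
        try {
            Path tmp = Files.createTempFile("lines", ".txt");
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = writeTemp("one\ntwo\nthree\n");
        try {
            System.out.println(viaReadLine(tmp));
            System.out.println(viaFilesLines(tmp));
            System.out.println(viaReadAllLines(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

All three still decode to Strings, so they share the per-line allocation cost mentioned above; they differ mainly in how much is held in memory at once.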


nitwit005 2524 days ago

There is a layer of classes, so presumably there are multiple buffers and extra copying. It's also converting the underlying encoding to UTF-16, allocating String objects, and copying the data into them.
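
The layering described above can be sketched roughly like this. It uses an in-memory byte array instead of a file, and `ReaderLayers` and `firstLine` are hypothetical names for the example. Each wrapper is a separate object with its own job: the stream hands over bytes, `InputStreamReader` decodes them into 16-bit `char` values, and `BufferedReader` buffers those chars and builds a new String for every line.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical class, for illustration only.
final class ReaderLayers {
    // Reads the first line through the usual three-layer stack.
    static String firstLine(byte[] utf8Bytes) {
        // Layer 1: raw bytes.
        InputStream raw = new ByteArrayInputStream(utf8Bytes);
        // Layer 2: decodes UTF-8 bytes into Java chars (UTF-16 code units).
        Reader decoder = new InputStreamReader(raw, StandardCharsets.UTF_8);
        // Layer 3: buffers chars and splits them into lines.
        try (BufferedReader lines = new BufferedReader(decoder)) {
            return lines.readLine(); // returns a newly allocated String
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "héllo" plus a newline: the é takes two bytes in UTF-8.
        byte[] input = "h\u00e9llo\n".getBytes(StandardCharsets.UTF_8);
        String line = firstLine(input);
        System.out.println(input.length);    // 7 bytes on the wire
        System.out.println(line.length());   // 5 chars after decoding
        System.out.println(Character.BYTES); // 2: a Java char is 16 bits
    }
}
```

One caveat: since JDK 9 a String may store Latin-1-only text one byte per character internally (compact strings), so the in-memory cost of the final String is not always two bytes per char, even though the `char` API is 16-bit.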
