| HN Mirror



	by haberman 5002 days ago
	> Or do you think ticking through arrays of hopefully-ASCII bytes, byte by byte, waiting for the 0 byte, is the fastest way to compare two strings?

	I'm sure you must know this, but this is not even remotely how strcmp() is implemented in modern libcs.


beagle3 5002 days ago

It is quite close, actually. "not even remotely" is a very strong statement.

There are tricks that let them do this 8 bytes at a time (on AMD64, 4 bytes on x86), but that doesn't change the fact that in order to compare two 128KB strings which are equal, you actually have to read 2*128KB from memory and compare each single byte (in groups of 8, if you are lucky enough with your alignments and instruction set).
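As a rough illustration of the "8 bytes at a time" idea, here is a minimal Python model of the well-known zero-byte bit trick applied to 64-bit words. It is a sketch, not any libc's actual code: the names `has_zero_byte` and `c_string_equal` are made up for this example, Python integers stand in for machine words, and the buffers are padded so that every 8-byte read stays in bounds (real implementations rely on alignment instead).

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
ONES = 0x0101010101010101   # 0x01 in every byte
HIGHS = 0x8080808080808080  # 0x80 in every byte


def has_zero_byte(word: int) -> bool:
    """Return True if any of the 8 bytes of a 64-bit word is 0x00."""
    return ((word - ONES) & ~word & HIGHS & MASK64) != 0


def c_string_equal(a: bytes, b: bytes) -> bool:
    """Compare two NUL-terminated buffers one 8-byte word at a time.

    Assumes each buffer contains a terminating NUL byte.
    """
    # Padding is an assumption of this model so that slices never run short.
    pad_a = a + b"\0" * 8
    pad_b = b + b"\0" * 8
    offset = 0
    while True:
        wa = int.from_bytes(pad_a[offset:offset + 8], "little")
        wb = int.from_bytes(pad_b[offset:offset + 8], "little")
        # Stop at the first word that differs or contains the terminator.
        if wa != wb or has_zero_byte(wa):
            break
        offset += 8
    # Resolve the final word byte by byte.
    for i in range(offset, offset + 8):
        if pad_a[i] != pad_b[i]:
            return False
        if pad_a[i] == 0:
            return True
    raise AssertionError("unreachable: the final word differs or holds a NUL")
```

Note that the loop still touches every byte of both strings when they are equal, which is the point being made above: the word trick reduces instruction count, not memory traffic.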

Different abstractions, such as Python's strings, can very often do this comparison with almost no memory access:

(a) if both strings are interned, it is enough to do a pointer comparison.

(b) if the lengths are not equal, the strings are not equal - a one word comparison.

(c) if both strings have been hashed before (quite likely), you can tell they are different if their hashes differ - a one word comparison.

(d) if length is equal, and hash is (equal or uncomputed), you will have to do the comparison.
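The four steps (a) to (d) can be sketched as follows. This models the list above and nothing more: the `Str` class, the `intern_str` helper, and the `compare` function are hypothetical names for this example and are not meant to describe what any particular Python implementation does internally.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(eq=False)
class Str:
    """Hypothetical string object: payload plus the metadata the list relies on."""
    data: bytes
    interned: bool = False
    cached_hash: Optional[int] = None  # None means "not hashed yet"


_intern_table: Dict[bytes, Str] = {}


def intern_str(s: Str) -> Str:
    """Return the single canonical object for this payload."""
    canonical = _intern_table.setdefault(s.data, s)
    canonical.interned = True
    return canonical


def compare(a: Str, b: Str) -> Tuple[bool, str]:
    """Return (equal?, name of the step that decided the answer)."""
    # (a) The same object is always equal; two interned strings are
    # equal only if they are the same object.
    if a is b:
        return True, "pointer"
    if a.interned and b.interned:
        return False, "pointer"
    # (b) Different lengths can never be equal.
    if len(a.data) != len(b.data):
        return False, "length"
    # (c) Two cached hashes that differ prove the strings differ.
    if (a.cached_hash is not None and b.cached_hash is not None
            and a.cached_hash != b.cached_hash):
        return False, "hash"
    # (d) Otherwise the payload has to be read and compared.
    return a.data == b.data, "bytes"
```

Only the last branch reads the string contents; the first three decide from a handful of words of metadata.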

Whether this trade-off is worthwhile depends on your application. If most of your strings are 7 characters or fewer (as is often the case for software dealing with e.g. stock tickers), then the C approach on 64-bit architectures wins hands down: you should actually store all the strings in place, because a pointer takes more memory and causes contention. However, if your strings tend to be 100 bytes or longer, and many of them share prefixes, the Python approach wins hands down.
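To make the short-string case concrete, here is a small sketch of packing a string of up to 7 ASCII characters into a single 64-bit value, so that comparing two strings is one integer comparison. `pack_ticker` and `unpack_ticker` are hypothetical helpers for this example. Python integers are themselves heap objects, so this only models the layout; in C the value would be a `uint64_t` stored inline.

```python
def pack_ticker(symbol: str) -> int:
    """Pack up to 7 ASCII characters into one 64-bit integer.

    Big-endian packing is chosen so that integer order matches
    byte-wise lexicographic order.
    """
    raw = symbol.encode("ascii")
    if len(raw) > 7 or b"\0" in raw:
        raise ValueError("expected at most 7 non-NUL ASCII characters")
    return int.from_bytes(raw.ljust(8, b"\0"), "big")


def unpack_ticker(word: int) -> str:
    """Inverse of pack_ticker: strip the NUL padding and decode."""
    return word.to_bytes(8, "big").rstrip(b"\0").decode("ascii")
```

With this layout, equality and ordering of two tickers need no pointer dereference at all.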


haberman 5002 days ago

> There are tricks that let them do this 8 bytes at a time (on AMD64, 4 bytes on x86)

It's actually 16 bytes at a time on any machine that supports SSE (maybe 32 bytes soon with AVX).

> Different abstractions, such as Python's strings, can very often do this comparison with almost no memory access:

Sure, different abstractions have different trade-offs. All of these abstraction possibilities are available to C. strlen() isn't "the C approach," it's just the most common one. Any C application where comparison of long almost-identical strings is important will surely use techniques similar to what Python does. On the other hand, the reverse is not true: Python does not have access to all the same optimizations that a C programmer could draw upon to do string processing.


beagle3 5002 days ago

I was mostly replying to your assertion that "that's not even remotely how strcmp is implemented", which, for most definitions of "even remotely", is false.

> All of these abstraction possibilities are available to C

That's a tautology at best, and meaningless at worst. The way strcmp() is implemented, which we discussed above, is not actually available in C.

> Any C application where comparison of long almost-identical strings is important will surely use techniques similar to what Python does.

And similarly, any Python application that requires (insert some uncommon requirement ..) can do what C can with the same kind of help that strcmp() gets - by delegating to the layer that does it best.

> Python does not have access to all the same optimizations that a C programmer could draw upon to do string processing.

Pure Python is more limited than C, true. But specific Python implementations (RPython, PyPy, IronPython) might have better access to some optimizations than specific C implementations.

And there's always the aspect of "what's theoretically possible" and "what happens in practice". The fact that PyPy will dynamically switch from 32-bit to 64-bit to unbounded-long-integer might make a real difference on a 32-bit machine where the code might occasionally require 2048 bits, but overwhelmingly requires just 32 bits.

It is possible to construct pathological cases where there are e.g. pow(2,128) possible type combinations within a function, and the exact combination is known only from the data (but is consistent for an entire run). In that case, PyPy will essentially compile the right program to machine code, whereas you cannot compile ahead of time because of the number of combinations, which means a C program will essentially be an interpreter based on those types.

But I don't care about theoretically constructed pathologies. In practice, especially with time-to-implement constraints, it is not true that a C programmer has all the tools at their disposal that higher level languages have.


haberman 5002 days ago

> I was mostly replying to your assertion that "that's not even remotely how strcmp is implemented", which, for most definitions of "even remotely", is false.

eglibc's SSE2 implementation of strcmp is just over 5k of machine code, whereas the simple implementation compiles to 56 bytes on x86-64. That was my definition of "not even remotely." I did not mean to imply that it was a fundamentally different algorithm, only that it was a far more sophisticated and optimized implementation of the same algorithm. My apologies if this was unclear or appeared overstated.

> That's a tautology at best, and meaningless at worst.

By "these abstraction possibilities" I meant the ones you mentioned, which is true.

> And similarly, any Python application that requires (insert some uncommon requirement ..) can do what C can with the same kind of help that strcmp() gets - by delegating to the layer that does it best.

That's great and I fully support that. What I am arguing against is high-level language cheerleading that discounts the importance of C (or assembly for that matter). Since you mention PyPy, I have to say that their PR is some of the worst in this regard; some of their "faster than C" claims are actively misleading, like this one that benchmarks some generated string formatting code against sprintf() (which re-parses the format string every time): http://morepypy.blogspot.com/2011/08/pypy-is-faster-than-c-a...


marshray 5002 days ago

> > There are tricks that let them do this 8 bytes at a time (on AMD64, 4 bytes on x86)

> It's actually 16 bytes at a time on any machine that supports SSE (maybe 32 bytes soon with AVX).

Is there a sequence of fewer than 16 instructions to spot a NUL byte inside the 16 byte block?


haberman 5001 days ago

> Is there a sequence of fewer than 16 instructions to spot a NUL byte inside the 16 byte block?

Yes:

    pxor  %xmm1, %xmm1       // xmm1 = 16 zero bytes
    pcmpeqb (mem), %xmm1     // Do 16 byte-wise compares; (mem) must be 16-byte aligned
    pmovmskb %xmm1, %eax     // Move results into the low 16 bits
    test %eax, %eax          // Any bit set means a NUL byte was seen
    jnz saw_null
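
For readers who do not read SSE2 assembly, the sequence above can be modeled step by step in Python. This is only an illustration of what the two vector instructions compute; the function names mirror the mnemonics, and `first_nul_index` is a hypothetical helper added for this example.

```python
from typing import Optional


def pcmpeqb_zero(block: bytes) -> bytes:
    """Model pcmpeqb against an all-zero register.

    Each output byte is 0xFF where the input byte is 0x00, else 0x00.
    """
    if len(block) != 16:
        raise ValueError("an XMM register holds exactly 16 bytes")
    return bytes(0xFF if b == 0 else 0x00 for b in block)


def pmovmskb(vec: bytes) -> int:
    """Model pmovmskb: collect the top bit of each byte into a 16-bit mask."""
    mask = 0
    for i, b in enumerate(vec):
        mask |= (b >> 7) << i
    return mask


def first_nul_index(block: bytes) -> Optional[int]:
    """Return the index of the first NUL byte in a 16-byte block, or None."""
    mask = pmovmskb(pcmpeqb_zero(block))
    if mask == 0:  # corresponds to the test / jnz pair not branching
        return None
    # Lowest set bit of the mask, as a bit-scan instruction would find it.
    return (mask & -mask).bit_length() - 1
```

For example, `first_nul_index(b"abc\0efghijklmnop")` yields 3, while a block with no NUL byte yields `None`.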


marshray 5001 days ago

Very cool.


aptwebapps 5002 days ago

What's a modern libc? Not being snide or anything: I have no idea. I didn't know where to look so I looked at glibc [1] and that, in fact, does seem to be how it works.

[1] http://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcm...


pbsd 5002 days ago

You might want to look at the non-generic implementations, like http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_...


darklajid 5002 days ago

This might be the most delicious piece of code ever written (I cannot judge that, really), but we're talking 2307 lines for a string comparison.

I'm impressed, but also scared and amused. I always look with envy at system level guys, lacking the knowledge to play on that level. This, though, comforts me quite a bit. That's just not my definition of 'fun'.


aptwebapps 5002 days ago

I've never written anything non-trivial in C, and even that was ages ago, so I didn't hope to be able to judge the niceties of different implementations, but I can't even tell what the algorithm is in that one.


pirateking 5002 days ago

I have heard of musl[1] and Bionic[2]. Interested in hearing about any others.

[1] http://www.musl-libc.org/

[2] http://en.wikipedia.org/wiki/Bionic_(software)


aptwebapps 5002 days ago

Those both seem to use basically the same algorithm as the glibc one, although a little more compactly written.


justincormack 5002 days ago

uClibc and dietlibc. Must recommend musl for code readability and implementation quality.
