| It is quite close, actually. "not even remotely" is a very strong statement. There are tricks that let them do this 8 bytes at a time (on AMD64, 4 bytes on x86), but that doesn't change the fact that in order to compare two 128KB strings which are equal, you actually have to read 2*128KB from memory and compare each single byte (in groups of 8, if you are lucky enough with your alignments and instruction set). Different abstractions, such as Python's strings, can very often do this comparison with almost no memory access: (a) if both strings are interned, it is enough to do a pointer comparison. (b) if the length is not equal, the strings are not equal - a one word comparison. (c) if both strings have been hashed before (quite likely), you can tell they are different if their hash is different - a one word comparison. (d) if length is equal, and hash is (equal or uncomputed), you will have to do the comparison. Whether this trade-off is worthy depends on your application. If most of your strings are 7-characters or less (as is often the case for software dealing with e.g. stock tickers), then the C approach on 64-bit archs wins hands down: you should actually have all the strings in-place because a pointer takes more memory and causes contention. However, if your strings tend to be 100 bytes or above, and many of them have equal prefixes, the Python approach wins hands down. |
It's actually 16 bytes at a time on any machine that supports SSE (maybe 32 bytes soon with AVX).
> Different abstractions, such as Python's strings, can very often do this comparison with almost no memory access:
Sure, different abstractions have different trade-offs. All of these abstraction possibilities are available to C. strlen() isn't "the C approach," it's just the most common one. Any C application where comparison of long almost-identical strings is important will surely use techniques similar to what Python does. On the other hand, the reverse is not true: Python does not have access to all the same optimizations that a C programmer could draw upon to do string processing.