|
|
|
|
|
by haberman
5002 days ago
|
|
> Or do you think ticking through arrays of hopefully-ASCII bytes, byte by byte, waiting for the 0 byte, is the fastest way to compare two strings? I'm sure you must know this, but this is not even remotely how strcmp() is implemented in modern libc's. |
|
There are tricks that let them do this 8 bytes at a time (on AMD64, 4 bytes on x86), but that doesn't change the fact that in order to compare two 128KB strings which are equal, you actually have to read 2*128KB from memory and compare each single byte (in groups of 8, if you are lucky enough with your alignments and instruction set).
Different abstractions, such as Python's strings, can very often do this comparison with almost no memory access:
(a) if both strings are interned, it is enough to do a pointer comparison.
(b) if the length is not equal, the strings are not equal - a one word comparison.
(c) if both strings have been hashed before (quite likely), you can tell they are different if their hash is different - a one word comparison.
(d) if length is equal, and hash is (equal or uncomputed), you will have to do the comparison.
Whether this trade-off is worthy depends on your application. If most of your strings are 7-characters or less (as is often the case for software dealing with e.g. stock tickers), then the C approach on 64-bit archs wins hands down: you should actually have all the strings in-place because a pointer takes more memory and causes contention. However, if your strings tend to be 100 bytes or above, and many of them have equal prefixes, the Python approach wins hands down.