I wonder how much JIT compilation would help, without any hand-optimization? e.g. it'd do the initial inlining.
I've been amazed at Java's speedup over multiple runs: dead-slow on startup, then improving rapidly over the next 10-20 runs, and even keeps improving slowly after that. It's a bit magical. Much (all?) of that JIT tech should be applicable to Python, I'd think.
Lots of function calls in the worst case made it seem unlikely to be a general solution. So I held off on testing that before I'd explored other paths.
It seemed like the concatenation was the primary bottleneck in this case.
Also, its worth noting that percentage gains on performance have huge cost savings on infrastructure at scale. That's why blogs like this are valuable because the user experience improves while the cost to provide it is reduced.
if ((x >= '0' && x <= '9') ||
(x >= 'A' && x <= 'Z') ||
(x >= 'a' && x <= 'z'))
which I suspect would be much faster than any hash-table implementation. Also I believe that should work for UTF-8 as well as ascii. I realise that this makes it harder to expand the whitelist. I'm not very familiar with python, is there something similar that could be done?
if ((x >= 'a' && x <= 'z') ||
(x >= 'A' && x <= 'Z') ||
(x >= '0' && x <= '9'))
As in a regular text file you're more likely to hit an alphanumerical character than a number. If the first OR statement is true, any good compiler will skip the next two clauses.
Would it? Loading a single byte should still be fast (in the absolute worst case, it's a single shift and a mask on top of the cost of loading 4 or 8 bytes) and using less memory means it's much more likely for the table to remain in cache (and evict less other stuff). In fact, I'd go so far as to wager that a uint8_t[16] (or uint64_t[4] or whatever) with one bit per character, manually extracted, would be the fastest way to do it. Not that I've tested this or anything, and I could certainly be wrong.
I hate to break up perfectly good hand-waving with numbers but I put together a test. On my Core i7 with an 18mb input, which is a large novel duplicated many times, the performance factors, normalized to the if-based version are:
Current L1 caches are 32k. There's an awful lot of data that would like to be in there -- If you're going to talk about micro-optimization, it's something to keep an eye on.
You couldn't in C actually without making assumptions the language doesn't guarantee. The language specifies that you can't guarantee the alphabet is in order like that. Ctypes.h provides functions for checking letters though </pedantic>
Language? The alphabet order is related to the encoding, and UTF-8 (which is what they're probably using) certainly does ensure that, since it's backwards compatible with ASCII.
Then you write your own optimization for that case -- for most people assuming that 0-9, A-Z, and a-z will work just fine. Chances are slim that you're running on a system using EBCDIC without knowing it.
In python there's no such thing as a char, there's only single-character strings. The above would actually be doing 6 string comparisons. That may still be faster... or may not. Measuring is the only way to be sure.
since there are only ascii chars in the whitelist you could just declare an array of size 256 (doesn't really matter if it's a bitvector or char array, int array or whatever) and replace in WHITELIST by
return (CHAR_MAP_ARRAY[c] = 'y')
Any UTF-8 start or follow byte shoudl be > 128 where there should always be values indicating false
I wonder whether it'll be more efficient to just have a big dict mapping every character to what it should look like post-escape. (e.g. {'a': 'a', '(': '&#(;', ... })
Then in your loop you're only making dict lookups.
Yeah, I wish the test text was posted to try. A custom dict that adds unicode characters on demand would have such a nice simple implementation that it was my first thought when looking at the problem. A couple lines in the class and then only one to call. "".join([REPLACEMENTS[c] for c in s])
Or possibly faster than a dict lookup: An array or a list of strings where the index number is the ordinal number of the character. So `array[ord('(')] == '&#(;'`.
I think I remember that list indexing "list[x]" is o(n), while hash access "hash[x]" is o(1), but my Martelli is not here around.
If I were the guy, I'd first ensure the final escaped result is cached in a key-value store, then I'll check if this func is the real bottleneck. If so, I might also have tried accessing it's content as a byte array. Then, if the file is of Asian origin (ascii being very low minority), I'd bulk escape it with the "&#x%s;" trick. It is rare to have documents with even mix of ascii/latin and other glyph, so I makes sense to have two functions, like he did.
Unfortunately, the ord() function call overhead in Python may swamp the difference in speed between a dict lookup (hashtable) and a list lookup (ordinal index).
Interesting story, with a very good speedup. Though I personally wouldn't be this patient, and would write a simple function like this in cython or even the C API directly (especially as the Python code drifts further from idiomatic with each step...).
Writing it in Cython would have been my solution. No futzing around with 8 or 9 iterations. It would just be fast, and the code would still look clean (unlike the obfuscated garbage they ended up with in the blog post).
In Python, it's both - the basic structured-programming advice to avoid global variables is always good, but it's a specific quirk of Python that makes global variables (including built-in functions) slower than local variables.
Python has full dynamic scoping, which means that inside a function you can refer to any variables set in outer scopes. Because Python is a dynamic language, every time you refer to a variable, the Python interpreter looks for it in the local scope first, and then each enclosing scope until it hits the containing module. A local variable will always be found in the first iteration of that loop, a global variable will take at least two iterations.
Another CPython quirk is that global (module-level) variables are looked up in a dict, but a function's local variables normally get a reserved chunk of memory that's directly indexable.
A "global" variable is one that is not residing in the local scope. Global here is really non-local, which is a term that changed during the py3k transition. As you see, str and ord are "system" functions, but really are variables containing function objects. The same goes for WHITELIST, which is a variable (conventionally named as a constant) probably defined at the module level. All of those are not local to the function, hence more taxing to call in a loop.
Besides, the general advice is not to avoid collisions (you use both namespaces and locality to resolve that) as the non local stuff here really has a reason to be global, but non local stuff has to survive concurrency. It is the case here as the things called really are constants.
I haven't written any python in a while, but a 15% speedup from inlining a function call? Really? This reinforces my preference for statically typed languages.
I haven't written C in a while, but I dereferenced null and my program crashed. No stack trace, no exception? Really? This reinforces my preference for dynamic languages.
How about we stop painting pictures of languages from one specific difference.
I agree with you, not the parent, but I just want to clarify for posterity that it is generally possible to get a stacktrace from a crashed C program. :-)
I've been amazed at Java's speedup over multiple runs: dead-slow on startup, then improving rapidly over the next 10-20 runs, and even keeps improving slowly after that. It's a bit magical. Much (all?) of that JIT tech should be applicable to Python, I'd think.
BTW: link is to the comments, not the story