Hacker News new | ask | show | jobs
A Python Optimization Anecdote (tech.dropbox.com)
120 points by emwa 5350 days ago
13 comments

I wonder how much JIT compilation would help, without any hand-optimization? e.g. it'd do the initial inlining.

I've been amazed at Java's speedup over multiple runs: dead-slow on startup, then improving rapidly over the next 10-20 runs, and even keeps improving slowly after that. It's a bit magical. Much (all?) of that JIT tech should be applicable to Python, I'd think.

BTW: link is to the comments, not the story

Take a look at PyPy, which goes out of its way to produce incredible results.
I'm just amazed that it took ten iterations until regexps were even considered, especially considering that the set '[a-zA-Z0-9]' is involved…
Lots of function calls in the worst case made it seem unlikely to be a general solution. So I held off on testing that before I'd explored other paths.
What performance gain could be had from using cStringIO as described here:

http://www.skymind.com/~ocrow/python_string/

It seemed like the concatenation was the primary bottleneck in this case.

Also, its worth noting that percentage gains on performance have huge cost savings on infrastructure at scale. That's why blogs like this are valuable because the user experience improves while the cost to provide it is reduced.

By the end the primary overhead was dictionary lookup or iteration overhead, so I doubt this would have mattered
In C you could replace

    if x in WHITELIST
with

    if ((x >= '0' && x <= '9') ||
        (x >= 'A' && x <= 'Z') ||
        (x >= 'a' && x <= 'z'))
which I suspect would be much faster than any hash-table implementation. Also I believe that should work for UTF-8 as well as ascii. I realise that this makes it harder to expand the whitelist. I'm not very familiar with python, is there something similar that could be done?
If you want to be picky, I'ld go with:

    if ((x >= 'a' && x <= 'z') ||
        (x >= 'A' && x <= 'Z') ||
        (x >= '0' && x <= '9'))
As in a regular text file you're more likely to hit an alphanumerical character than a number. If the first OR statement is true, any good compiler will skip the next two clauses.
> If the first OR statement is true, any good compiler will skip the next two clauses.

This has nothing to do with the compiler, it's part of the language specification. If your compiler doesn't comply with this, it's broken.

If you want to be super-picky, then char[256] with 1/0 values for each character will be faster than 6 comparisons in the worst case.
And if you want to be super-super picky, int[256] on 32-bit or long[256] on 64-bit would be even faster.

(Although I'm not sure if lookup would be faster at all than a few comparisons, considering caches and everything, but this is not my domain.)

A jump address table could be even faster since it's not branching...
Would it? Loading a single byte should still be fast (in the absolute worst case, it's a single shift and a mask on top of the cost of loading 4 or 8 bytes) and using less memory means it's much more likely for the table to remain in cache (and evict less other stuff). In fact, I'd go so far as to wager that a uint8_t[16] (or uint64_t[4] or whatever) with one bit per character, manually extracted, would be the fastest way to do it. Not that I've tested this or anything, and I could certainly be wrong.
I hate to break up perfectly good hand-waving with numbers but I put together a test. On my Core i7 with an 18mb input, which is a large novel duplicated many times, the performance factors, normalized to the if-based version are:

    if: 1x
    byte-based lookup: 0.83x
    int-based lookup: 0.84x
    bit-vector based lookup: 0.93x
http://pastebin.com/5twfXfEt
Unless it bloats your data to the point where other stuff falls out of L1 cache.
Let's not go too far with that. L1 cache started with 8K on i486. I'm talking about 256 bytes.
Current L1 caches are 32k. There's an awful lot of data that would like to be in there -- If you're going to talk about micro-optimization, it's something to keep an eye on.
You couldn't in C actually without making assumptions the language doesn't guarantee. The language specifies that you can't guarantee the alphabet is in order like that. Ctypes.h provides functions for checking letters though </pedantic>
Language? The alphabet order is related to the encoding, and UTF-8 (which is what they're probably using) certainly does ensure that, since it's backwards compatible with ASCII.
Sure, that works for ASCII and UTF-8, but what if I'm on an IBM System Z machine? Then everything's in EBCDIC, and that's just a nightmare.
Then you read in ASCII or UTF-8, and don't do your I/O in EBCDIC. Also, you run Z/Linux.
Then you write your own optimization for that case -- for most people assuming that 0-9, A-Z, and a-z will work just fine. Chances are slim that you're running on a system using EBCDIC without knowing it.
Or, as was said, use ctypes.h, which is the Right Thing.
You could do that in Python using the ord function.

    white_ranges = [('0', '9'), ('A', 'Z'), ('a', 'z')]
    any([ord(low) <= ord(x) <= ord(high) for low, high in white_ranges])
In python there's no such thing as a char, there's only single-character strings. The above would actually be doing 6 string comparisons. That may still be faster... or may not. Measuring is the only way to be sure.
since there are only ascii chars in the whitelist you could just declare an array of size 256 (doesn't really matter if it's a bitvector or char array, int array or whatever) and replace in WHITELIST by

    return (CHAR_MAP_ARRAY[c] = 'y')
Any UTF-8 start or follow byte shoudl be > 128 where there should always be values indicating false
I wonder whether it'll be more efficient to just have a big dict mapping every character to what it should look like post-escape. (e.g. {'a': 'a', '(': '&#(;', ... })

Then in your loop you're only making dict lookups.

That's effectively what the cache is. Dictionary lookups are slow enough that we want to fast-path the case of English letters.
Yeah, I wish the test text was posted to try. A custom dict that adds unicode characters on demand would have such a nice simple implementation that it was my first thought when looking at the problem. A couple lines in the class and then only one to call. "".join([REPLACEMENTS[c] for c in s])
Or possibly faster than a dict lookup: An array or a list of strings where the index number is the ordinal number of the character. So `array[ord('(')] == '&#(;'`.
I think I remember that list indexing "list[x]" is o(n), while hash access "hash[x]" is o(1), but my Martelli is not here around.

If I were the guy, I'd first ensure the final escaped result is cached in a key-value store, then I'll check if this func is the real bottleneck. If so, I might also have tried accessing it's content as a byte array. Then, if the file is of Asian origin (ascii being very low minority), I'd bulk escape it with the "&#x%s;" trick. It is rare to have documents with even mix of ascii/latin and other glyph, so I makes sense to have two functions, like he did.

Python's lists are not linked lists. List indexing is O(1).
You are right, double checked in my Martelli.
Unfortunately, the ord() function call overhead in Python may swamp the difference in speed between a dict lookup (hashtable) and a list lookup (ordinal index).
It would be nice if the author profiled the code instead of just measuring the test templates time.
Interesting story, with a very good speedup. Though I personally wouldn't be this patient, and would write a simple function like this in cython or even the C API directly (especially as the Python code drifts further from idiomatic with each step...).
Writing it in Cython would have been my solution. No futzing around with 8 or 9 iterations. It would just be fast, and the code would still look clean (unlike the obfuscated garbage they ended up with in the blog post).
At the risk of exposing my ignorance, I thought

>Other “common wisdom”, like using locals instead of globals, yields relatively little gain.

this advice was typically more related to avoiding collisions with variable names, rather than performance?

In Python, it's both - the basic structured-programming advice to avoid global variables is always good, but it's a specific quirk of Python that makes global variables (including built-in functions) slower than local variables.

Python has full dynamic scoping, which means that inside a function you can refer to any variables set in outer scopes. Because Python is a dynamic language, every time you refer to a variable, the Python interpreter looks for it in the local scope first, and then each enclosing scope until it hits the containing module. A local variable will always be found in the first iteration of that loop, a global variable will take at least two iterations.

Another CPython quirk is that global (module-level) variables are looked up in a dict, but a function's local variables normally get a reserved chunk of memory that's directly indexable.
> Python has full dynamic scoping, which means that inside a function you can refer to any variables set in outer scopes.

You are describing lexical scope, not dynamic scope.

A "global" variable is one that is not residing in the local scope. Global here is really non-local, which is a term that changed during the py3k transition. As you see, str and ord are "system" functions, but really are variables containing function objects. The same goes for WHITELIST, which is a variable (conventionally named as a constant) probably defined at the module level. All of those are not local to the function, hence more taxing to call in a loop.

Besides, the general advice is not to avoid collisions (you use both namespaces and locality to resolve that) as the non local stuff here really has a reason to be global, but non local stuff has to survive concurrency. It is the case here as the things called really are constants.

This is slightly better, IMHO: http://www.python.org/doc/essays/list2str.html

Some best practices when optimizing CPython code:

* Re-evaluate your algorithm (an inefficient quicksort is still faster than an optimized bubblesort)

* Use Python functions and constructs implemented in C (ex. most builtins, list comprehensions)

* Move loops from outside functions to inside (function call overhead is high)

* Use try/except to handle uncommon cases rather than using conditional checks in a loop.

* Eliminate dots (attribute lookup) in tight loops (create a local alias if needed)

See also: http://wiki.python.org/moin/PythonSpeed/PerformanceTips

Why is string interpolation in Python so slow?
Good point which would be worth an article in itself. Also comparing the various string interpolation methods that Python has (% versus format).
Clean C > Ugly Python, if you're really going to force the issue out of Python's comfort zone (readability).
Dude, use PyPy. Seriously. Please. Sacrificing readability for this kind of work is not good.
I would love to see the results of a similar test case for PyPy.
I haven't written any python in a while, but a 15% speedup from inlining a function call? Really? This reinforces my preference for statically typed languages.
I haven't written C in a while, but I dereferenced null and my program crashed. No stack trace, no exception? Really? This reinforces my preference for dynamic languages.

How about we stop painting pictures of languages from one specific difference.

I agree with you, not the parent, but I just want to clarify for posterity that it is generally possible to get a stacktrace from a crashed C program. :-)