A Python Optimization Anecdote | HN Mirror

Y	Hacker News new \| ask \| show \| jobs

	A Python Optimization Anecdote (tech.dropbox.com)
	120 points by emwa 5350 days ago

13 comments

6ren 5350 days ago

I wonder how much JIT compilation would help, without any hand-optimization? e.g. it'd do the initial inlining.

I've been amazed at Java's speedup over multiple runs: dead-slow on startup, then improving rapidly over the next 10-20 runs, and even keeps improving slowly after that. It's a bit magical. Much (all?) of that JIT tech should be applicable to Python, I'd think.

BTW: link is to the comments, not the story

lloeki 5350 days ago

Take a look at PyPy, which goes out of its way to produce incredible results.

mhd 5350 days ago

I'm just amazed that it took ten iterations until regexps were even considered, especially considering that the set '[a-zA-Z0-9]' is involved…

pavpanchekha 5350 days ago

Lots of function calls in the worst case made it seem unlikely to be a general solution. So I held off on testing that before I'd explored other paths.

padobson 5350 days ago

What performance gain could be had from using cStringIO as described here:

http://www.skymind.com/~ocrow/python_string/

It seemed like the concatenation was the primary bottleneck in this case.

Also, its worth noting that percentage gains on performance have huge cost savings on infrastructure at scale. That's why blogs like this are valuable because the user experience improves while the cost to provide it is reduced.

pavpanchekha 5350 days ago

By the end the primary overhead was dictionary lookup or iteration overhead, so I doubt this would have mattered

Jabbles 5350 days ago

In C you could replace

    if x in WHITELIST

with

    if ((x >= '0' && x <= '9') ||
        (x >= 'A' && x <= 'Z') ||
        (x >= 'a' && x <= 'z'))

which I suspect would be much faster than any hash-table implementation. Also I believe that should work for UTF-8 as well as ascii. I realise that this makes it harder to expand the whitelist. I'm not very familiar with python, is there something similar that could be done?

bbrizzi 5350 days ago

If you want to be picky, I'ld go with:

    if ((x >= 'a' && x <= 'z') ||
        (x >= 'A' && x <= 'Z') ||
        (x >= '0' && x <= '9'))

As in a regular text file you're more likely to hit an alphanumerical character than a number. If the first OR statement is true, any good compiler will skip the next two clauses.

schrototo 5350 days ago

> If the first OR statement is true, any good compiler will skip the next two clauses.

This has nothing to do with the compiler, it's part of the language specification. If your compiler doesn't comply with this, it's broken.

viraptor 5350 days ago

If you want to be super-picky, then char[256] with 1/0 values for each character will be faster than 6 comparisons in the worst case.

DrJokepu 5350 days ago

And if you want to be super-super picky, int[256] on 32-bit or long[256] on 64-bit would be even faster.

(Although I'm not sure if lookup would be faster at all than a few comparisons, considering caches and everything, but this is not my domain.)

viraptor 5350 days ago

A jump address table could be even faster since it's not branching...

mikeash 5350 days ago

Would it? Loading a single byte should still be fast (in the absolute worst case, it's a single shift and a mask on top of the cost of loading 4 or 8 bytes) and using less memory means it's much more likely for the table to remain in cache (and evict less other stuff). In fact, I'd go so far as to wager that a uint8_t[16] (or uint64_t[4] or whatever) with one bit per character, manually extracted, would be the fastest way to do it. Not that I've tested this or anything, and I could certainly be wrong.

maximilianburke 5350 days ago

I hate to break up perfectly good hand-waving with numbers but I put together a test. On my Core i7 with an 18mb input, which is a large novel duplicated many times, the performance factors, normalized to the if-based version are:

    if: 1x
    byte-based lookup: 0.83x
    int-based lookup: 0.84x
    bit-vector based lookup: 0.93x

http://pastebin.com/5twfXfEt

ori_b 5350 days ago

Unless it bloats your data to the point where other stuff falls out of L1 cache.

viraptor 5350 days ago

Let's not go too far with that. L1 cache started with 8K on i486. I'm talking about 256 bytes.

ori_b 5350 days ago

Current L1 caches are 32k. There's an awful lot of data that would like to be in there -- If you're going to talk about micro-optimization, it's something to keep an eye on.

sausagefeet 5350 days ago

You couldn't in C actually without making assumptions the language doesn't guarantee. The language specifies that you can't guarantee the alphabet is in order like that. Ctypes.h provides functions for checking letters though </pedantic>

icebraining 5350 days ago

Language? The alphabet order is related to the encoding, and UTF-8 (which is what they're probably using) certainly does ensure that, since it's backwards compatible with ASCII.

CountHackulus 5350 days ago

Sure, that works for ASCII and UTF-8, but what if I'm on an IBM System Z machine? Then everything's in EBCDIC, and that's just a nightmare.

ori_b 5350 days ago

Then you read in ASCII or UTF-8, and don't do your I/O in EBCDIC. Also, you run Z/Linux.

maximilianburke 5350 days ago

Then you write your own optimization for that case -- for most people assuming that 0-9, A-Z, and a-z will work just fine. Chances are slim that you're running on a system using EBCDIC without knowing it.

beej71 5350 days ago

Or, as was said, use ctypes.h, which is the Right Thing.

JesseAldridge 5350 days ago

You could do that in Python using the ord function.

    white_ranges = [('0', '9'), ('A', 'Z'), ('a', 'z')]
    any([ord(low) <= ord(x) <= ord(high) for low, high in white_ranges])

vilya 5350 days ago

In python there's no such thing as a char, there's only single-character strings. The above would actually be doing 6 string comparisons. That may still be faster... or may not. Measuring is the only way to be sure.

bjoernbu 5350 days ago

since there are only ascii chars in the whitelist you could just declare an array of size 256 (doesn't really matter if it's a bitvector or char array, int array or whatever) and replace in WHITELIST by

    return (CHAR_MAP_ARRAY[c] = 'y')

Any UTF-8 start or follow byte shoudl be > 128 where there should always be values indicating false

cool-RR 5350 days ago

I wonder whether it'll be more efficient to just have a big dict mapping every character to what it should look like post-escape. (e.g. {'a': 'a', '(': '&#(;', ... })

Then in your loop you're only making dict lookups.

pavpanchekha 5350 days ago

That's effectively what the cache is. Dictionary lookups are slow enough that we want to fast-path the case of English letters.

Dylan16807 5350 days ago

Yeah, I wish the test text was posted to try. A custom dict that adds unicode characters on demand would have such a nice simple implementation that it was my first thought when looking at the problem. A couple lines in the class and then only one to call. "".join([REPLACEMENTS[c] for c in s])

cool-RR 5350 days ago

Or possibly faster than a dict lookup: An array or a list of strings where the index number is the ordinal number of the character. So `array[ord('(')] == '&#(;'`.

gbog 5350 days ago

I think I remember that list indexing "list[x]" is o(n), while hash access "hash[x]" is o(1), but my Martelli is not here around.

If I were the guy, I'd first ensure the final escaped result is cached in a key-value store, then I'll check if this func is the real bottleneck. If so, I might also have tried accessing it's content as a byte array. Then, if the file is of Asian origin (ascii being very low minority), I'd bulk escape it with the "&#x%s;" trick. It is rare to have documents with even mix of ascii/latin and other glyph, so I makes sense to have two functions, like he did.

kbd 5350 days ago

Python's lists are not linked lists. List indexing is O(1).

gbog 5350 days ago

You are right, double checked in my Martelli.

pwang 5350 days ago

Unfortunately, the ord() function call overhead in Python may swamp the difference in speed between a dict lookup (hashtable) and a list lookup (ordinal index).

pdhborges 5350 days ago

It would be nice if the author profiled the code instead of just measuring the test templates time.

wladimir 5350 days ago

Interesting story, with a very good speedup. Though I personally wouldn't be this patient, and would write a simple function like this in cython or even the C API directly (especially as the Python code drifts further from idiomatic with each step...).

pwang 5350 days ago

Writing it in Cython would have been my solution. No futzing around with 8 or 9 iterations. It would just be fast, and the code would still look clean (unlike the obfuscated garbage they ended up with in the blog post).

phren0logy 5350 days ago

At the risk of exposing my ignorance, I thought

>Other “common wisdom”, like using locals instead of globals, yields relatively little gain.

this advice was typically more related to avoiding collisions with variable names, rather than performance?

thristian 5350 days ago

In Python, it's both - the basic structured-programming advice to avoid global variables is always good, but it's a specific quirk of Python that makes global variables (including built-in functions) slower than local variables.

Python has full dynamic scoping, which means that inside a function you can refer to any variables set in outer scopes. Because Python is a dynamic language, every time you refer to a variable, the Python interpreter looks for it in the local scope first, and then each enclosing scope until it hits the containing module. A local variable will always be found in the first iteration of that loop, a global variable will take at least two iterations.

encukou 5350 days ago

Another CPython quirk is that global (module-level) variables are looked up in a dict, but a function's local variables normally get a reserved chunk of memory that's directly indexable.

adgar 5350 days ago

> Python has full dynamic scoping, which means that inside a function you can refer to any variables set in outer scopes.

You are describing lexical scope, not dynamic scope.

lloeki 5350 days ago

A "global" variable is one that is not residing in the local scope. Global here is really non-local, which is a term that changed during the py3k transition. As you see, str and ord are "system" functions, but really are variables containing function objects. The same goes for WHITELIST, which is a variable (conventionally named as a constant) probably defined at the module level. All of those are not local to the function, hence more taxing to call in a loop.

Besides, the general advice is not to avoid collisions (you use both namespaces and locality to resolve that) as the non local stuff here really has a reason to be global, but non local stuff has to survive concurrency. It is the case here as the things called really are constants.

mace 5349 days ago

This is slightly better, IMHO: http://www.python.org/doc/essays/list2str.html

Some best practices when optimizing CPython code:

* Re-evaluate your algorithm (an inefficient quicksort is still faster than an optimized bubblesort)

* Use Python functions and constructs implemented in C (ex. most builtins, list comprehensions)

* Move loops from outside functions to inside (function call overhead is high)

* Use try/except to handle uncommon cases rather than using conditional checks in a loop.

* Eliminate dots (attribute lookup) in tight loops (create a local alias if needed)

See also: http://wiki.python.org/moin/PythonSpeed/PerformanceTips

dgrant 5350 days ago

Why is string interpolation in Python so slow?

wladimir 5350 days ago

Good point which would be worth an article in itself. Also comparing the various string interpolation methods that Python has (% versus format).

jamwt 5350 days ago

Clean C > Ugly Python, if you're really going to force the issue out of Python's comfort zone (readability).

MostAwesomeDude 5350 days ago

Dude, use PyPy. Seriously. Please. Sacrificing readability for this kind of work is not good.

jacoblyles 5350 days ago

I would love to see the results of a similar test case for PyPy.

captain-asshat 5350 days ago

I haven't written any python in a while, but a 15% speedup from inlining a function call? Really? This reinforces my preference for statically typed languages.

pavpanchekha 5350 days ago

I haven't written C in a while, but I dereferenced null and my program crashed. No stack trace, no exception? Really? This reinforces my preference for dynamic languages.

How about we stop painting pictures of languages from one specific difference.

beej71 5350 days ago

I agree with you, not the parent, but I just want to clarify for posterity that it is generally possible to get a stacktrace from a crashed C program. :-)