I wonder whether it'll be more efficient to just have a big dict mapping every character to what it should look like post-escape. (e.g. {'a': 'a', '(': '&#(;', ... })
Then in your loop you're only making dict lookups.
Yeah, I wish the test text was posted to try. A custom dict that adds unicode characters on demand would have such a nice simple implementation that it was my first thought when looking at the problem. A couple lines in the class and then only one to call. "".join([REPLACEMENTS[c] for c in s])
Or possibly faster than a dict lookup: An array or a list of strings where the index number is the ordinal number of the character. So `array[ord('(')] == '&#(;'`.
I think I remember that list indexing "list[x]" is o(n), while hash access "hash[x]" is o(1), but my Martelli is not here around.
If I were the guy, I'd first ensure the final escaped result is cached in a key-value store, then I'll check if this func is the real bottleneck. If so, I might also have tried accessing it's content as a byte array. Then, if the file is of Asian origin (ascii being very low minority), I'd bulk escape it with the "&#x%s;" trick. It is rare to have documents with even mix of ascii/latin and other glyph, so I makes sense to have two functions, like he did.
Unfortunately, the ord() function call overhead in Python may swamp the difference in speed between a dict lookup (hashtable) and a list lookup (ordinal index).