Hacker News new | ask | show | jobs
by cool-RR 5352 days ago
I wonder whether it'll be more efficient to just have a big dict mapping every character to what it should look like post-escape. (e.g. {'a': 'a', '(': '&#(;', ... })

Then in your loop you're only making dict lookups.

3 comments

That's effectively what the cache is. Dictionary lookups are slow enough that we want to fast-path the case of English letters.
Yeah, I wish the test text was posted to try. A custom dict that adds unicode characters on demand would have such a nice simple implementation that it was my first thought when looking at the problem. A couple lines in the class and then only one to call. "".join([REPLACEMENTS[c] for c in s])
Or possibly faster than a dict lookup: An array or a list of strings where the index number is the ordinal number of the character. So `array[ord('(')] == '&#(;'`.
I think I remember that list indexing "list[x]" is o(n), while hash access "hash[x]" is o(1), but my Martelli is not here around.

If I were the guy, I'd first ensure the final escaped result is cached in a key-value store, then I'll check if this func is the real bottleneck. If so, I might also have tried accessing it's content as a byte array. Then, if the file is of Asian origin (ascii being very low minority), I'd bulk escape it with the "&#x%s;" trick. It is rare to have documents with even mix of ascii/latin and other glyph, so I makes sense to have two functions, like he did.

Python's lists are not linked lists. List indexing is O(1).
You are right, double checked in my Martelli.
Unfortunately, the ord() function call overhead in Python may swamp the difference in speed between a dict lookup (hashtable) and a list lookup (ordinal index).