| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by cool-RR 5352 days ago
	I wonder whether it'll be more efficient to just have a big dict mapping every character to what it should look like post-escape. (e.g. {'a': 'a', '(': '&#(;', ... }) Then in your loop you're only making dict lookups.

3 comments

pavpanchekha 5352 days ago

That's effectively what the cache is. Dictionary lookups are slow enough that we want to fast-path the case of English letters.

link

Dylan16807 5352 days ago

Yeah, I wish the test text was posted to try. A custom dict that adds unicode characters on demand would have such a nice simple implementation that it was my first thought when looking at the problem. A couple lines in the class and then only one to call. "".join([REPLACEMENTS[c] for c in s])

link

cool-RR 5352 days ago

Or possibly faster than a dict lookup: An array or a list of strings where the index number is the ordinal number of the character. So `array[ord('(')] == '&#(;'`.

link

gbog 5352 days ago

I think I remember that list indexing "list[x]" is o(n), while hash access "hash[x]" is o(1), but my Martelli is not here around.

If I were the guy, I'd first ensure the final escaped result is cached in a key-value store, then I'll check if this func is the real bottleneck. If so, I might also have tried accessing it's content as a byte array. Then, if the file is of Asian origin (ascii being very low minority), I'd bulk escape it with the "&#x%s;" trick. It is rare to have documents with even mix of ascii/latin and other glyph, so I makes sense to have two functions, like he did.

link

kbd 5352 days ago

Python's lists are not linked lists. List indexing is O(1).

link

gbog 5352 days ago

You are right, double checked in my Martelli.

link

pwang 5352 days ago

Unfortunately, the ord() function call overhead in Python may swamp the difference in speed between a dict lookup (hashtable) and a list lookup (ordinal index).

link