| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Jabbles 5350 days ago

In C you could replace

    if x in WHITELIST

with

    if ((x >= '0' && x <= '9') ||
        (x >= 'A' && x <= 'Z') ||
        (x >= 'a' && x <= 'z'))

which I suspect would be much faster than any hash-table implementation. Also I believe that should work for UTF-8 as well as ascii. I realise that this makes it harder to expand the whitelist. I'm not very familiar with python, is there something similar that could be done?

5 comments

bbrizzi 5350 days ago

If you want to be picky, I'ld go with:

    if ((x >= 'a' && x <= 'z') ||
        (x >= 'A' && x <= 'Z') ||
        (x >= '0' && x <= '9'))

As in a regular text file you're more likely to hit an alphanumerical character than a number. If the first OR statement is true, any good compiler will skip the next two clauses.

schrototo 5350 days ago

> If the first OR statement is true, any good compiler will skip the next two clauses.

This has nothing to do with the compiler, it's part of the language specification. If your compiler doesn't comply with this, it's broken.

viraptor 5350 days ago

If you want to be super-picky, then char[256] with 1/0 values for each character will be faster than 6 comparisons in the worst case.

DrJokepu 5350 days ago

And if you want to be super-super picky, int[256] on 32-bit or long[256] on 64-bit would be even faster.

(Although I'm not sure if lookup would be faster at all than a few comparisons, considering caches and everything, but this is not my domain.)

viraptor 5350 days ago

A jump address table could be even faster since it's not branching...

mikeash 5350 days ago

Would it? Loading a single byte should still be fast (in the absolute worst case, it's a single shift and a mask on top of the cost of loading 4 or 8 bytes) and using less memory means it's much more likely for the table to remain in cache (and evict less other stuff). In fact, I'd go so far as to wager that a uint8_t[16] (or uint64_t[4] or whatever) with one bit per character, manually extracted, would be the fastest way to do it. Not that I've tested this or anything, and I could certainly be wrong.

maximilianburke 5350 days ago

I hate to break up perfectly good hand-waving with numbers but I put together a test. On my Core i7 with an 18mb input, which is a large novel duplicated many times, the performance factors, normalized to the if-based version are:

    if: 1x
    byte-based lookup: 0.83x
    int-based lookup: 0.84x
    bit-vector based lookup: 0.93x

http://pastebin.com/5twfXfEt

ori_b 5350 days ago

Unless it bloats your data to the point where other stuff falls out of L1 cache.

viraptor 5350 days ago

Let's not go too far with that. L1 cache started with 8K on i486. I'm talking about 256 bytes.

ori_b 5350 days ago

Current L1 caches are 32k. There's an awful lot of data that would like to be in there -- If you're going to talk about micro-optimization, it's something to keep an eye on.

sausagefeet 5350 days ago

You couldn't in C actually without making assumptions the language doesn't guarantee. The language specifies that you can't guarantee the alphabet is in order like that. Ctypes.h provides functions for checking letters though </pedantic>

icebraining 5350 days ago

Language? The alphabet order is related to the encoding, and UTF-8 (which is what they're probably using) certainly does ensure that, since it's backwards compatible with ASCII.

CountHackulus 5350 days ago

Sure, that works for ASCII and UTF-8, but what if I'm on an IBM System Z machine? Then everything's in EBCDIC, and that's just a nightmare.

ori_b 5350 days ago

Then you read in ASCII or UTF-8, and don't do your I/O in EBCDIC. Also, you run Z/Linux.

maximilianburke 5350 days ago

Then you write your own optimization for that case -- for most people assuming that 0-9, A-Z, and a-z will work just fine. Chances are slim that you're running on a system using EBCDIC without knowing it.

beej71 5350 days ago

Or, as was said, use ctypes.h, which is the Right Thing.

JesseAldridge 5350 days ago

You could do that in Python using the ord function.

    white_ranges = [('0', '9'), ('A', 'Z'), ('a', 'z')]
    any([ord(low) <= ord(x) <= ord(high) for low, high in white_ranges])

vilya 5350 days ago

In python there's no such thing as a char, there's only single-character strings. The above would actually be doing 6 string comparisons. That may still be faster... or may not. Measuring is the only way to be sure.

bjoernbu 5350 days ago

since there are only ascii chars in the whitelist you could just declare an array of size 256 (doesn't really matter if it's a bitvector or char array, int array or whatever) and replace in WHITELIST by

    return (CHAR_MAP_ARRAY[c] = 'y')

Any UTF-8 start or follow byte shoudl be > 128 where there should always be values indicating false