Hacker News new | ask | show | jobs
by jamesaguilar 5687 days ago
Maybe. You need 256 bits for all the possible chars, and this solution is perhaps a little less intuitive. Also, the shift + and + cmpz is not necessarily going to be significantly faster than an address lookup + cmpz because of the latency of getting the array from memory or cache. Though it will use less cache space compared to the size of modern caches that's not enough of a win to matter much I would guess.
2 comments

I was assuming a 26 letter alphabet. At that point the shifting, anding and comparing would take place in registers.
True, although if there are lower case letters too it could become a problem. At any rate, I'm not sure that the register aspect would improve speed significantly because of the latency of getting the string from main memory. You're still going to get pipeline stalls as you reach portions of the string that are not in the cache.
Yeah, the registers will only be faster than manipulating some sort of bool array. Reading the original string (with associated cache stalls) will certainly be necessary.
Doing it this way will use the optimal amount of memory (in the worst case) (assuming an O(n+m) answer). There are exactly 26 bits of state that need to be tracked (where "bit" is used in the information theoretic sense of the word). If you choose to store this state as 26 actual bits, then you win.

Multiplying together the first 26 primes requires that you can store a number up to 232862364358497360900063316880507363070. log2 of that number is about 127.5. So you need 128-bits of storage.

Essentially, the correct solution is to sort both lists, and then do a single pass through checking for unique items in one list. Once you have this solution, you can do some micro-optimizations: - note that the set of symbols is limited, so you can use a counting sort (or other non comparison based "sort") - note that the sorting function doesn't need to be an actual sorting algorithm, and may discard duplicates.

If you apply these two optimizations, you should end up with the bit-field approach (or maybe a slightly more memory hungry one that doesn't store stuff in individual bits... but algorithmically it doesn't bring anything new to the table).

For some reason, a minority of people are then thinking that a further optimization is to use a more inefficient data structure in the sorting algorithm for storing whether or not a letter has been seen. This has the effect of a more expensive read operation, write operation, more memory usage and provides no other benefits.

If you're marking used letters in a mask, why sort first?