| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by kiru_io 370 days ago
	Can you elaborate or show a better way?

5 comments

fedsocpuppet 370 days ago

   julia> using Random, BenchmarkTools
   julia> function isvowel(c)
            idx = (c | 0x20) - Int('a')
            return (0x00104111 & (1 << idx)) != 0
         end

   julia> hasvowel(str) = any(c -> isvowel(Int(c)), str)

   julia> @btime hasvowel(s) setup=(s=randstring("bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ0123456789", 10))
   15.739 ns (0 allocations: 0 bytes)
   false

some 77 thousand times faster

link

camel-cdr 370 days ago

The usual way to do this in SIMD is via vpermb(lut, str) == str, or potentially slightly faster vpshufb(str>>1, lut) == str.

str|0x32 if you want case-insensitive.

This works because the lowest five bits, more specifically bits two to five, of the vowels are distinct.

The lut is zeroed except for the indices corresponding to the vowels, which contain it self: lut[vowel&31]=vowel or lut[(vowel&31)>>1]=vowel

link

teo_zero 369 days ago

> The lut is zeroed except for the indices corresponding to the vowels, which contain it self

So in the comparison vs. the original str, vowels get a true and consonants get a false. Doesn't the "==" then returns true if all the chars are vowels?

link

camel-cdr 369 days ago

I used the C operators as short cuts to denote SIMD operations. So the == would be element wise and return a mask of the vowels in the input vector.

link

ninkendo 370 days ago

I was curious if there were any interesting bit-masking patterns in ASCII for vowels that could be exploited, so I shamelessly asked ChatGPT and it gave a pretty nice insight:

- Setting bit 5 high forces lowercase for the input letter

- masking with `& 31` gives an index from 0-25 for the input letter

Then you can the `bt` instruction (in x86_64) to bit-test against the set of bits for a,e,i,o,u (after lowercasing) and return whether it matches, in a single instruction.

It came up with this, which I thought was pretty nice: https://godbolt.org/z/KjMdz99be

I'm sure there's other cool ways to test multiple vowels at once using AVX2 or AVX-512, I didn't really get that far. I just thought the bit-test trick was pretty sweet.

Chat transcript is here (it failed pretty spectacularly the first couple times, tripping over AT&T syntax and getting an off-by-one error, but still pretty good) https://chatgpt.com/share/684c8b39-a9c4-8012-8bb6-74e1f8b6d0...

link

mystified5016 370 days ago

See my other comment.

You can do extremely trivial binary checks on ASCII, UTF-8 (and most? other) encoding schemes. All vowels including W and Y contain a 1 in the lowest bit. Then by comparing bits 1-4, you can do a trivial case-insensitive comparison. You detect case by checking bit 5. 0 is upper, 1 is lower.

link

rurban 369 days ago

How does that fly with foreign scripts? It certainly won't work with Greek, Arabic, Korean, Chinese, Japanese and the Indian languages.

link

IAmBroom 367 days ago

Again, this isn't really about detecting vowels; it is exactly about detecting upper- and lower-case examples of the characters a, e, i, o, and u.

Language is not an issue; ø and é are not considered "vowels" by this article (nor is y).

link

rurban 367 days ago

Sure, there is only ASCII, and nobody needs more. ø and é are not considered "vowels", ha!

Rust, Go, C?

Or even Lua if you don't want to go to a statically compiled typed language?

link

yongjik 370 days ago

TBH any mainstream language (and most non-mainstream ones) will be better than Python at this. Python can be pretty useful, but it's not known for having reasonable performance.

link