|
vpshufb only accesses bytes within the same 16-byte lane and has latency 1. Instructions that shuffle bytes across lanes tend to have higher latency. So at the 16th position, a shuffle value of 0 (or 16, or 32...) gets you the first byte (byte 0) from the other argument, but at the 17th position, a shuffle value of 0 instead gets you the 17th byte (byte 16, or byte 0 of the corresponding 16-byte lane, if you like) from the other argument. This isn't really desirable behavior. Rather, it's somewhat awkward behavior that one has to work with to avoid doing a sequence of more expensive operations. I am concerned that it takes like 6 instructions (2 to put copies of each lane into all lanes, 2 shuffles, cmpgt, blend) including a load and using an extra register to implement rearrange for use cases that would be fine with vpshufb for 32-byte registers and maybe 14 instructions (4 to put copies of each lane into all lanes, 4 shuffles, 3 compares, 3 blends) with 3 loads and 3 extra registers to implement rearrange for 64-byte registers. Maybe you can do better for the latter case though, I haven't had a thorough try at it. A common use of vpshufb is to get a mask of character positions from a string matching a small set of characters, which you can do with cmpeq(in_slice, vpshufb(c, in_slice)), where the input substring in_slice is used as the shuffle and c is chosen so that input characters in the set of characters of interest (e.g. {'\n' '\r' '\t' '\"' '\\'}) map to themselves and other characters map to something other than themselves. rearrange doesn't seem to define what happens when indices in the shuffle are negative or too large, so it seems like to use rearrange we need to mask off some bits and do cmpeq(in_slice, rearrange(c, and(d, in_slice))) where the `and` is only there because of the nit about invalid indices and `rearrange` is 1 or 6 or 14 instructions depending on what register width we're working with. That could be unfortunate. You don't need this stuff for sorting primitives though! |
It's a pity because you clearly have a lot of knowledge that you could put to use to provide very helpful feedback to implementors of the Vector API. (I'm doing this for the GraalVM compiler.)