by brrrrrm 1514 days ago
What does the generated assembly look like? I suspect clang isn't smart enough to keep things in registers. The latency of VPCOMPRESSB seems quite high (at least according to the table at https://uops.info/table.html), so you'll probably want to induce a bit more pipelining by manually unrolling with the register-destination variant (compress into a register, then store) instead of the compress-store form.

I don't have an AVX512 machine with VBMI2, but here's what my untested code might look like:

  __m512i spaces = _mm512_set1_epi8(' ');
  size_t i = 0;
  size_t pos = 0; // write offset of the compacted output (skip if already declared)
  for (; i + (64 * 4 - 1) < howmany; i += 64 * 4) {
    // 4 input regs, 4 output regs; you can go wider, but of the 8 mask registers (k0-k7) k0 can't be used as a write mask, so 7 is the practical limit
    __m512i in0 = _mm512_loadu_si512(bytes + i);
    __m512i in1 = _mm512_loadu_si512(bytes + i + 64);
    __m512i in2 = _mm512_loadu_si512(bytes + i + 128);
    __m512i in3 = _mm512_loadu_si512(bytes + i + 192);

    __mmask64 mask0 = _mm512_cmpgt_epi8_mask (in0, spaces);
    __mmask64 mask1 = _mm512_cmpgt_epi8_mask (in1, spaces);
    __mmask64 mask2 = _mm512_cmpgt_epi8_mask (in2, spaces);
    __mmask64 mask3 = _mm512_cmpgt_epi8_mask (in3, spaces);

    // compress each block with its own mask (register destination, no store yet)
    __m512i reg0 = _mm512_maskz_compress_epi8 (mask0, in0);
    __m512i reg1 = _mm512_maskz_compress_epi8 (mask1, in1);
    __m512i reg2 = _mm512_maskz_compress_epi8 (mask2, in2);
    __m512i reg3 = _mm512_maskz_compress_epi8 (mask3, in3);

    _mm512_storeu_si512(bytes + pos, reg0);
    pos += _popcnt64(mask0);
    _mm512_storeu_si512(bytes + pos, reg1);
    pos += _popcnt64(mask1);
    _mm512_storeu_si512(bytes + pos, reg2);
    pos += _popcnt64(mask2);
    _mm512_storeu_si512(bytes + pos, reg3);
    pos += _popcnt64(mask3);
  }
  // old code can go here, since it handles a smaller size well

You can probably do better by chunking up the input and using temporary memory (coalesced at the end).
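
To make the chunking idea concrete, here's a rough sketch of what I mean. Everything in it is my own assumption: the names `compress_chunk` and `despace_chunked` are made up, the 4096-byte default chunk size is arbitrary, and the per-chunk kernel is a plain scalar stand-in for the AVX-512 loop so it runs anywhere. The point is only the structure: each chunk writes to its own slice of scratch memory, so no chunk waits on the previous chunk's `pos`, and a cheap copy pass glues the results together.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Scalar stand-in for the per-chunk kernel (hypothetical helper). Keeps bytes
// that compare greater than ' ' as SIGNED values, mirroring what
// _mm512_cmpgt_epi8_mask does. dst must have room for n bytes.
static size_t compress_chunk(const uint8_t* src, size_t n, uint8_t* dst) {
  size_t out = 0;
  for (size_t k = 0; k < n; k++) {
    dst[out] = src[k];                                // always write...
    out += (static_cast<int8_t>(src[k]) > ' ');       // ...advance only if kept
  }
  return out;
}

// Hypothetical driver: compress each chunk into its own slice of a temporary
// buffer, then coalesce the slices back into `bytes`. Returns the new length.
size_t despace_chunked(uint8_t* bytes, size_t howmany, size_t chunk_size = 4096) {
  assert(chunk_size > 0);
  if (howmany == 0) return 0;
  const size_t nchunks = (howmany + chunk_size - 1) / chunk_size;
  std::vector<uint8_t> tmp(howmany);   // scratch: chunk c owns tmp[c * chunk_size ...]
  std::vector<size_t> kept(nchunks);   // number of bytes each chunk kept

  // Phase 1: chunks are independent of each other (no shared output offset).
  for (size_t c = 0; c < nchunks; c++) {
    const size_t begin = c * chunk_size;
    const size_t len = std::min(chunk_size, howmany - begin);
    kept[c] = compress_chunk(bytes + begin, len, tmp.data() + begin);
  }

  // Phase 2: coalesce the per-chunk results into one contiguous run.
  size_t pos = 0;
  for (size_t c = 0; c < nchunks; c++) {
    std::memcpy(bytes + pos, tmp.data() + c * chunk_size, kept[c]);
    pos += kept[c];
  }
  return pos;
}

// Convenience wrapper for trying it on a std::string.
std::string despace_copy(std::string s, size_t chunk_size = 4096) {
  const size_t n = despace_chunked(reinterpret_cast<uint8_t*>(&s[0]), s.size(), chunk_size);
  s.resize(n);
  return s;
}
```

Whether the extra copy pays for itself is something you'd have to measure; the scratch buffer doubles the memory traffic, so I'd only expect a win if the serial dependency on `pos` is really the bottleneck.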