by saynsedit 3505 days ago
Nah, the correct solution is simply to use memcpy(). It works on all compilers, all platforms, and all versions, with SSE and with any flags specified:

  #include <stdlib.h>
  #include <stdint.h>

  uint64_t sum (char *p, size_t nwords)
  {
      uint64_t res = 0;
      size_t i;
      for (i = 0; i < nwords; i += 8) {
        uint64_t tmp;
        memcpy(&tmp, &p[i], sizeof(tmp));
        res += tmp;
      }
      return res;
  }

Nitpick: memcpy is declared in string.h, not stdlib.h; the type was uint32_t, not uint64_t; and you are making some unwarranted assumptions about sizeof(uint64_t) (the loop hard-codes a stride of 8), not to mention that the existence of this type is merely implementation-defined ;)
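For reference, a minimal sketch of the memcpy approach with those nitpicks applied. The name sum_u32, the unsigned char pointer, and the 64-bit accumulator are my own assumptions for illustration, not something from the post above; nwords is taken to mean a count of 32-bit words.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>   /* memcpy is declared here, not in stdlib.h */

/* Hypothetical helper: sum nwords 32-bit words starting at p.
   p may have any alignment, since every load goes through memcpy. */
uint64_t sum_u32(const unsigned char *p, size_t nwords)
{
    uint64_t res = 0;   /* assumed: wide accumulator so the sum does not wrap */
    size_t i;
    for (i = 0; i < nwords; i++) {
        uint32_t tmp;
        /* Step by sizeof(tmp) bytes instead of a hard-coded constant. */
        memcpy(&tmp, p + i * sizeof(tmp), sizeof(tmp));
        res += tmp;
    }
    return res;
}
```

Whether the compiler turns each memcpy into a single unaligned load still depends on the compiler and flags, which is the point under dispute in this thread.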

Deal breaker: your memcpy invocation requires a sufficiently smart compiler to convert it into a normal unaligned load on x86, and it seems to prevent GCC autovectorization. In this case OP actually didn't want vectorization, but in general such workarounds tend to confuse compilers and produce worse code.

I'm not sure I understand your deal breaker. For the platform he was targeting it produces optimal code; on other platforms it's merely slower (and not slower because of memcpy in particular, since a compiler that can't handle this is likely not a great optimizer across the board).

Vectorization is generally not applicable here anyway, since it usually requires aligned memory (not all implementations do, but most). In any case, benchmarking is more appropriate than armchair optimizing.

You are writing convoluted code and hoping that your compiler will figure it out and convert it internally to the form I posted. Sometimes it does, sometimes it doesn't. In this case it generates reasonable code but doesn't vectorize it for some reason. WTF.

I prefer to just add an alignment specification and move on, assuming I don't care about portability. If portability matters, reread my original post ;)
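To make the "alignment specification" idea concrete, here is a rough sketch, assuming it means something like the GCC/Clang aligned attribute on a typedef. The names unaligned_u32 and sum_attr are hypothetical, and the attributes are compiler extensions, so this is not portable C.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension (assumption: this is the kind of annotation meant).
   aligned(1) tells the compiler the value may sit at any address;
   may_alias lets it be read from a buffer of another type. */
typedef uint32_t unaligned_u32 __attribute__((aligned(1), may_alias));

/* Hypothetical helper: sum nwords 32-bit words at an arbitrary address. */
uint64_t sum_attr(const void *p, size_t nwords)
{
    const unaligned_u32 *w = (const unaligned_u32 *)p;
    uint64_t res = 0;
    size_t i;
    for (i = 0; i < nwords; i++)
        res += w[i];   /* compiler emits loads that tolerate misalignment */
    return res;
}
```

The trade-off is the one argued here: the loop reads like ordinary array code, but it only builds on compilers that understand these attributes.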

It's not convoluted. It's actually clear and well-defined, which makes it easier to reason about.

I'd call compiler-specific alignment attributes more arcane, convoluted, and susceptible to future bugs.

Vectorization isn't a panacea. You need to benchmark to be sure; lacking that, I expect GCC to be better at optimizing code than you. If you disagree, please manually write a vectorized version that handles non-aligned addition and post your results :)
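For anyone who wants to try that experiment, a starting point might look like the sketch below. It is unbenchmarked and makes several assumptions of mine: x86 with SSE2, 32-bit words, a sum that wraps modulo 2^32, and the hypothetical name sum_u32_vec. On other targets it falls back to the plain memcpy loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: wrapping 32-bit sum of nwords words at any alignment. */
uint32_t sum_u32_vec(const unsigned char *p, size_t nwords)
{
    uint32_t res = 0;
    size_t i = 0;
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    uint32_t lanes[4];
    for (; i + 4 <= nwords; i += 4) {
        /* _mm_loadu_si128 is the unaligned 128-bit load. */
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i * 4));
        acc = _mm_add_epi32(acc, v);   /* four lane-wise 32-bit adds */
    }
    _mm_storeu_si128((__m128i *)lanes, acc);
    res = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    /* Scalar tail (or the whole loop when SSE2 is unavailable). */
    for (; i < nwords; i++) {
        uint32_t tmp;
        memcpy(&tmp, p + i * sizeof(tmp), sizeof(tmp));
        res += tmp;
    }
    return res;
}
```

Timing this against the compiler's output for the memcpy loop, on the actual target and flags, is the only way to settle which one wins.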