by zbjornson 1764 days ago
You could pad the input as you said, which avoids a "tail loop," but otherwise you usually do a partial load (load <8 elements into a vector) and store. Some instruction set extensions provide "masked load/store" instructions for this, but there are ways to do it without those too.
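
For the masked route, here's a rough sketch using the AVX intrinsics _mm256_maskload_ps and _mm256_maskstore_ps. The function name and the "ys[i] += xs[i]" kernel are made up for illustration, and the target attribute assumes GCC or Clang (otherwise compile with -mavx). As far as I know, masked-off lanes are neither read nor written and don't fault, which is why the tail can sit right at the end of the buffer.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example kernel: ys[i] += xs[i] for i in [0, n). */
__attribute__((target("avx")))   /* GCC/Clang only; assumption for this sketch */
void add_arrays_masked(float *ys, const float *xs, size_t n)
{
    size_t i = 0;

    /* Main loop: full 8-float vectors, unaligned loads and stores. */
    for (; i + 8 <= n; i += 8) {
        __m256 y = _mm256_loadu_ps(&ys[i]);
        __m256 x = _mm256_loadu_ps(&xs[i]);
        _mm256_storeu_ps(&ys[i], _mm256_add_ps(y, x));
    }

    /* Tail: fewer than 8 elements left, handled with one masked vector. */
    size_t rem = n - i;
    if (rem > 0) {
        /* A lane is active when the top bit of its 32-bit mask element is set. */
        int32_t lanes[8];
        for (size_t k = 0; k < 8; k++)
            lanes[k] = (k < rem) ? -1 : 0;
        __m256i mask = _mm256_loadu_si256((const __m256i *)lanes);

        __m256 y = _mm256_maskload_ps(&ys[i], mask);   /* inactive lanes read as 0 */
        __m256 x = _mm256_maskload_ps(&xs[i], mask);
        _mm256_maskstore_ps(&ys[i], mask, _mm256_add_ps(y, x)); /* inactive lanes untouched */
    }
}
```

Building the mask through a small array is just the simplest thing to read; a sliding window over a constant table of -1s and 0s would avoid the loop.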

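To your last question specifically: if a full-width load at &ys[i] runs past the end of the array and into an unmapped page, you'll get a segfault. (A 32-byte-aligned _mm256_load_ps can never straddle a page boundary, so in practice this bites with unaligned loads like _mm256_loadu_ps.) Otherwise the extra lanes just contain whatever happens to be in memory after the array, which can be okay: you could ignore them instead of dealing with a partial load.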
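
If you'd rather never read past the end at all, one of the ways to do it without masked instructions is to bounce the tail through a small padded buffer. A minimal sketch, with a made-up function name and a made-up "multiply by a factor" kernel, again assuming GCC or Clang for the target attribute:

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical example: scale the last `rem` (< 8) floats at ys by `factor`. */
__attribute__((target("avx")))   /* GCC/Clang only; assumption for this sketch */
void scale_tail_buffered(float *ys, size_t rem, float factor)
{
    assert(rem < 8);

    /* Zero-padded scratch vector, so the full-width load stays inside tmp. */
    float tmp[8] = {0};
    memcpy(tmp, ys, rem * sizeof(float));

    __m256 v = _mm256_loadu_ps(tmp);
    v = _mm256_mul_ps(v, _mm256_set1_ps(factor));
    _mm256_storeu_ps(tmp, v);

    /* Copy back only the valid elements; nothing past ys[rem - 1] is touched. */
    memcpy(ys, tmp, rem * sizeof(float));
}
```

This costs two small copies for the tail, but it only runs once per array, so it rarely matters.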