Hacker News new | ask | show | jobs
by TinkersW 695 days ago
Simplest solution and the one I use is all SIMD related buffers use a custom allocator(actually everything uses it) and it always rounds the allocation size up to the SIMD width.

Masked loads kinda suck, they are a tiny bit slower and you now need a mask and you need to compute the mask..

1 comments

This is what I do too (in my case I don't round up the allocation size and just let loads & stores potentially see the next object (doing tail stores via load+blend+store where needed; only works if multithreaded heap mutation isn't required though)).

The one case it can be annoying is passing pointers to constant data to custom-heap-assuming functions - e.g. to get a pointer to [n,n-1,n-2,...,2,1,0] for, say, any n≤64, make a global of [64,63,...,2,1,0] and offset its pointer; but you end up needing to add padding to the global, and this materializes as avoidable binary size increase as the "padding" could just be other constants from anywhere else. Copying the constant to the custom heap would be extra startup time and more memory usage (not sharable between processes).