by goosethe 94 days ago

LLVM auto-vectorizes the generated IR extremely well. There are no branches to mispredict, so, in theory, it just blasts through the data.
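To make "no branches" concrete, here is a minimal Python sketch of a branch-free kernel body. It is only an illustration, not Lockstep syntax or its generated IR: the names blend, clamp_to_zero_kernel and run_pipeline are hypothetical, and it assumes finite inputs (the arithmetic blend misbehaves on inf/NaN).

```python
def blend(mask, if_true, if_false):
    """Pick if_true where mask is 1, else if_false, using arithmetic only."""
    return mask * if_true + (1 - mask) * if_false


def clamp_to_zero_kernel(x):
    """Hypothetical kernel: max(x, 0) written as straight-line code."""
    mask = int(x > 0)            # the comparison becomes a 0/1 mask, not a jump
    return blend(mask, x, 0.0)   # both sides are computed, one is kept


def run_pipeline(data):
    # The host supplies the loop over elements; the kernel body never branches.
    return [clamp_to_zero_kernel(x) for x in data]
```

Because every element runs the same instruction sequence, a vectorizer can in principle apply it to a whole group of lanes at once.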

Since it emits standard LLVM IR, LLVM handles the actual instruction set targeting. Right now, in v0.1.0, the compiler hardcodes a SIMD width of 8 (assuming AVX2). However, parameterized SIMD widths are already on the roadmap for v0.4.0. Once that lands, you will be able to pass a --target-width flag to compile down to narrower vector units (like SSE on older CPUs, or ARM NEON) or up to AVX-512.
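For a sense of what the width means, here is a small Python sketch of the lane arithmetic. It assumes 32-bit elements, which is what makes AVX2 come out to a width of 8; the helper names are hypothetical and the chunking is only a model of how data maps onto vector registers.

```python
LANE_BITS = 32  # assumption: 32-bit elements (e.g. f32)

# Vector register widths in bits for the instruction sets mentioned above.
REGISTER_BITS = {"SSE": 128, "NEON": 128, "AVX2": 256, "AVX-512": 512}


def lanes(isa):
    """Number of 32-bit lanes that fit in one vector register."""
    return REGISTER_BITS[isa] // LANE_BITS


def split_into_vectors(data, width):
    """Chunk data into width-sized groups; the last group may be short."""
    return [data[i:i + width] for i in range(0, len(data), width)]
```

Under that assumption, SSE and NEON give 4 lanes, AVX2 gives 8, and AVX-512 gives 16.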

There are strictly no loopholes for loops inside the compute kernels. Inside a shader block, execution is 100% linear. However, the host application calling the pipeline effectively acts as the loop over the data elements. For work that has to combine values across elements, we allow linear accumulators. You consume these with a fold operation, which the compiler lowers into a lock-free parallel reduction tree rather than a traditional for loop.
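The shape of a reduction tree can be sketched in a few lines of Python. This is a generic pairwise reduction, not the compiler's actual lowering: tree_fold is a hypothetical name, and it assumes the operator is associative and has an identity element.

```python
from operator import add


def tree_fold(values, op, identity):
    """Combine values pairwise, level by level, instead of left to right."""
    level = list(values)
    if not level:
        return identity
    while len(level) > 1:
        if len(level) % 2:
            level.append(identity)  # pad odd levels so every element has a partner
        # Each pair is independent of the others, so one level could run in parallel.
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

For example, tree_fold(range(1, 9), add, 0) combines eight values in three levels rather than seven sequential steps. With floating-point addition the result can differ slightly from a sequential sum, since the grouping changes.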

The memory model is a host-owned static arena: your host application allocates a flat, contiguous block of memory and passes the pointer to Lockstep_BindMemory(ptr). Lockstep does all its reads and writes exclusively within that buffer. Because it has no arbitrary pointers, it can't reach outside the arena, which is exactly how we mathematically guarantee the noalias pointer optimizations in LLVM.
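The arena idea can be modeled in Python with a buffer the host owns and offset-only access. This is a sketch under assumptions, not the Lockstep API: the Arena class and its methods are hypothetical, only Lockstep_BindMemory(ptr) comes from the description above, and the runtime bounds check here stands in for whatever the compiler proves statically.

```python
import struct


class Arena:
    """Hypothetical model: all access goes through offsets into one host buffer."""

    def __init__(self, buffer):
        # Analogous to binding the host's pointer; no copy is made.
        self._view = memoryview(buffer)

    def _check(self, offset, size):
        if offset < 0 or offset + size > len(self._view):
            raise IndexError("access outside the bound arena")

    def write_f32(self, offset, value):
        self._check(offset, 4)
        struct.pack_into("<f", self._view, offset, value)

    def read_f32(self, offset):
        self._check(offset, 4)
        return struct.unpack_from("<f", self._view, offset)[0]
```

The host would create the block, for instance bytearray(16), bind it once, and see every write land in its own buffer. With only offsets available and no raw pointers, there is no way to name memory outside the block.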