by A04eArchitect 97 days ago
This is a great deep dive into SIMD. I've been experimenting with similar constraints on even more restrictive hardware. I managed to get sub-85ns cycles for 10.8T dataset audits on a budget ARM chip (A04e) with 3GB of RAM by combining custom zero-copy logic with strict memory mapping. The trick was bypassing the standard allocator entirely to keep the L1 cache hot. Does your SIMD approach account for the memory controller bottleneck on lower-end ARMv8 cores, or is it mostly tuned for x86 and high-end silicon?
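
For anyone unfamiliar with the pattern, here is a rough sketch of zero-copy scanning over a memory-mapped file. It is not the commenter's code, which isn't shown in the thread. The function name `audit_mapped`, the additive checksum, and the 4096-byte chunk size are all made up for illustration. It only shows the mapping idea (the OS pages the file in and the loop reads through views rather than copies) and says nothing about the cache or timing claims above.

```python
import mmap
import os


def audit_mapped(path: str, chunk: int = 4096) -> int:
    """Hypothetical audit: additive checksum of a file, read via a mapping."""
    total = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return 0  # mmap cannot map an empty file
        # Map the whole file read-only; no read() into a heap buffer.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            with memoryview(mm) as view:
                for off in range(0, size, chunk):
                    # Slicing a memoryview gives another view, not a copy.
                    with view[off:off + chunk] as part:
                        total += sum(part)
    # Keep the result in 64 bits, like a fixed-width accumulator would.
    return total & 0xFFFFFFFFFFFFFFFF
```

Calling `audit_mapped("data.bin")` on any non-empty file returns the sum of its bytes modulo 2^64. A native version would use `mmap(2)` directly and could vectorize the inner loop, but the structure would be similar.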