|
|
|
|
|
by A04eArchitect
97 days ago
|
|
This is a great deep dive into SIMD. I've been experimenting with similar constraints but on even more restrictive hardware. Managed to achieve sub-85ns cycles for 10.8T dataset audits on a budget 3GB RAM ARM chip (A04e) by combining custom zero-copy logic with strict memory mapping. The trick was bypassing the standard allocator entirely to keep the L1 cache hot. Does your SIMD approach account for the memory controller bottleneck on lower-end ARM v8 cores, or is it mostly tuned for x86/high-end silicon? |
|