by fulldecent2 2013 days ago
Thanks for your interest. If you have any advice on other instructions or M1 optimizations, I'd love to hear it.

My first thought is to synchronize the effort of the 8 CPU cores and maybe even the 8 GPU cores. That's +12 dB right there (16 units instead of 1, and 10·log10(16) ≈ 12 dB). We already have a multithreaded implementation in the project.
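
To make "synchronize effort" concrete, here is a rough sketch of releasing several CPU threads at the same instant with a shared start flag. It is not the project's existing multithreaded code. The names `run_synchronized`, `worker_t`, `MAX_WORKERS` and the store loop are all made up for illustration, and it only covers CPU threads, not the GPU.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_WORKERS 16  /* hypothetical cap, not taken from the project */

/* Shared start flag: every worker spins on it so they all begin together. */
static atomic_int start_flag;

typedef struct {
    volatile uint64_t sink;  /* target of the stores (placeholder payload) */
    long stores_wanted;      /* how many stores this worker should issue */
    long stores_done;        /* how many stores it actually issued */
} worker_t;

static void *worker_main(void *arg) {
    worker_t *w = (worker_t *)arg;
    /* Busy-wait until the main thread releases all workers at once. */
    while (!atomic_load_explicit(&start_flag, memory_order_acquire)) {
        /* spin */
    }
    for (long i = 0; i < w->stores_wanted; i++) {
        w->sink = (uint64_t)i;  /* volatile store, not optimized away */
        w->stores_done++;
    }
    return NULL;
}

/* Starts n workers, releases them together, and returns the total number
   of stores issued, or -1 if a thread could not be created. */
long run_synchronized(int n, long stores_per_worker) {
    assert(n > 0 && n <= MAX_WORKERS);
    pthread_t tid[MAX_WORKERS];
    worker_t w[MAX_WORKERS];
    int created = 0;
    long total = 0;

    atomic_store(&start_flag, 0);
    for (; created < n; created++) {
        w[created].sink = 0;
        w[created].stores_wanted = stores_per_worker;
        w[created].stores_done = 0;
        if (pthread_create(&tid[created], NULL, worker_main, &w[created]) != 0)
            break;
    }
    int ok = (created == n);

    /* Release the workers (also on failure, so the ones that started can exit). */
    atomic_store_explicit(&start_flag, 1, memory_order_release);
    for (int i = 0; i < created; i++) {
        pthread_join(tid[i], NULL);
        total += w[i].stores_done;
    }
    return ok ? total : -1;
}
```

Calling `run_synchronized(8, 1000000)` would start eight workers and let them go together. A spin flag is used instead of `pthread_barrier_t` because, as far as I know, macOS does not ship pthread barriers. Keeping the threads in phase after the start would still need a shared clock or schedule, which this sketch does not attempt.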

1 comment

No, sorry, I don't know anything about the M1.

If you want to achieve something similar to _mm_stream_xxx, i.e. bypassing the caches and causing bursts of DRAM traffic, try creating uncached or write-combined memory mappings and writing to them. I don't know how to do this from user space. You could try creating memory-mapped buffers with OpenGL or Metal; with the right arguments you might get an uncached mapping.

Another option is looking at ARM instructions for memory barriers and cache flushes. ARM's selection of instructions for dealing with caches is much richer than x86's.
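
For the barrier and cache flush route, a minimal C sketch might look like the one below. It uses the AArch64 `dmb ish` and `dsb ish` barriers and the `dc civac` clean-and-invalidate instruction. The big assumptions: `dc civac` is only enabled here on Linux, where I believe the kernel allows it from user space; I don't know whether macOS on the M1 permits it, so other AArch64 systems fall through to a no-op. The function names and the `line` parameter are made up, and the caller has to supply the real cache line size (64 bytes is common, the M1 reportedly uses 128).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <emmintrin.h>  /* _mm_clflush */
#endif

/* Orders earlier memory accesses before later ones. */
static inline void ordering_barrier(void) {
#if defined(__aarch64__)
    __asm__ volatile("dmb ish" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Waits for earlier accesses and cache maintenance to complete. */
static inline void completion_barrier(void) {
#if defined(__aarch64__)
    __asm__ volatile("dsb ish" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Writes back and invalidates the cache line containing p. */
static inline void flush_line(const void *p) {
#if defined(__aarch64__) && defined(__linux__)
    /* Assumption: the OS permits DC CIVAC from user space. */
    __asm__ volatile("dc civac, %0" : : "r"(p) : "memory");
#elif defined(__x86_64__)
    _mm_clflush(p);
#else
    (void)p;  /* no user-space flush known here: does nothing */
#endif
}

/* Flushes every cache line overlapping [buf, buf + len) and returns the
   number of lines touched. `line` must be a power of two. */
size_t flush_range(const void *buf, size_t len, size_t line) {
    assert(line != 0 && (line & (line - 1)) == 0);
    if (len == 0)
        return 0;
    uintptr_t p = (uintptr_t)buf & ~(uintptr_t)(line - 1);  /* align down */
    uintptr_t end = (uintptr_t)buf + len;
    size_t lines = 0;

    ordering_barrier();  /* order earlier stores before the maintenance ops */
    for (; p < end; p += line) {
        flush_line((const void *)p);
        lines++;
    }
    completion_barrier();  /* wait until the flushes have finished */
    return lines;
}
```

Writing to a buffer and then calling `flush_range(buf, len, 64)` should push the dirty lines out to memory, so the next write pass has to go back to DRAM. Whether that produces traffic comparable to `_mm_stream_xxx` on x86 is something you would have to measure.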