| >worst case: ~ 0.024 bytes per cycle (every scheduler, prefetch, already open column mostly defied) >That's about 40 megabytes per second Wow, that is an explosive conclusion. It's very hard for me to come to terms with. 40 MB per second is the sustained read of a spinning platter hard drive http://hdd.userbenchmark.com/ (click any line)† (Did I say 40? I meant 160 MB/sec...) So more like 1/4 of the sequential read speed off of a spinning platter of rust. Ten seconds is insane. I don't care how many times you're bouncing back and forth and invalidating caches and pipelines and prefetches and schedulers, you simply shouldn't be able to ruin things that badly. It is off from what I would expect by (easily) an order of magnitude. I know you say that the main loop is very light - but aren't there other aspects to your build system and operating system that might be affecting this test? To say something very obvious, couldn't the Operating System scheduler not be giving your process the appropriate number of cycles? There is a lot more that I could say in this direction but let's just do something simpler: -> Could you try your experiment without an operating system? For example here are some people who booted a raspberry pi without an operating system - https://www.google.com/search?q=chess+without+an+operating+s... Perhaps before going that far you could simply boot into a Linux image that was simply not compiled with any hardware support to do anything. (After all you really don't need to do anything except return to the shell.) Or simply see what happens if you boot into Linux and try it. If you get an instantly different result simply booting Linux on the same hardware then you instantly have an explosive blog post: "summing 100k 4-byte integers randomly takes 10 seconds on Mac OS X but only 1 second under Linux". I realize there is a HUGE difference (HUGE) between sequential and random. 
But I just wanted to get across how insanely slow 40 MB/second straight to RAM is. That should not be possible, no matter how much you defy caches and scheduling and so forth, unless you get the Mac to swap pages out of RAM onto an SSD or something! So not using an operating system would really help here. Could you try it? I'm not saying I don't believe you but - wow, that is insane. † I just noticed you wrote "Macbook air 2011". If you want to look at 2011 hard drive speeds, a quick glance still sees some quoting 140 MB/sec so it still seems correct to me, but I just quoted 2017 figures. |
Parent is talking about random access. So compare with random access to spinning rust :)
40MB/s for random RAM access is totally reasonable. Dynamic RAM (DRAM), the kind of RAM used in computers nowadays, is organized and accessed in "rows" of a few kB. If you read random addresses, chances are good that almost every read will miss all CPU caches and hit a DRAM row other than any currently open row (there are maybe a few dozen rows out of millions open at any time, depending on the number and internal organization of RAM modules). Opening and closing a row takes tRP+tRAS, which is 13+35 ns on some random DDR3 RAM I have lying here. That is 48 ns per access, or about 20M individual accesses per second.
https://en.wikipedia.org/wiki/Dynamic_RAM
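To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch in Python. The function names are made up for illustration. The 4 bytes per access is an assumption taken from the "4-byte integers" mentioned upthread. It also assumes every access pays the full tRP+tRAS row cycle and ignores any overlap between banks or channels, so treat the result as a ballpark rather than a measurement.

```python
# Back-of-the-envelope estimate of random-access DRAM throughput.
# Assumption: every access misses all caches and hits a closed row,
# so it pays the full row cycle (tRP + tRAS), with no bank parallelism.

def random_accesses_per_second(t_rp_ns: float, t_ras_ns: float) -> float:
    """Accesses per second if each one costs a full row cycle."""
    row_cycle_ns = t_rp_ns + t_ras_ns
    return 1e9 / row_cycle_ns

def useful_bandwidth_mb_s(accesses_per_s: float, bytes_per_access: int) -> float:
    """Useful payload bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return accesses_per_s * bytes_per_access / 1e6

if __name__ == "__main__":
    # Timings quoted above for one particular DDR3 module.
    rate = random_accesses_per_second(t_rp_ns=13, t_ras_ns=35)
    # Hypothetical payload: one 4-byte integer per access.
    bandwidth = useful_bandwidth_mb_s(rate, bytes_per_access=4)
    print(f"{rate / 1e6:.1f}M accesses/s, {bandwidth:.1f} MB/s of useful data")
```

With those timings the estimate comes out at roughly 20.8M accesses per second, or about 83 MB/s of useful 4-byte payload. That is within a factor of about two of the 40 MB/s being discussed, so the measured figure is in the expected range once any extra per-access overhead is counted.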