Hacker News new | ask | show | jobs
by logicallee 3361 days ago
>worst case: ~ 0.024 bytes per cycle (every scheduler, prefetch, already open column mostly defied)

>That's about 40 megabytes per second

Wow, that is an explosive conclusion. It's very hard for me to come to terms with. 40 MB per second is the sustained read of a spinning platter hard drive

http://hdd.userbenchmark.com/ (click any line)†

(Did I say 40? I meant 160 MB/sec...) So more like 1/4 of the sequential read speed off of a spinning platter of rust. Ten seconds is insane.

I don't care how many times you're bouncing back and forth and invalidating caches and pipelines and prefetches and schedulers, you simply shouldn't be able to ruin things that badly. It is off from what I would expect by (easily) an order of magnitude.

I know you say that the main loop is very light - but aren't there other aspects to your build system and operating system that might be affecting this test? To say something very obvious, couldn't the Operating System scheduler not be giving your process the appropriate number of cycles? There is a lot more that I could say in this direction but let's just do something simpler:

-> Could you try your experiment without an operating system?

For example here are some people who booted a raspberry pi without an operating system -

https://www.google.com/search?q=chess+without+an+operating+s...

Perhaps before going that far you could simply boot into a Linux image that was simply not compiled with any hardware support to do anything. (After all you really don't need to do anything except return to the shell.) Or simply see what happens if you boot into Linux and try it.

If you get an instantly different result simply booting Linux on the same hardware then you instantly have an explosive blog post: "summing 100k 4-byte integers randomly takes 10 seconds on Mac OS X but only 1 second under Linux".

I realize there is a HUGE difference (HUGE) between sequential and random. But I just wanted to get across how insanely slow 40 MB/second straight to RAM is. That should not be possible, no matter how much you defy caches and scheduling and so forth, unless you get the Mac to swap pages out of RAM onto an SSD or something! So not using an operating system would really help here.

Could you try it? I'm not saying I don't believe you but - wow, that is insane.

† I just noticed you wrote "Macbook air 2011". If you want to look at 2011 hard drive speeds, a quick glance still sees some quoting 140 MB/sec so it still seems correct to me, but I just quoted 2017 figures.

3 comments

> Wow, that is an explosive conclusion. It's very hard for me to come to terms with. 40 MB per second is the sustained read of a spinning platter hard drive

Parent is talking about random access. So compare with random access to spinning rust :)

40MB/s for random RAM access is totally reasonable. Dynamic RAM (DRAM), the kind of RAM used in computers nowadays, is organized and accessed in "rows" of few kB. If you read random addresses, chances are good that almost every read will miss all CPU caches and hit a DRAM row other than any currently opened row (there is maybe a few dozen rows out of millions opened at any time, depending on the number and internal organization of RAM modules). Opening and closing a new row takes tRP+tRAS which is 13+35ns on some random DDR3 RAM I have laying here. This is 20M individual accesses per second.

https://en.wikipedia.org/wiki/Dynamic_RAM

What do you mean by "tRP + tRAS"?

I now understand how it's reasonable, as in, correct. But I don't understand the fundamental reason for this. Okay, so every time a row is read, if it's not in cache it'll get cached. But why does it have to be that way?

Couldn't there be a mode, "hey don't fully open these rows, I just one want one random byte as fast as possible!"

I compared it with spinning disks just to show how unreasonable the total is. I realize that the whole design isn't built around this idea of picking off a byte at a time.

But don't you think there could be applications that have PRECISELY, exactly this usage pattern?

For example, what percent of your neurons are firing at the moment? Very, very low.

For some future applications, getitng a 10x speedup in random memory reads of single bytes might totally increase that application by a lot. Even if desktops aren't built this way today, I'm super-surprised that when the whole system isn't doing anything else, there is no way to get that kind of raw access without asking for whole rows at a time.

> Couldn't there be a mode, "hey don't fully open these rows, I just one want one random byte as fast as possible!"

As fast as possible is exactly tRP+tRAS. Since the whole row is read in parallel to RAM's internal SRAM buffer, opening only part of it would make no difference.

> What do you mean by "tRP + tRAS"?

Ever heard of RAM timings? I'm afraid at some point you will have to read how DRAM works to understand more. There was a link in my last post.

It's this way because in the 80s/90s computer architects simulated different kinds of CPU/memory system designs running existing C programs and measured that it's best to focus on caching and compromise on main memory random access. Then CPU vendors made such systems and they outsold/performed cacheless systems. And after that memory module standardization kept the direction, because memory cost per byte was more in demand than random access performance.

Yes, there are of course workloads that don't like that. But programs adapt to hardware over time too, so co-evolution has weeded out these access patterns from high-performance programs that can be structured differently.

You could make a computer that uses DRAM differently, but it would be expensive because you couldn't use mass market memory modules.

(Exception: some CPUs use in-package fast DRAM as last level cache).

There have been some custom hardware supercomputer designs (Tera MTA line) that were optimized for cache hostile workloads.

Super, super dumb question here, but now that memory is approaching an incredibly low $/byte ratio, would it be possible to use a large portion of the available memory to "index" the location of bytes to improve access time?

I'm fairly certain it's impossible to create a full and accurate index while still having a useful amount of RAM left over but could you perform some kind of extreme compression, even hashing each "row"? That way a CPU could eliminate X number of rows in its search due to the likelihood that the hash couldn't be generated if it contained that byte of data. I'm obviously a layman but I think that if statistical branch prediction works, there must be a way to use excessive amounts of memory to make random access into a predictive process.

Not sure what you mean. You always know which row contains any given address, the problem is that "seeking" in DRAM takes tens of nanoseconds (for contemporary chips), regardless of DRAM's clock speed or DDR/GDDR/LPDDR 1/2/3/4/5/6/7. Seeking in your "index" would take time too.

The only way to get good performance from DRAM is to always write data sequentially in the same order they will be read. Then you get full sequential throughput both for writing and reading and this is many GB/s and keeps increasing with clock speed. But that's a software optimization.

Otherwise caching is beneficial, but the cache has to be SRAM to have low random access latency. SRAM is physically larger and power hungry, half of a modern CPU is cache and it's still only a few MB.

CAS latency. About 10 cycles for an access outside current row. That makes the RAM work at best around effective 200 MHz if you are latency bound.
Hi! Memory is definitely an issue today, especially with big fat server processors...

Here is the small program I wrote, first allocates an array, then writes a random placed linked list that touches all the array cells (e.g. start->|4|6|2|5|3|end , so it goes to cell 1, then 4, then 5, then 3,2,6 and done.

Then it reads the same array from start to finish, just to sum contents. This is fast and goes to 6-8 GB/s, depending on unrelated memory pressure from video and os tasks. This is the advertised speed. So I don't think it's a swap issue (and my SSD is way faster :-) I've also tried with smaller and bigger arrays. Just ensure it stays in real memory.

Moving around in a big array is very hard on the memory controller: - no prefetch possible; - row and column changes at every access (almost), so commands have to be issued to burst terminate, close row, close bank, precharge maybe, open bank, activate row, fetch address, and who knows; - no advantage from interleave; - no advantage from a fat 64 or 128 bit bus;

It is the combined time needed for each and every memory read (due to randomization) that cause performance to drop.

This is the code I've been using. I've modified it to work on unices (OSX has an issue with timeval struct), but have not tested on them. It compiles, may need to be altered a bit. Launch it with ./a.out <array_size> <random seed>. You will note that, as soon as the array moves out of caches, hell breaks loose. Works with gcc or clang.

https://godbolt.org/g/qX39tL

Thank you. This is very good and thank you for sharing. You should probably copy the usage information from your comment here to the top of the file, so they're not separated.

I tried it and got the same result you suggested (I used a code running site but it shouldn't matter, the VM it runs in should have virtualized operations that shouldn't slow it down by a complete order of magnitude so I roughly trust the results I'm seeing.)

FYI, here is the output with 10m items. (It times out if I try to increase it further, but that is just because the code-running site kills programs that run longer than a set time)

  ht_size: 10000000
  sizeof(int): 4
  seed: 256
  start run
  roll 64: -2004260032
  time (seconds:+nanoseconds) 1s:270464 ns
  sequential run: roll 64: -2004260032
  time (seconds:+nanoseconds) 0s:281244 ns
so if that is 1.27 sec for 10mill it should obviously be around 12 seconds for 100 million. Just brutal. I had no idea things could be so bad.

Thanks so much for sharing your program with us!

Your parent poster doesn't provide details, but this is not really crazy.

From https://gist.github.com/jboner/2841832, main memory latency is 100 ns; so we get 10 million fetches per second. You could get 40 MB by e.g. assuming that you have 4 one-byte requests in flight at the same time.

As they say, RAM is the new disk.

Hi, I've posted a few details and a sample program over here. As you correctly notice, latency is a big issue on modern machines, way worse (in proportion) than it was in the late eighties.

But as soon as the wheels start turning, they spit out a lot of bytes, for sure!