by qb45 3361 days ago
> Wow, that is an explosive conclusion. It's very hard for me to come to terms with. 40 MB per second is the sustained read of a spinning platter hard drive

Parent is talking about random access. So compare with random access to spinning rust :)

40 MB/s for random RAM access is totally reasonable. Dynamic RAM (DRAM), the kind of RAM used in computers nowadays, is organized and accessed in "rows" of a few kB. If you read random addresses, chances are good that almost every read will miss all CPU caches and hit a DRAM row other than any currently open row (there are maybe a few dozen rows out of millions open at any time, depending on the number and internal organization of RAM modules). Opening and closing a row takes tRP+tRAS, which is 13+35 ns on some random DDR3 RAM I have lying here. That is 48 ns per access, or about 20M individual accesses per second.
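To make the arithmetic above concrete, here is a rough sketch in Python. The two timings are the ones quoted for that one DDR3 module; the function names and the "useful bytes per access" parameter are assumptions added for illustration, since the thread does not say how many bytes each random read actually uses.

```python
# Timings quoted in the comment above (one particular DDR3 module), in ns.
T_RP_NS = 13   # row precharge time
T_RAS_NS = 35  # minimum time a row stays active


def accesses_per_second(t_rp_ns, t_ras_ns):
    """Upper bound on accesses per second if every access costs tRP + tRAS."""
    return 1e9 / (t_rp_ns + t_ras_ns)


def throughput_mb_per_s(useful_bytes_per_access, t_rp_ns=T_RP_NS, t_ras_ns=T_RAS_NS):
    """Useful MB/s when every access lands in a row that is not open.

    useful_bytes_per_access is an assumption, not a figure from the thread.
    """
    return accesses_per_second(t_rp_ns, t_ras_ns) * useful_bytes_per_access / 1e6


if __name__ == "__main__":
    rate = accesses_per_second(T_RP_NS, T_RAS_NS)
    print(f"{rate / 1e6:.1f} M accesses/s")
    for n in (1, 2, 8):
        print(f"{n} useful byte(s) per access: {throughput_mb_per_s(n):.1f} MB/s")
```

Under this simple model, a couple of useful bytes per access already lands in the neighbourhood of the 40 MB/s figure. It ignores bank-level parallelism and memory controller scheduling, so treat it as a ballpark only.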

https://en.wikipedia.org/wiki/Dynamic_RAM

2 comments

What do you mean by "tRP + tRAS"?

I now understand how it's reasonable, as in, correct. But I don't understand the fundamental reason for this. Okay, so every time a row is read, if it's not in cache it'll get cached. But why does it have to be that way?

Couldn't there be a mode, "hey don't fully open these rows, I just want one random byte as fast as possible!"

I compared it with spinning disks just to show how unreasonable the total is. I realize that the whole design isn't built around this idea of picking off a byte at a time.

But don't you think there could be applications that have PRECISELY, exactly this usage pattern?

For example, what percent of your neurons are firing at the moment? Very, very low.

For some future applications, getting a 10x speedup in random memory reads of single bytes might speed up that application by a lot. Even if desktops aren't built this way today, I'm super-surprised that when the whole system isn't doing anything else, there is no way to get that kind of raw access without asking for whole rows at a time.

> Couldn't there be a mode, "hey don't fully open these rows, I just want one random byte as fast as possible!"

As fast as possible is exactly tRP+tRAS. Since the whole row is read in parallel to RAM's internal SRAM buffer, opening only part of it would make no difference.

> What do you mean by "tRP + tRAS"?

Ever heard of RAM timings? I'm afraid at some point you will have to read how DRAM works to understand more. There was a link in my last post.

It's this way because in the 80s/90s computer architects simulated different kinds of CPU/memory system designs running existing C programs and measured that it's best to focus on caching and compromise on main memory random access. Then CPU vendors made such systems and they outsold and outperformed cacheless systems. After that, memory module standardization kept the same direction, because low memory cost per byte was more in demand than random access performance.

Yes, there are of course workloads that don't like that. But programs adapt to hardware over time too, so co-evolution has weeded out these access patterns from high-performance programs that can be structured differently.

You could make a computer that uses DRAM differently, but it would be expensive because you couldn't use mass market memory modules.

(Exception: some CPUs use in-package fast DRAM as last level cache).

There have been some custom hardware supercomputer designs (Tera MTA line) that were optimized for cache hostile workloads.

Super, super dumb question here, but now that memory is approaching an incredibly low $/byte ratio, would it be possible to use a large portion of the available memory to "index" the location of bytes to improve access time?

I'm fairly certain it's impossible to create a full and accurate index while still having a useful amount of RAM left over but could you perform some kind of extreme compression, even hashing each "row"? That way a CPU could eliminate X number of rows in its search due to the likelihood that the hash couldn't be generated if it contained that byte of data. I'm obviously a layman but I think that if statistical branch prediction works, there must be a way to use excessive amounts of memory to make random access into a predictive process.

Not sure what you mean. You always know which row contains any given address, the problem is that "seeking" in DRAM takes tens of nanoseconds (for contemporary chips), regardless of DRAM's clock speed or DDR/GDDR/LPDDR 1/2/3/4/5/6/7. Seeking in your "index" would take time too.

The only way to get good performance from DRAM is to always write data sequentially, in the same order it will be read. Then you get full sequential throughput for both writing and reading, which is many GB/s and keeps increasing with clock speed. But that's a software optimization.
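A toy model may help show why access order matters so much. The sketch below counts how often the open row has to change for a sequential trace versus a random one. Everything in it is an assumption for illustration: a single bank with one open row, a hypothetical 8 KiB row size, and addresses mapped to rows by plain division (real memory controllers interleave banks and channels).

```python
import random

ROW_BYTES = 8 * 1024  # hypothetical row size; the thread only says "a few kB"


def count_row_activations(addresses, row_bytes=ROW_BYTES):
    """Count how often the open row changes (one bank, one open row)."""
    open_row = None
    activations = 0
    for addr in addresses:
        row = addr // row_bytes
        if row != open_row:
            activations += 1
            open_row = row
    return activations


def sequential_trace(n_reads, stride=8):
    """Addresses read back in the same order they were laid out."""
    return [i * stride for i in range(n_reads)]


def random_trace(n_reads, memory_bytes, seed=0):
    """Uniformly random byte addresses (seeded so runs are repeatable)."""
    rng = random.Random(seed)
    return [rng.randrange(memory_bytes) for _ in range(n_reads)]


if __name__ == "__main__":
    n = 100_000
    seq = count_row_activations(sequential_trace(n))
    rnd = count_row_activations(random_trace(n, memory_bytes=1 << 30))
    print(f"sequential: {seq} row activations for {n} reads")
    print(f"random:     {rnd} row activations for {n} reads")
```

In this model the sequential trace pays the row-open cost once per row, while the random trace pays it on nearly every read, which is the gap between "many GB/s" and tens of MB/s described above.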

Otherwise caching is beneficial, but the cache has to be SRAM to have low random access latency. SRAM is physically larger and power hungry: half of a modern CPU is cache and it's still only a few MB.

CAS latency: about 10 cycles for an access outside the current row. That makes the RAM work at an effective 200 MHz at best if you are latency bound.