by mikepavone 1457 days ago
> That statement has to be coming with some hidden caveats. 64 bits of address space is crazy huge so it's unlikely the entire range was even present. If only a subset of the range was "instantly" available, we have that now. Turn off main memory and run right out of the L1 cache. Done.

So I did some digging around for documentation about this machine and it looks like it puts the upper 54 bits of the address through a hash function to select an entry in a set-associative tag RAM, which is then used to select a physical page. This has the possibility for collisions, but it can get away with that because RAM is just a cache for disk contents.

Certain parts of the address technically mean something, but apart from leveraging that in the design of their hash function it has no real relevance to the way the hardware works. This scheme would work with linear 64-bit addresses just fine with an appropriate hash implementation. Basically all that's happening here is that the TLB is large enough to have an entry for each physical page in the system, and a TLB miss means you have to fetch a whole page from disk rather than walking a tree of page tables in memory.
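
To make the lookup described above concrete, here is a minimal Python sketch. Only the 54-bit split comes from the description; the XOR-fold hash, the set count, the associativity, the replacement policy, and the idea that a slot's position directly names the physical page are all assumptions made for illustration, not details taken from the R1000 documentation.

```python
OFFSET_BITS = 10            # 64 - 54: low bits left over after the hashed part
SET_BITS = 9                # hypothetical: 512 sets
NUM_SETS = 1 << SET_BITS
WAYS = 4                    # hypothetical associativity


def set_index(tag):
    """Hypothetical hash: XOR-fold the tag into SET_BITS bits."""
    h = 0
    while tag:
        h ^= tag & (NUM_SETS - 1)
        tag >>= SET_BITS
    return h


class TagRam:
    """One slot per physical page; the slot position is the page number."""

    def __init__(self):
        self.sets = [[None] * WAYS for _ in range(NUM_SETS)]

    def lookup(self, vaddr):
        """Return the physical address, or None on a miss."""
        tag = vaddr >> OFFSET_BITS
        offset = vaddr & ((1 << OFFSET_BITS) - 1)
        idx = set_index(tag)
        for way, stored in enumerate(self.sets[idx]):
            if stored == tag:
                return ((idx * WAYS + way) << OFFSET_BITS) | offset
        return None  # miss: the page would have to come from disk

    def install(self, vaddr):
        """Claim a slot for vaddr's page after a miss; returns the page number."""
        tag = vaddr >> OFFSET_BITS
        idx = set_index(tag)
        ways = self.sets[idx]
        if tag in ways:
            way = ways.index(tag)      # already resident
        elif None in ways:
            way = ways.index(None)     # free way available
        else:
            way = 0                    # naive replacement: evict way 0
        ways[way] = tag
        return idx * WAYS + way
```

A miss (`lookup` returning `None`) followed by `install` stands in for the page fetch from disk. When more than `WAYS` pages hash to the same set, one of them is evicted, which is the collision case mentioned above.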

I think the other thing going on here is that the R1000 is a microcoded machine from the 80s with no cache (well, unless you're counting main RAM as cache), so it probably has a relatively leisurely memory read cycle, which makes it more straightforward to have a very large TLB relative to the size of main memory. There's no magic here and no lessons for modern machines when it comes to how virtual address translation is done.


You are right that there are no lessons for "modern machines" as we build them now.

But that is precisely my point: Maybe there are better ways to build them?

What I mean is that the R1000 memory architecture is not fundamentally different from modern hardware in a way that seems to solve any modern design problems. The tag RAM is functionally equivalent to the TLB on a modern CPU, but it's much larger relative to the size of the RAM it's used with. The 2MB memory boards described in US Patent 4,736,287 (presumably an earlier version of the 32MB boards present in the R1000/s400) have a whopping 2K tag RAM entries. This is the same size as the 2nd level data TLB in Zen 2 which is supporting address lookup for a much larger main memory.

If you were to try to make a modern version of the R1000 architecture, you're going to run into the same size vs speed tradeoffs that you see in conventional architectures. The server-oriented Rome SKUs of Zen 2 support 4 TB max RAM. Even if you bump the page size to 4MB, you would still need 1M TLB/tag RAM entries to support that with an R1000-style implementation.
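
The entry count above is just RAM size divided by page size. A quick check in Python, assuming binary units (4 TB read as 4 TiB, 4MB as 4 MiB) and one tag entry per physical page; the helper name is made up for this sketch:

```python
KiB, MiB, TiB = 1 << 10, 1 << 20, 1 << 40


def entries_needed(ram_bytes, page_bytes):
    """Tag/TLB entries required to cover every physical page exactly once."""
    return ram_bytes // page_bytes


print(entries_needed(4 * TiB, 4 * MiB))   # 1048576 (1M entries)
print(entries_needed(4 * TiB, 4 * KiB))   # 1073741824 (1G entries with 4 KiB pages)
```

For comparison, the 2MB board with 2K entries works out to one entry per 1 KiB of RAM under the same assumption.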

Sorry, but you are simply wrong there. The TLB is just one of many hacks made necessary by the ever-deeper page-table-tree.

What the R1000 does is collapse the obj->phys lookup in the DRAM memory cycle, and if we did that today, we wouldn't need any page-tables to begin with, much less TLBs.

> Sorry, but you are simply wrong there. The TLB is just one of many hacks made necessary by the ever-deeper page-table-tree.

> What the R1000 does is collapse the obj->phys lookup in the DRAM memory cycle, and if we did that today, we wouldn't need any page-tables to begin with, much less TLBs.

You would need a TLB even with a completely flat page table because hitting the DRAM bus (some flavor of DDR on modern systems, but it's still fundamentally DRAM) on every access would absolutely destroy performance on a modern machine even if translation itself was "free". You need translation structures that can keep up with the various on-chip cache levels which means they need to be small and hierarchical. You can't have some huge flat translation structure like you have on the R1000 and have it be fast.

Anyway, my point is that at a mechanical level TLB and tag RAM work the same way. You take a large virtual address, hash the upper bits and use them to do a lookup in a set-associative memory (so basically a hash table with a fixed number of buckets for conflicts). In some CPUs (it's a little unclear to me how common it is for cache to be virtually or physically addressed these days) this even happens in parallel with data fetch from cache just like tag RAM lookup on the R1000 was done in parallel with data fetch from DRAM. This is not some forgotten technique, it's just moved inside the CPU die and various speed and die space constraints keep it from covering all the physical pages of a modern system.

Now, could you perhaps use a more R1000-like approach for the final layer of translation? Sure. Integrating it tightly with system memory probably doesn't make sense given the need to be able to map other things like VRAM into a virtual address space, but you could have a flat hashtable-like arrangement even if it's just a structure in main RAM. You can even implement such a thing on an existing CPU with a software-managed TLB (MIPS, some SPARC).
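
As a rough illustration of that last idea, here is a Python sketch of a flat hashed translation table living in "main RAM", with a small TLB in front of it that is refilled in software on a miss. Everything here is hypothetical: the class names, the multiplicative hash, linear probing, the 4 KiB page size, and the FIFO TLB replacement are choices made for the sketch, not how any particular MIPS or SPARC OS does it.

```python
PAGE_BITS = 12  # assumption: 4 KiB pages


def hash_vpn(vpn):
    """Hypothetical multiplicative hash of a virtual page number."""
    return ((vpn * 0x9E3779B97F4A7C15) & (2**64 - 1)) >> 40


class HashedPageTable:
    """Flat open-addressed table of (vpn, pfn) pairs; no tree to walk."""

    def __init__(self, slots=1 << 16):   # slots must be a power of two
        self.slots = [None] * slots
        self.mask = slots - 1

    def _probe(self, vpn):
        i = hash_vpn(vpn) & self.mask
        for _ in range(len(self.slots)):
            yield i
            i = (i + 1) & self.mask      # linear probing on conflict

    def map(self, vpn, pfn):
        for i in self._probe(vpn):
            entry = self.slots[i]
            if entry is None or entry[0] == vpn:
                self.slots[i] = (vpn, pfn)
                return
        raise MemoryError("translation table full")

    def find(self, vpn):
        for i in self._probe(vpn):
            entry = self.slots[i]
            if entry is None:
                return None
            if entry[0] == vpn:
                return entry[1]
        return None


class SoftTlb:
    """Small translation cache refilled from the flat table on a miss."""

    def __init__(self, table, size=64):
        self.table, self.size = table, size
        self.entries = {}                # vpn -> pfn, insertion-ordered
        self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        pfn = self.entries.get(vpn)
        if pfn is None:
            self.misses += 1
            pfn = self.table.find(vpn)   # the software refill handler
            if pfn is None:
                raise LookupError("page fault: no mapping for this page")
            if len(self.entries) >= self.size:
                self.entries.pop(next(iter(self.entries)))  # FIFO eviction
            self.entries[vpn] = pfn
        return (pfn << PAGE_BITS) | offset
```

The small TLB is still there in this arrangement; what changes is that a miss costs one hashed probe into a flat table rather than a multi-level walk.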

The TLB is an attempt to mitigate the horrible performance properties of multi-level page-table-trees.

If you do away with the page-table-tree, there is no problem for the TLB to mitigate.

I'm sure there are improvements to be made, but there are pretty fundamental physical reasons why reading a random piece of data from a large pool of memory is going to take longer than reading a random piece of data from a small pool of memory. Hence, as memory pools have gotten bigger since the days of the R1000, we use caches both in memory itself and in address translation.