by mikepavone 1457 days ago
What I mean is that the R1000 memory architecture is not fundamentally different from modern hardware in a way that seems to solve any modern design problems. The tag RAM is functionally equivalent to the TLB on a modern CPU, but it's much larger relative to the size of the RAM it's used with. The 2MB memory boards described in US Patent 4,736,287 (presumably an earlier version of the 32MB boards present in the R1000/s400) have a whopping 2K tag RAM entries. This is the same size as the 2nd level data TLB in Zen 2, which supports address lookup for a much larger main memory.

If you tried to make a modern version of the R1000 architecture, you would run into the same size vs. speed tradeoffs that you see in conventional architectures. The server-oriented Rome SKUs of Zen 2 support up to 4 TB of RAM. Even if you bump the page size to 4MB, you would still need 1M TLB/tag RAM entries to support that with an R1000-style implementation.
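To make the arithmetic above concrete, here is a rough sketch of the entry count, assuming one tag/TLB entry per physical page. The function and variable names are mine, and the 4 KB figure is added only for comparison.

```python
# Back-of-the-envelope count of translation entries needed when every
# physical page gets its own tag RAM / TLB entry (an assumption of this sketch).

KB = 1 << 10
MB = 1 << 20
TB = 1 << 40


def entries_needed(ram_bytes: int, page_bytes: int) -> int:
    """One entry per physical page, rounded up."""
    return -(-ram_bytes // page_bytes)


# 4 TB of RAM with 4 MB pages: 2**42 / 2**22 = 2**20 entries (1M).
rome_4mb_pages = entries_needed(4 * TB, 4 * MB)

# Same RAM with 4 KB pages, for comparison: 2**30 entries.
rome_4kb_pages = entries_needed(4 * TB, 4 * KB)

# The 2 MB board with 2K tag entries works out to 1 KB of RAM per entry.
r1000_bytes_per_entry = (2 * MB) // 2048

print(rome_4mb_pages, rome_4kb_pages, r1000_bytes_per_entry)
```

So the 2K-entry tag RAM covers its whole board, while a structure with the same coverage on a 4 TB machine would need about 512 times as many entries even with 4MB pages.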

1 comment

Sorry, but you are simply wrong there. The TLB is just one of many hacks made necessary by the ever-deeper page-table-tree.

What the R1000 does is collapse the obj->phys lookup in the DRAM memory cycle, and if we did that today, we wouldn't need any page-tables to begin with, much less TLBs.

> Sorry, but you are simply wrong there. The TLB is just one of many hacks made necessary by the ever-deeper page-table-tree.
>
> What the R1000 does is collapse the obj->phys lookup in the DRAM memory cycle, and if we did that today, we wouldn't need any page-tables to begin with, much less TLBs.

You would need a TLB even with a completely flat page table, because hitting the DRAM bus (some flavor of DDR on modern systems, but it's still fundamentally DRAM) on every access would absolutely destroy performance on a modern machine even if translation itself were "free". You need translation structures that can keep up with the various on-chip cache levels, which means they need to be small and hierarchical. You can't have some huge flat translation structure like you have on the R1000 and have it be fast.

Anyway, my point is that at a mechanical level a TLB and the tag RAM work the same way. You take a large virtual address, hash the upper bits, and use the result to do a lookup in a set-associative memory (so basically a hash table with a fixed number of slots per bucket for conflicts). In some CPUs (it's a little unclear to me how common it is for cache to be virtually or physically addressed these days) this even happens in parallel with data fetch from cache, just like tag RAM lookup on the R1000 was done in parallel with data fetch from DRAM. This is not some forgotten technique; it has just moved inside the CPU die, and various speed and die space constraints keep it from covering all the physical pages of a modern system.
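As a rough illustration of that lookup, here is a toy set-associative table. It is not how any particular TLB or the R1000 tag RAM is wired; the hash function, the FIFO replacement policy, the 4 KB page size, and all names are assumptions made for the sketch.

```python
PAGE_SHIFT = 12                      # assumed 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class SetAssociativeTable:
    """Hash table with a fixed number of slots (ways) per bucket (set)."""

    def __init__(self, num_sets=512, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        # Each set holds up to `ways` (tag, frame) pairs, oldest first.
        self.sets = [[] for _ in range(num_sets)]

    def _index(self, vpn):
        # Hypothetical hash: fold some upper bits onto the lower bits.
        return (vpn ^ (vpn >> 9)) % self.num_sets

    def lookup(self, vpn):
        # Hardware compares all ways at once; here we just scan them.
        for tag, frame in self.sets[self._index(vpn)]:
            if tag == vpn:
                return frame
        return None                  # miss

    def insert(self, vpn, frame):
        bucket = self.sets[self._index(vpn)]
        for i, (tag, _) in enumerate(bucket):
            if tag == vpn:
                bucket[i] = (vpn, frame)
                return
        if len(bucket) == self.ways:
            bucket.pop(0)            # set is full: evict the oldest entry
        bucket.append((vpn, frame))


def translate(table, vaddr):
    """Return the physical address, or None on a miss."""
    frame = table.lookup(vaddr >> PAGE_SHIFT)
    if frame is None:
        return None
    return (frame << PAGE_SHIFT) | (vaddr & PAGE_MASK)
```

The only real difference between the two cases is the ratio of table entries to physical pages: on the R1000 the table covers every page, so a miss means the page is not resident, while a TLB miss just means falling back to a slower structure.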

Now, could you perhaps use a more R1000-like approach for the final layer of translation? Sure. Integrating it tightly with system memory probably doesn't make sense given the need to be able to map other things like VRAM into a virtual address space, but you could have a flat hashtable-like arrangement even if it's just a structure in main RAM. You can even implement such a thing on an existing CPU with a software-managed TLB (MIPS, some SPARC).
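A minimal sketch of what that could look like, with a flat hash table standing in for the structure in main RAM and a small dictionary standing in for the TLB. Everything here is hypothetical: the class names, linear probing, the FIFO eviction, and the 4 KB page size are my assumptions, not how MIPS or SPARC refill handlers are actually written.

```python
PAGE_SHIFT = 12                      # assumed 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class FlatTranslationTable:
    """Open-addressed hash table (linear probing) from VPN to frame."""

    def __init__(self, slots):
        self.slots = [None] * slots  # each slot is None or (vpn, frame)

    def _probe(self, vpn):
        start = vpn % len(self.slots)
        for i in range(len(self.slots)):
            yield (start + i) % len(self.slots)

    def map(self, vpn, frame):
        for i in self._probe(vpn):
            if self.slots[i] is None or self.slots[i][0] == vpn:
                self.slots[i] = (vpn, frame)
                return
        raise MemoryError("translation table full")

    def find(self, vpn):
        for i in self._probe(vpn):
            entry = self.slots[i]
            if entry is None:        # empty slot ends the probe (no deletions here)
                return None
            if entry[0] == vpn:
                return entry[1]
        return None


class SoftwareManagedTLB:
    """Small cache in front of the flat table; misses run a refill routine."""

    def __init__(self, table, capacity=64):
        self.table = table
        self.capacity = capacity
        self.entries = {}            # dicts keep insertion order, used for FIFO
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = vaddr >> PAGE_SHIFT, vaddr & PAGE_MASK
        frame = self.entries.get(vpn)
        if frame is None:
            # "Refill handler": a hash probe into the flat table, no tree walk.
            self.misses += 1
            frame = self.table.find(vpn)
            if frame is None:
                raise LookupError("page fault: no mapping for this page")
            if len(self.entries) >= self.capacity:
                self.entries.pop(next(iter(self.entries)))  # evict oldest
            self.entries[vpn] = frame
        return (frame << PAGE_SHIFT) | offset
```

The miss path is a short hash probe rather than a multi-level walk, but the small cache in front is still there, because the flat table lives in DRAM.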

The TLB is an attempt to mitigate the horrible performance properties of multi-level page-table-trees.

If you do away with the page-table-tree, there is no problem for the TLB to mitigate.