| Oh cool, good to hear from you! If you happen to read this, I would like to give you a vote on the direction I'm hoping for. My background is in video games, blitters, MATLAB, basically swizzling large amounts of data in various DSP-like ways. A cache miss can easily cost 100 cycles, so processors are sitting around waiting for data 99% of the time sometimes. IMHO pretty much all software today is RAM bus bound, not CPU bound. And this was true in the late 90s, so is even more true today by about 2-4 orders of magnitude (that's why single-threaded CPU performance hasn't changed much since the early 2000s). In the beginning, caching was a nice idea to get around bus latency. But the last processor design I saw in 1999 when I graduated college had about 75% of the chip devoted to cache. I would have much preferred to have 3 more cores. But the industry kept doubling down on that, so today we're still stuck with CPUs with maybe 4, 8 or 16 cores, but only one bus to RAM. So to me, hardware caching has been a waste of time when it comes to parallel computation. I think a better model is a software cache running at a privileged level above userland. In other words, a smart MMU per CPU, written in microcode or software that presents a uniform address space to the application. Each MMU could/should have a hash for each (say 4k) page's data, and then each CPU/MMU core would act like a router, either returning the contents of a page for a given hash, or asking its 4 or 8 neighbors if they contain the page (returning it if so, otherwise asking the next neighbors): https://en.wikipedia.org/wiki/Content-addressable_memory From the software side, this would look like a single deduplicated virtual memory where each unique page would appear once in the address space, usually out in the mesh somewhere (not resident) unless its core had cached the page locally while its code operated on it. At most, a page would only be an x+y hop from any of the neighboring cores, and a thread in each core should block until the page is resident (just like with any other cache or virtual memory). The next step is to deal with mutation via a copy-on-write mechanism: https://en.wikipedia.org/wiki/Copy-on-write https://en.wikipedia.org/wiki/Software_transactional_memory So if an app has the same string allocated 10 times, only one page would hold the original data. The other 9 would be mapped to that first page via virtual memory. Everything could/should be immutable, so when a string is modified, it allocates a new page for the data with its own unique hash. The original string would eventually be deallocated through a least recently used (LRU) strategy, or perhaps each MMU could have some notion of garbage collection that preserves a single unique page somewhere in the mesh until all cores no longer reference it. A quick and dirty way to deal with this later would be to recommend that the mesh be hooked up to a large flash memory (over 1 GB) where old pages could be written to be stored indefinitely until storage runs out (the same as with traditional virtual memory). The most elegant way I know to implement this would be to run something like Clojure in each MMU: https://clojure.org/about/state https://clojure.org/reference/refs A more "hands on" way to think of this with ugly syntax is Redux: https://www.freecodecamp.org/news/an-intro-to-redux-and-how-... So basically the MMU would treat every read/write like a transaction and insulate the userland code from having to deal with routing and deduplication. Then each (say 32 bit) userland memory address would map to one of the CAM data hashes so that ordinary compiled code can run unmodified. All of the CAM and STM stuff should happen entirely in the MMU. If we're mapping 32 bit addresses to data hashed using a 160 bit (20 byte) SHA-1, then the MMU can treat that 20 byte hash like a pointer and utilize standard garbage collection algorithms. The easiest one I know is to refcount allocations, freeing the data when it's no longer referenced: https://en.wikipedia.org/wiki/Automatic_Reference_Counting The hardest part will be handling circular references. Unfortunately this is an open problem, but maybe you can borrow established techniques from something like Java: https://www.google.com/search?q=garbage+collection+circular+... A proper MMU implementation should be able to run as something like a kernel extension so that the whole chip can reach out to other chips in a cluster or the internet, using a convention similar to IPFS: https://en.wikipedia.org/wiki/InterPlanetary_File_System https://hackernoon.com/a-beginners-guide-to-ipfs-20673fedd3f The goal I want to accomplish is to get away from the constraining SIMD that's been hyped by video cards, TensorFlow, etc and replace it with the general purpose MIMD of vector languages like MATLAB but in a parallelized way. From a user standpoint, we should be able to write code from vanilla C all the way up to rapid application development languages like Elixer or Go, where everything is synchronous-blocking, and pipe data to/from other processes via channels and the Actor model (the same as UNIX executables), so it looks like we're running in a single computer with say 256 cores and ~256 buses to small local RAMs so we can use high throughput techniques like MapReduce. Here are some breadcrumbs for what I'm getting at: https://news.ycombinator.com/item?id=17419917 https://news.ycombinator.com/item?id=18294939 http://aggregate.org/MOG/ Anyway, sorry this got too long. It's a lot to digest, but I've been thinking about it for over 20 years. The takeway is that perhaps you can punt on the MMU for now, and research adding something like a Lisp machine (a simple stack machine) as a coprocessor to each ZipCPU core, ideally able to run a subset of Clojure directly. It doesn't have to be fast or efficient, because it will mostly only be called for cache misses, similarly to how virtual memory reads from slower storage. Except in this case, that slow storage is the mesh itself. The MMU just needs to be able to read/write content-addressable pages to each of its 4 or 8 neighbors in the mesh, maybe with dedicated data and hash lines, or else through a packet protocol similar to UDP or Rambus serialized to a smaller bus, raising interrupts when userland tries to read data that isn't resident. The rest can be implemented purely through software. To just "get it done" without an MMU, you could run your own microkernel on each CPU with a software MMU that maps the 32 bit address space to the CAM and handles the hash routing. Or like, do whatever you want man :-) Hope this sparks some inspiration. |
Dan