Hacker News new | ask | show | jobs
by zackmorris 2508 days ago
Oh cool, good to hear from you! If you happen to read this, I would like to give you a vote on the direction I'm hoping for. My background is in video games, blitters, MATLAB, basically swizzling large amounts of data in various DSP-like ways. A cache miss can easily cost 100 cycles, so processors are sitting around waiting for data 99% of the time sometimes. IMHO pretty much all software today is RAM bus bound, not CPU bound. And this was true in the late 90s, so is even more true today by about 2-4 orders of magnitude (that's why single-threaded CPU performance hasn't changed much since the early 2000s).

In the beginning, caching was a nice idea to get around bus latency. But the last processor design I saw in 1999 when I graduated college had about 75% of the chip devoted to cache. I would have much preferred to have 3 more cores. But the industry kept doubling down on that, so today we're still stuck with CPUs with maybe 4, 8 or 16 cores, but only one bus to RAM.

So to me, hardware caching has been a waste of time when it comes to parallel computation. I think a better model is a software cache running at a privileged level above userland. In other words, a smart MMU per CPU, written in microcode or software that presents a uniform address space to the application. Each MMU could/should have a hash for each (say 4k) page's data, and then each CPU/MMU core would act like a router, either returning the contents of a page for a given hash, or asking its 4 or 8 neighbors if they contain the page (returning it if so, otherwise asking the next neighbors):

https://en.wikipedia.org/wiki/Content-addressable_memory

From the software side, this would look like a single deduplicated virtual memory where each unique page would appear once in the address space, usually out in the mesh somewhere (not resident) unless its core had cached the page locally while its code operated on it. At most, a page would only be an x+y hop from any of the neighboring cores, and a thread in each core should block until the page is resident (just like with any other cache or virtual memory).

The next step is to deal with mutation via a copy-on-write mechanism:

https://en.wikipedia.org/wiki/Copy-on-write

https://en.wikipedia.org/wiki/Software_transactional_memory

So if an app has the same string allocated 10 times, only one page would hold the original data. The other 9 would be mapped to that first page via virtual memory. Everything could/should be immutable, so when a string is modified, it allocates a new page for the data with its own unique hash. The original string would eventually be deallocated through a least recently used (LRU) strategy, or perhaps each MMU could have some notion of garbage collection that preserves a single unique page somewhere in the mesh until all cores no longer reference it. A quick and dirty way to deal with this later would be to recommend that the mesh be hooked up to a large flash memory (over 1 GB) where old pages could be written to be stored indefinitely until storage runs out (the same as with traditional virtual memory).

The most elegant way I know to implement this would be to run something like Clojure in each MMU:

https://clojure.org/about/state

https://clojure.org/reference/refs

A more "hands on" way to think of this with ugly syntax is Redux:

https://www.freecodecamp.org/news/an-intro-to-redux-and-how-...

So basically the MMU would treat every read/write like a transaction and insulate the userland code from having to deal with routing and deduplication. Then each (say 32 bit) userland memory address would map to one of the CAM data hashes so that ordinary compiled code can run unmodified. All of the CAM and STM stuff should happen entirely in the MMU. If we're mapping 32 bit addresses to data hashed using a 160 bit (20 byte) SHA-1, then the MMU can treat that 20 byte hash like a pointer and utilize standard garbage collection algorithms. The easiest one I know is to refcount allocations, freeing the data when it's no longer referenced:

https://en.wikipedia.org/wiki/Automatic_Reference_Counting

The hardest part will be handling circular references. Unfortunately this is an open problem, but maybe you can borrow established techniques from something like Java:

https://www.google.com/search?q=garbage+collection+circular+...

A proper MMU implementation should be able to run as something like a kernel extension so that the whole chip can reach out to other chips in a cluster or the internet, using a convention similar to IPFS:

https://en.wikipedia.org/wiki/InterPlanetary_File_System

https://hackernoon.com/a-beginners-guide-to-ipfs-20673fedd3f

The goal I want to accomplish is to get away from the constraining SIMD that's been hyped by video cards, TensorFlow, etc and replace it with the general purpose MIMD of vector languages like MATLAB but in a parallelized way. From a user standpoint, we should be able to write code from vanilla C all the way up to rapid application development languages like Elixer or Go, where everything is synchronous-blocking, and pipe data to/from other processes via channels and the Actor model (the same as UNIX executables), so it looks like we're running in a single computer with say 256 cores and ~256 buses to small local RAMs so we can use high throughput techniques like MapReduce. Here are some breadcrumbs for what I'm getting at:

https://news.ycombinator.com/item?id=17419917

https://news.ycombinator.com/item?id=18294939

http://aggregate.org/MOG/

Anyway, sorry this got too long. It's a lot to digest, but I've been thinking about it for over 20 years. The takeway is that perhaps you can punt on the MMU for now, and research adding something like a Lisp machine (a simple stack machine) as a coprocessor to each ZipCPU core, ideally able to run a subset of Clojure directly. It doesn't have to be fast or efficient, because it will mostly only be called for cache misses, similarly to how virtual memory reads from slower storage. Except in this case, that slow storage is the mesh itself. The MMU just needs to be able to read/write content-addressable pages to each of its 4 or 8 neighbors in the mesh, maybe with dedicated data and hash lines, or else through a packet protocol similar to UDP or Rambus serialized to a smaller bus, raising interrupts when userland tries to read data that isn't resident. The rest can be implemented purely through software. To just "get it done" without an MMU, you could run your own microkernel on each CPU with a software MMU that maps the 32 bit address space to the CAM and handles the hash routing.

Or like, do whatever you want man :-) Hope this sparks some inspiration.

1 comments

I'm not sure I followed all of that, but let me at least try to look up the links and see what I can learn. Maybe they'll help me follow the concepts you are suggesting.

Dan

I realized after sleeping on it that I should clarify the core idea. With computer architecture, it's easy to get hung up thinking about things like I/O, busses, caching, etc. But I think the future is distributed content-addressable computing, where the state is just "out there" in the internet somewhere, and the runtime handles all the data locality issues so it looks like one continuous memory to the application. If we have that, then it's trivial to add processors either in the local mesh or even out on the internet somewhere.

Thinking in terms of cores connected to their 4 neighbors with a routing protocol like "do you have the data for hash ABC, otherwise ask your neighbors" or "store this data with hash XYZ" reduces the problem space to a key-value store similar to redis, so you don't have to worry about hardware caching.

For now, the network protocol could probably just be something like multicast. Packets would go to all neighbors, and they would be unconnected, unreliable like UDP. The header would either be a REQUEST with an N bit hash, or a RESPONSE with the N bit hash and the data associated with it. There might also be a TTL field that goes up to roughly sqrt(# of cores) so that packets diffuse through the network once but don't bounce around endlessly. I think that cores would be broadcasting new [hash][data] packets for every RAM write, and other cores would save all of them and evict their oldest hashes (which allows the cores to pre-cache data that's likely to be used soon). Then upon requesting a RESPONSE for a certain hash, a core would block until it arrived from the cluster in roughly sqrt(# of cores) cycles.

From a code standpoint, a page in the 32 bit address space would have a real pointer but it would map through the virtual memory manager (VMM) to one of these [hash][data] pages. So two processes in the cluster wouldn't have the same address for the same piece of data. This can be looked at as a feature, not a bug, because it enforces process separation similar to Erlang. So when a page's data is updated, the page gets swapped out by the VMM and re-hashed to a new content-addressable page out in the cluster. So it might make sense to have a coprocessor per-cpu that hashes pages for every RAM write. Or maybe the hashing could happen in the RAM write interrupt and we'd settle for the speed hit, not sure. Also it (might) be good to choose a hash that implements something where you can re-compute a partial range inside the hash rather than having to recompute the whole thing if a single byte changes. Like maybe a custom hasher could be designed that uses a divide and conquer strategy to find which half changed, and then the half in the half that changed, down to some level and append them like with the MD5 appending flaw where they make a document that looks like another document but has different data in the middle that both hash to the same hash so the documents have very different behavior. What I mean is, maybe there is a hashing algorithm that allows you to recalculate a half, or a quarter, or an eighth, down to some granularity, so the hashing could be optimized from O(n) to O(n/64) or whatever. Also after writing this out, I realized that the VMM would store (for SHA-1) a 20 byte hash for every 4k block, so roughly 1/200 of each core's RAM would go to this storage scheme.

Would this VMM be easier to build than an MMU? I don't know, but I do know that software is cheaper than hardware so it's maybe possible.