by azernik 1871 days ago
Essentially:

They have one address range for shared (i.e. subject to syncing across all replicas) memory, and a separate one for non-shared (single-replica) memory.

Cross-replica data is presumably subject to their agreement algorithm, which checks that the different computers reach the same results (within some error bars). You want to arrange things so that there are frequent checkpoints at which the conflict resolution system can say "a bad write happened at this point, I should disregard whatever this computer said from that point until it recovers".
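
To make the "same within some error bars" check concrete, here is a rough sketch of a per-checkpoint vote. I don't know what agreement algorithm they actually use; the median-plus-tolerance rule and the names `vote`, `outputs` and `tolerance` are my own assumptions for illustration.

```python
from statistics import median

def vote(outputs, tolerance):
    """Compare one checkpoint's results across replicas (hypothetical helper).

    outputs: dict mapping replica id -> numeric result
    tolerance: largest absolute deviation from the median that still counts
               as agreement (the "error bars")
    Returns (agreed_value, suspect_replica_ids).
    """
    if not outputs:
        raise ValueError("no replica outputs to compare")
    mid = median(outputs.values())
    # Replicas outside the error bars are the ones to disregard until they recover.
    suspects = sorted(r for r, v in outputs.items() if abs(v - mid) > tolerance)
    agreeing = [v for r, v in outputs.items() if r not in suspects]
    # Require a strict majority of replicas to agree, otherwise refuse to decide.
    if len(agreeing) * 2 <= len(outputs):
        raise RuntimeError("no majority within tolerance")
    return sum(agreeing) / len(agreeing), suspects

value, suspects = vote({"a": 1.00, "b": 1.01, "c": 7.5}, tolerance=0.05)
```

Here replica "c" ends up in `suspects`, and the agreed value is the mean of the other two.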

In other words, you want local memory as scratch space for performance reasons, but you also want to make sure there isn't a long runway for errors to happen and decisions to be made before the shared-memory checker notices a mistake. To ensure this, you want manual control over which memory allocator handles which data.
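
A toy model of the two-allocator idea, assuming a simple bump allocator per region. `Arena`, `Replica` and `checkpoint_digest` are made-up names, and comparing exact hashes is a simplification of whatever tolerance-based check the real system does.

```python
import hashlib

class Arena:
    """Bump allocator over a fixed-size buffer (hypothetical, for illustration)."""
    def __init__(self, size):
        self.buf = bytearray(size)
        self.top = 0

    def alloc(self, n):
        if self.top + n > len(self.buf):
            raise MemoryError("arena exhausted")
        start = self.top
        self.top += n
        # Writable view into this arena's own address range.
        return memoryview(self.buf)[start:start + n]

class Replica:
    def __init__(self, shared_size=64, scratch_size=64):
        self.shared = Arena(shared_size)    # compared across replicas
        self.scratch = Arena(scratch_size)  # private, never compared

    def checkpoint_digest(self):
        # Only the shared arena takes part in the cross-replica comparison.
        return hashlib.sha256(bytes(self.shared.buf)).hexdigest()

r1, r2 = Replica(), Replica()
for r, tmp in ((r1, b"\xaa\xaa"), (r2, b"\xbb\xbb")):
    r.shared.alloc(4)[:] = b"\x01\x02\x03\x04"  # same shared data on both
    r.scratch.alloc(2)[:] = tmp                 # scratch contents differ
```

The caller picks the arena explicitly for every allocation, so differing scratch contents never show up in the checkpoint comparison, while a single flipped bit in the shared arena does.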


That makes sense. One part I'm still not clear on is how you accomplish a "restore" to fix the broken state of a process with a bitflip. Is it enough to simply copy all the shared state memory over as a block and jump into executing it? That seems like it would require the invariant that shared memory never references private memory, and I'm not sure how to statically enforce that.

"Restore" is reboot. The hardware that triggers it is usually called a "watchdog circuit", which you may have heard of from more mundane embedded applications.

Once you've rebooted, yeah, you need to copy over the shared state from another of the processes.
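
Roughly, in software terms (a sketch only: a real watchdog is a hardware timer, and `Watchdog`, `Process` and `supervise` are names I made up for this example):

```python
import copy

class Watchdog:
    """Software stand-in for a watchdog circuit: expires unless kicked in time."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_kick = 0.0

    def kick(self, now):
        self.last_kick = now

    def expired(self, now):
        return now - self.last_kick > self.timeout

class Process:
    def __init__(self, shared):
        self.shared = shared  # state replicated across processes
        self.scratch = {}     # private state, lost on reboot

    def reboot_from(self, peer):
        # Private state is thrown away; shared state is copied from a healthy peer.
        self.scratch = {}
        self.shared = copy.deepcopy(peer.shared)

def supervise(proc, peer, watchdog, now):
    """Reboot proc from peer if its watchdog expired; return True if rebooted."""
    if watchdog.expired(now):
        proc.reboot_from(peer)
        watchdog.kick(now)
        return True
    return False
```

Because the reboot discards scratch entirely, nothing in the restored shared state can depend on the old private memory surviving.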

You probably have hardware that watches ECC flags. For a correctable one-bit flip, it triggers a read-and-then-write. For a two-bit flip, it might just kill and restart the process, or reset the whole machine. As long as it doesn't happen too often, it's fine: the whole system (constellation and ground nodes) is designed to accommodate such events.
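
As a decision table, the policy I'm guessing at looks something like this. The enum names and `handle_ecc` are hypothetical; real ECC handling lives in the memory controller and OS, not in application code.

```python
from enum import Enum

class EccEvent(Enum):
    NONE = 0
    CORRECTABLE = 1    # one-bit flip: ECC can fix the value on read
    UNCORRECTABLE = 2  # two-bit flip: detected but not fixable

class Action(Enum):
    CONTINUE = "continue"
    SCRUB = "read-then-write"  # write the corrected value back to memory
    RESTART = "restart"        # kill and restart the process (or the machine)

def handle_ecc(event):
    """Map an ECC flag to the recovery action described above."""
    if event is EccEvent.CORRECTABLE:
        return Action.SCRUB
    if event is EccEvent.UNCORRECTABLE:
        return Action.RESTART
    return Action.CONTINUE
```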