by alexkus 4883 days ago
Two main solutions (given that the article mentions that they have huge machines already):-

A) Allow a single solar system to span multiple machines. Very hard, especially if the server software isn't architected for this. Retrofitting this can be nigh on impossible.

B) Have a few huge machines that can be used to host scenarios like this and, more importantly, have a way of migrating users over to the huge machine seamlessly.

The latter can be done but it's tricky, especially if transferring game state between instances of the server is not simple (I'm not talking about transferring the VM itself with something like vMotion). It comes down to:-

1) Being able to make the bigger machine act as a temporary proxy pushing connection data back to the smaller machine.

2) Having a way of telling clients to make a new connection to the bigger machine and, once that connection is made (and the data is being proxied to the smaller machine), cut the connection to the smaller machine. Users see no loss of service or reconnects at all.

3) Once all clients are being proxied by the bigger machine, pause and transfer the game state from the smaller machine to the big machine and then continue. Obviously it works best if a chunk of state can be transferred in the background, so that the final transfer (and pause) only has to carry over the bang-up-to-the-minute state and is as short as possible.
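
A minimal sketch of step 3 only, to show why the background copy keeps the pause short. Everything here is hypothetical (the `Server` class, the `migrate` function, the dict-as-game-state); a real server would also have to handle deleted entries and the proxying from steps 1 and 2.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical stand-in for one game server instance."""
    name: str
    state: dict = field(default_factory=dict)
    paused: bool = False

    def apply(self, key, value):
        # A paused server must not accept game updates.
        if self.paused:
            raise RuntimeError(f"{self.name} is paused")
        self.state[key] = value


def migrate(small, big, updates_during_copy):
    """Bulk-copy in the background, then pause and send only what changed."""
    # Background phase: copy a snapshot while the small machine keeps serving.
    snapshot = dict(small.state)
    big.state.update(snapshot)

    # The game carries on during the bulk copy (simulated here).
    for key, value in updates_during_copy:
        small.apply(key, value)

    # Final phase: pause, then transfer only entries that are new or changed.
    small.paused = True
    delta = {k: v for k, v in small.state.items()
             if k not in snapshot or snapshot[k] != v}
    big.state.update(delta)
    return len(delta)  # entries that had to be sent during the pause


small = Server("small", {"ship1": 100, "ship2": 80})
big = Server("big")
sent_while_paused = migrate(small, big, [("ship1", 55), ("ship3", 100)])
```

In this toy run only two entries cross during the pause, rather than the whole state.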

Option (A) is always the proverbial "In v2 of the server we'll do it a completely different way..."

4 comments

Which never seems to come around, because the new hardware is so much faster that it can host what were previously problematic server loads without a problem, and you've got a million other things to write.

Yet players have a tendency to figure out when places are too overcrowded to be fun. So your old problematic load is almost never representative of how many players wanted to be in that area, but merely how many players were willing to put up with that level of degraded performance.

So upon release (or sufficiently close to it to start stress testing, which is conveniently when it's too late to really change architecture) the new limits are quickly hit.

They've been struggling with the issue of multi-thousand player fights for a while now, and have moved towards both of these solutions but are obviously not quite there yet.

For instance, the article actually talks about having said huge machines. There's a way in EVE to inform the GMs about anticipated big fights, at which point they'll do the reinforcement preemptively. In this case, there wasn't such a convenient warning.

I find this post particularly interesting, since what's described (outside of doing it at a VM level) somewhat reflects how some Telecom providers build their equipment. Telecoms in North America are properly crazy when it comes to recovering from failure with minimal visible impact to customers.

Usually on the Telecom equipment, the backup / state transfer is done at a process level, not at a VM level as suggested, but it's quite common practice.

The best equipment I've seen does this by spawning many equivalent processes and distributing them among the available blades in the chassis. If you have process mgr1, you get a backup1 process on another blade. As mgr1 processes your call state, it checkpoints all critical data to the backup1 process. If the mgr1 process itself crashes, or the entire blade fails, all the processes are simply re-spawned, contact their corresponding backup process, transfer all the state information back, and simply resume. Using this method, I've seen equipment recover well over 30,000 subscriber sessions in under 5 seconds. Most end users wouldn't even notice, and even if you did it wouldn't be enough to drop your data connection (VPN, video streaming, or whatever you're doing). We also don't lose your bill for the usage either ;)
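
Roughly, the mgr1/backup1 pattern looks like the sketch below. This is an illustration only, not how any particular vendor does it: the `Manager` and `Backup` classes are made up, both live in one Python process here, and on real equipment the checkpoint would travel over the backplane to another blade.

```python
import copy


class Backup:
    """Hypothetical backup1: holds the last checkpoint for each session."""

    def __init__(self):
        self.checkpoints = {}

    def store(self, session_id, state):
        # Deep copy so later changes in the manager can't alter the checkpoint.
        self.checkpoints[session_id] = copy.deepcopy(state)

    def restore_all(self):
        return copy.deepcopy(self.checkpoints)


class Manager:
    """Hypothetical mgr1: processes sessions and checkpoints critical data."""

    def __init__(self, backup):
        self.backup = backup
        self.sessions = {}

    def handle_event(self, session_id, bytes_used):
        session = self.sessions.setdefault(session_id, {"bytes": 0})
        session["bytes"] += bytes_used
        # Checkpoint after every change, so a crash loses nothing billable.
        self.backup.store(session_id, session)

    @classmethod
    def respawn(cls, backup):
        # After a crash: start a fresh process and pull the state back.
        mgr = cls(backup)
        mgr.sessions = backup.restore_all()
        return mgr


backup1 = Backup()
mgr1 = Manager(backup1)
mgr1.handle_event("alice", 100)
mgr1.handle_event("alice", 50)
del mgr1                              # simulate the process or blade dying
mgr1 = Manager.respawn(backup1)       # alice's 150 bytes are still on record
```

Checkpointing on every event is the simplest choice; picking what to checkpoint and how often is where the real design work goes.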

The challenge with applying this to the game environment is that in telecom each user session is independent and doesn't really interact with other sessions, so we don't have the issue of a single process becoming overloaded and needing to free up resources to handle it. However, it would be properly easy to do within this model, since failure is expected to occur and be recovered from.

As a programmer, you have to be properly diligent in the software design: what gets checkpointed, and when does it occur? I couldn't even imagine trying to retroactively apply this type of design to "legacy" software that wasn't built from the ground up with this model in mind.

Option 4C) Create a raspberry pi cluster farm, and host every player with one microserver.
How would the microservers keep up to date with each other fast enough to be useful? "Eventual consistency" doesn't work with real-time strategy games.
How would they not? Latency between servers would be near zero, and if they all handled point to point transactions while broadcasting their state on each transaction, you're getting all the power you need. We're not talking about complex processing here. It's an RNG coupled with simple health and xyz.
Broadcasting their state to what? All the other nodes? So each "microserver" would still need to handle thousands of connections and data for every player?
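
To put a rough number on it, here is a back-of-the-envelope sketch. It assumes one microserver per player and that every node sends each state update to every other node; the function name and the one-update-per-second default are illustrative, not figures from the article.

```python
def broadcast_load(players, updates_per_second=1):
    """Messages per second under naive all-to-all state broadcast."""
    # Each node hears from every other node.
    received_per_node = (players - 1) * updates_per_second
    # Summed over the whole cluster.
    total = players * received_per_node
    return received_per_node, total


per_node, total = broadcast_load(3000)
print(per_node, total)  # 2999 8997000
```

So each node still takes in one stream per other player, and the cluster-wide message count grows with the square of the player count.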