Hacker News
by stephengillie 4883 days ago
It's like each solar system is a VM, and they can't move their VMs to a new physical server without disconnecting all clients. And all of their physical servers are at 100% load all of the time? Oh, I guess it's 100% utilization, not 100% load. As in, they don't spin down servers to save power during off-hours I guess.

The time dilation is a neat solution to the server load problem, but it's sooo annoying as a player. In beta, it was interesting to watch the entire game desync and grind to a halt, but we could still chat and look around. Interesting, but frustrating.


Two main solutions (given that the article mentions that they have huge machines already):-

A) Allow a single solar system to span multiple machines. Very hard, especially if the server software isn't architected for this. Retrofitting this can be nigh on impossible.

B) Have a few huge machines that can be used to host scenarios like this and, more importantly, have a way of migrating users over to the huge machine seamlessly.

The latter can be done but it's tricky, especially if transferring game state between instances of the server is not simple (I'm not talking about transferring the VM itself with something like vMotion). It comes down to:-

1) Being able to make the bigger machine act as a temporary proxy pushing connection data back to the smaller machine.

2) Having a way of telling clients to make a new connection to the bigger machine and, once that connection is made (and the data is being proxied to the smaller machine), cut the connection to the smaller machine. Users see no loss of service or reconnects at all.

3) Once all clients are being proxied by the bigger machine, pause, transfer the game state from the smaller machine to the big machine, and then continue. Obviously it works best if a chunk of state can be transferred in the background, so that the final transfer (and pause) only has to carry over the bang-up-to-the-minute state and is as short as possible.
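The three steps above can be sketched as a toy simulation. This is only an illustration under stated assumptions: `GameServer`, `Client`, `migrate` and `finish_migration` are hypothetical names, game state is reduced to a flat dictionary, and deleted keys are not handled. It is not how any real game server does it.

```python
class GameServer:
    """Hypothetical stand-in for one server instance hosting a solar system."""

    def __init__(self, name):
        self.name = name
        self.state = {}        # game state as a flat key -> value map (assumption)
        self.upstream = None   # set while this server only proxies to another one
        self.paused = False

    def handle(self, key, value):
        """Apply a client update, or forward it while acting as a proxy."""
        if self.upstream is not None:
            return self.upstream.handle(key, value)
        if self.paused:
            raise RuntimeError("server paused for final state transfer")
        self.state[key] = value
        return self.name       # name of the server that actually applied the update


class Client:
    def __init__(self, server):
        self.server = server

    def send(self, key, value):
        return self.server.handle(key, value)


def migrate(small, big, clients):
    # Step 1: the big machine starts out as a proxy in front of the small one.
    big.upstream = small
    # Step 2: each client connects to the big machine, then drops the old connection.
    for client in clients:
        client.server = big
    # Step 3, first half: copy the bulk of the state while the game keeps running.
    snapshot = dict(small.state)
    big.state.update(snapshot)
    return snapshot


def finish_migration(small, big, snapshot):
    # Step 3, second half: brief pause, send only what changed, then resume.
    small.paused = True
    delta = {k: v for k, v in small.state.items() if snapshot.get(k) != v}
    big.state.update(delta)
    big.upstream = None        # the big machine is now authoritative
    return delta
```

The point of splitting the copy in two is that the pause only has to cover the delta, which is whatever changed between the background snapshot and the cut-over.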

Option (A) is always the proverbial "In v2 of the server we'll do it a completely different way..."

Which never seems to come around, because the new hardware is so much faster that it can host what were previously problematic server loads without a problem, and you've got a million other things to write.

Yet players have a tendency to figure out when places are too overcrowded to be fun. So your old problematic load is almost never representative of how many players wanted to be in that area, but merely how many players were willing to put up with that level of degraded performance.

So upon release (or sufficiently close to it to start stress testing, which is conveniently when it's too late to really change architecture) the new limits are quickly hit.

They've been struggling with the issue of multi-thousand player fights for a while now, and have moved towards both of these solutions but are obviously not quite there yet.

For instance, the article actually talks about having said huge machines. There's a way in EVE to inform the GMs about anticipated big fights, at which point they'll do the reinforcement preemptively. In this case, there wasn't such a convenient warning.

I find this post particularly interesting, since what's described (outside of doing it at a VM level) somewhat reflects how some Telecom providers build their equipment. Telecoms in North America are properly crazy when it comes to recovering from failure with minimal visible impact to customers.

Usually on the Telecom equipment, the backup / state transfer is done at a process level, not at a VM level as suggested, but it's quite common practice.

The best equipment I've seen does this by spawning many equivalent processes and distributing them among the available blades in the chassis. If you have process mgr1, you get a backup1 process on another blade. As mgr1 processes your call state, it checkpoints all critical data to the backup1 process. If the mgr1 process itself crashes, or the entire blade fails, all the processes are simply re-spawned, contact their corresponding backup process, transfer all the state information back, and simply resume. Using this method, I've seen equipment recover well over 30,000 subscriber sessions in under 5 seconds. Most end users wouldn't even notice, and even if you did, it wouldn't be enough to drop your data connection (VPN, video streaming, or whatever you're doing). We also don't lose your bill for the usage either ;)
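A minimal sketch of that manager/backup pairing, assuming in-memory objects in place of real processes on separate blades. `Manager`, `Backup`, `update` and `respawn` are made-up names for illustration, not any vendor's API.

```python
class Backup:
    """Hypothetical backup process that would live on another blade."""

    def __init__(self):
        self.checkpoints = {}

    def store(self, session_id, data):
        # Keep a copy so later changes in the manager don't alias the checkpoint.
        self.checkpoints[session_id] = dict(data)


class Manager:
    """Hypothetical session manager (the mgr1 process in the comment above)."""

    def __init__(self, backup):
        self.backup = backup
        self.sessions = {}

    def update(self, session_id, **fields):
        session = self.sessions.setdefault(session_id, {})
        session.update(fields)
        # Checkpoint the critical data every time it changes.
        self.backup.store(session_id, session)

    @classmethod
    def respawn(cls, backup):
        """Start a fresh manager and pull all session state back from the backup."""
        mgr = cls(backup)
        mgr.sessions = {sid: dict(data) for sid, data in backup.checkpoints.items()}
        return mgr
```

After a crash, `Manager.respawn(backup)` gives a new manager holding the last checkpointed state for every session, which is why deciding what gets checkpointed, and when, matters so much.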

The challenge with applying this to the game environment is that in telecom each user session is independent and doesn't really interact with other sessions, so we don't have the issue of a single process becoming overloaded and needing to free up resources to handle it. However, it would be properly easy to do within this model, since failure is expected to occur and be recovered from.

As a programmer, you have to be properly diligent in the software design: what gets checkpointed, and when it occurs. I couldn't even imagine trying to retroactively apply this type of design to "legacy" software that wasn't built from the ground up with this model in mind.

Option 4C) Create a Raspberry Pi cluster farm, and host every player with one microserver.
How would the microservers keep up to date with each other fast enough to be useful? "eventual consistency" doesn't work with real time strategy games.
How would they not? Latency between servers would be near zero, and if they all handled point to point transactions while broadcasting their state on each transaction, you're getting all the power you need. We're not talking about complex processing here. It's an RNG coupled with simple health and xyz.
Broadcasting their state to what? All the other nodes? So each "microserver" would still need to handle thousands of connections and data for every player?
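To put rough numbers on the broadcast objection, here is a back-of-the-envelope sketch. The figures are assumptions for illustration (3,000 players and one state update per player per second), not measurements from EVE, and the function names are made up.

```python
def broadcast_messages(nodes, updates_per_node):
    """Total messages when every node sends each update to every other node."""
    return nodes * (nodes - 1) * updates_per_node


def inbound_per_node(nodes, updates_per_node):
    """Messages a single node has to receive and process in the same period."""
    return (nodes - 1) * updates_per_node


# Assumed fight size: 3,000 players, each broadcasting once per second.
total = broadcast_messages(3000, 1)   # 8,997,000 messages per second overall
each = inbound_per_node(3000, 1)      # 2,999 messages per second per microserver
```

Total traffic grows with the square of the player count, and every node still has to take in everyone else's updates, so low latency between the boxes does not remove the per-node load.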
The real problem for them, the one that forces them to use time dilation, is the fact that a solar system can only run on one core. I believe they have a few machines with a 4.4 GHz or greater processor for systems that are under high load, but currently they are extremely limited in what they can do hardware-wise.
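A rough sketch of the idea behind time dilation on a single core: if one simulation tick needs more CPU time than the tick is supposed to last, slow game time down by that ratio instead of dropping work. The one-second tick and the 10% floor are assumptions chosen for illustration, and the function names are hypothetical, not CCP's implementation.

```python
def dilation_factor(work_seconds, tick_seconds=1.0, floor=0.1):
    """Fraction of real-time speed the simulation can sustain on one core.

    work_seconds: CPU time needed to process one tick's worth of game events.
    tick_seconds: how long a tick lasts at normal speed (assumed 1.0).
    floor: slowest allowed speed (assumed 10% of real time).
    """
    if work_seconds <= tick_seconds:
        return 1.0
    return max(floor, tick_seconds / work_seconds)


def simulated_elapsed(real_seconds, factor):
    """Game time that passes during real_seconds of wall-clock time."""
    return real_seconds * factor
```

With these assumptions, a tick that needs 4 seconds of CPU gives a factor of 0.25, so a real minute advances the game by only 15 seconds. That is the slowdown players experience.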
>It's like each solar system is a VM, and they can't move their VMs to a new physical server without disconnecting all clients. And all of their physical servers are at 100% load all of the time?

I wasn't sure how to interpret that either.

I've always been a bit surprised that the backend systems for these MMOs aren't a bit more flexible. Though, EvE did launch nearly 10 years ago and it has never had a huge number of subscribers.

I have no idea whether more recent games have solved these problems.

>I have no idea whether more recent games have solved these problems.

Most games solve this problem with completely separate servers and population caps. I know in Guild Wars 2 they actually stop displaying players past a certain number to improve performance. This works fine in calm areas like cities, but in PvP it causes issues with being killed by "invisible" groups that the game fails to load in time.

I may be remembering incorrectly, but last I remember reading, CCP was basically in uncharted territory on the tech front as far as EVE Online is concerned. No other game developer has even tried to tackle this problem at the level they have, and I can't think of many applications in the world that could be comparable in scale and complexity to what the EVE servers have to deal with.

Edit: To summarize, even being 10 years old, no game has come close to matching it in this context.

Guild Wars 2 is stupid about this. At least on release, they stopped displaying ENEMIES FIRST. So if you're in a big enough group, a couple enemies can walk up and none of you can see them b/c you hit the limit from displaying allies, but they can still see some of you guys and kill you while invisible.

If they prioritized displaying just party members (parties only go up to 5 people) and maybe people on your friends list, and then enemies, it would have worked better.
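The ordering suggested above can be sketched as a stable sort under a display cap. This is a hypothetical illustration, not Guild Wars 2 code: the player records, the `enemy` flag, and `visible_players` are all made up.

```python
def visible_players(nearby, party, friends, limit):
    """Choose which nearby players to render when there is a hard display cap.

    Priority: party members, then friends, then enemies, then other allies.
    """
    def priority(player):
        if player["name"] in party:
            return 0
        if player["name"] in friends:
            return 1
        if player["enemy"]:
            return 2
        return 3

    # sorted() is stable, so players with equal priority keep their original order.
    return sorted(nearby, key=priority)[:limit]


nearby = [
    {"name": "ally1", "enemy": False},
    {"name": "ally2", "enemy": False},
    {"name": "thief", "enemy": True},
    {"name": "buddy", "enemy": False},
    {"name": "pal", "enemy": False},
    {"name": "ally3", "enemy": False},
]
shown = visible_players(nearby, party={"buddy"}, friends={"pal"}, limit=4)
```

With a cap of 4, the enemy still gets a slot ahead of the anonymous allies, which is the behaviour the comment is asking for.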

They supposedly just fixed this in the patch yesterday. As a player who gets owned by unrendered thieves, I'm not convinced. -_- (OTOH, I'm also pretty bad at PvP anyways, so whatever.)
400k+ active players IS a huge number. Especially considering they are in the same logical server. Other MMOs handle this by having copies of the world and splitting the population - which is the same approach used by Ultima Online back in 1997.

So actually, EVE is way ahead of the rest, technology-wise.

>So actually, EVE is way ahead of the rest, technology-wise.

That doesn't seem like a sensible comparison considering how much the gameplay of EvE differs from the typical MMO.

In any case, I'm not making a dig at EvE, rather acknowledging that the game has had a long life and has never been a multi-million subscriber behemoth. Therefore, they might have made concessions, or allowed previous limitations / decisions in design/infrastructure to persist for lack of resources [1][2].

1: http://news.ycombinator.com/item?id=5135873

2: http://massively.joystiq.com/2008/09/28/eve-evolved-eve-onli...

>That doesn't seem like a sensible comparison considering how much the gameplay of EvE differs from the typical MMO.

The "typical" MMO doesn't even try to handle large populations. Separate servers, population caps and login queues are how just about everyone else deals with congestion problems.

Even from your second link there is the quote "Working with IBM, the EVE server cluster is maintained in London and is currently the largest supercomputer employed in the gaming industry." Sure that is from 2008, but newer MMOs are built using the same overall server architecture you saw back in the days of Ultima Online.

400k subscribing users is a great big pile of cash to work with.
I wouldn't call over 400,000 players '[not] a huge number'. It's the 2nd or 3rd most popular paid-subscription MMO.
As you can see here http://eve-offline.net/?server=tranquility , there are about 44k accounts online at this point, and probably somewhat fewer people, since most people have more than one account.

That seems like a fairly substantial number of users in a single world. Is there any other game with this many active users in a single world?

The specific, relative number of subscribers is not the point.

That said, I wouldn't call something huge when there are comparables (even if it's just one or two) which dwarf it and others which (at some point) exceeded it.

I can't think of any games that deal with PVP on this scale either.
If memory serves, they mean that the servers run at 100% capacity, not 100% load, i.e. CCP does not throttle back system resources during periods of light use.