by MBCook 4841 days ago
The thing I don't understand about all this is why they were unable to bring new servers online. After everything hit the fan, they announced they were working on it and had gotten two new servers up the day before.

Two servers?

It just seems unbelievable to me that the backend was designed in a way that it couldn't be scaled out any faster than that. Since each region is a discrete unit, you'd think they should be able to move them between servers.

Was it all intertwined? Did the regions, stats, achievements, and DRM all run out of the same database? Were they not separate services?

They had to know this game would be popular; they've been pushing it for months (to great effect). It's a major property and the first release in about a decade.

Then there is EA. Even if Maxis couldn't figure this out (and I doubt that), EA has online experience. They're the publisher for Mass Effect, Madden, Fifa, NCAA, and more. They should have the resources, the people, and the experience to have prevented this.

If you completely ignore the DRM or the seemingly unimportant always-online requirement, this whole thing still seems botched. There were multiple groups who should have known better and prevented this. My understanding is that they got some warning signs during the beta.

I would kill for a postmortem blog or article on Gamasutra explaining why they couldn't scale out faster; to know what decision was the lynchpin that held them back.

2 comments

Each game "Server" for SimCity is actually an Amazon EC2 cluster of servers, with 1 central master DB server. Even when the servers were "full" on game launch, all of the EC2 servers were responding to requests normally - it was the cluster's master DB server that was slow. All of the "servers" are actually in the UK Amazon EC2.

This brings us to the scalability problems and why regions/cities are not shared to all servers. The database is the bottleneck, so sharing regions between servers would only worsen performance.
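A rough way to picture the bottleneck described above, as a toy model with made-up numbers (none of these figures come from EA or Maxis, and the function name is hypothetical): if every request in a cluster ends up at the single master DB, the cluster's throughput is capped by that DB no matter how many EC2 app servers sit in front of it.

```python
def cluster_throughput(app_servers, per_server_rps, db_rps):
    """Requests/sec a cluster can sustain when every request hits one master DB.

    All parameters are illustrative assumptions, not measured values.
    """
    app_capacity = app_servers * per_server_rps
    # The slower of the two tiers sets the ceiling for the whole cluster.
    return min(app_capacity, db_rps)


if __name__ == "__main__":
    # Hypothetical numbers: each app server handles 500 req/s, the DB 4000 req/s.
    for n in (5, 10, 20, 40):
        print(n, "app servers ->", cluster_throughput(n, 500, 4000), "req/s")
```

Past the point where the app tier outruns the DB, adding app servers to the same cluster buys nothing, which fits the report that the EC2 servers were responding normally while the master DB was slow.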

That makes it even more baffling why they couldn't bring up more servers.

If it's all just a chef/puppet based infrastructure in EC2, you should be maybe 20-30 minutes away from pumping out a new 'server'. One is as easy as ten, at that point.

We're talking about EA here. You need to include the latency required to go through enough bureaucracy layers to approve the expenditure of funds on another cluster.

Not to say that there isn't bureaucracy in a company their size, but the launch of a game this size is a really big deal. There's a tremendous amount of PR, and they're getting scathed. Polygon downgraded their review from 9.5 to 4.0 because of the server issues. I think if they could cut a decent size check to fix the issues, they would. The problem is likely in engineering, like a database that isn't scaling.

> The problem is likely in engineering, like a database that isn't scaling.

If it were that simple, just cut the number of users per cluster and throw 10 more up.
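As back-of-the-envelope arithmetic for that idea (a sketch with invented numbers; the player counts and per-cluster caps are assumptions, not anything EA has published), lowering the cap per cluster just means dividing the same player population across more independent clusters:

```python
import math


def clusters_needed(concurrent_players, players_per_cluster):
    """How many independent clusters it takes to hold everyone under a given cap."""
    return math.ceil(concurrent_players / players_per_cluster)


if __name__ == "__main__":
    players = 200_000  # hypothetical peak concurrency
    for cap in (20_000, 10_000, 5_000):  # hypothetical per-cluster caps
        print("cap", cap, "->", clusters_needed(players, cap), "clusters")
```

This only helps if each cluster really is self-contained, with its own master DB and no shared state behind it.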

> I think if they could cut a decent size check to fix the issues

Doubtful within the context of a quick fix, but it is likely the root issue. See above simple solution that takes 30 minutes to roll out. EA is not a company run by engineers; it's not a company run by people that understand anything about engineering. What sounds like a simple solution to us that can easily be implemented by throwing money at it and reaping the customer goodwill is completely foreign to a company like that. You may as well be speaking Klingon when you make the recommendation to just throw new clusters at it.

I've met enough people who worked at EA to know they have competent engineers and managers -- they are not completely inept. When hundreds of millions of dollars are suddenly at risk, you have a clear channel straight to the CEO to get the resources that you need.

I'm giving EA the benefit of the doubt that they ruled out 30-minute fixes. I can't see how any of us can really speculate as to how long it should take to fix when we don't really know any details. For example, if it was a database bottleneck, would you commit to walking in and fixing it in 30 minutes? Or even 30 hours? I think you'd want to know the details, because the scope can easily be off by 1-2 orders of magnitude.

Databases are often the bottleneck of an infrastructure. You can't just spin up new instances to fix the problem. Many times you have to completely re-architect your database schema and architecture to handle the increased demand.

A DB bottleneck and lousy regionalization seem like exactly the kind of thing EA's experience in this area would prevent. What gives?

While I don't know for certain, I can't possibly imagine that they've just added two physical machines. Small websites operate with more servers than that, let alone a AAA title from a major studio. I don't think they literally mean two servers - it sounds like they're trying to make the explanation simpler for consumers. What they're probably referring to is a self-contained cluster of machines that runs all of the necessary services that the game relies on - databases, notifications, social rankings, etc.

That being said, all of this wouldn't be needed if they'd just release an offline mode. The upcoming DLC to unlock expected features (bigger cities, more transportation options, etc.) is bad enough, but the always-on requirement just makes this game impossible for me to buy.

"While I don't know for certain, I can't possibly imagine that they've just added two physical machines."

I can totally imagine that. Be afraid. Be very afraid. :)

I've actually seen entire companies and universities run on two physical machines, for sufficiently large values of two physical machines.

When I was working there, the University of Oregon ran almost all of the routine computation and webpages on one big Sun box that was about as fast as my calculator, and one enormous Vax box that was about 8086 level speed.

Your explanation sounds more plausible. It could be more akin to Blizzard adding 2 new realms to their realm list, rather than 2 physical computers.