Hacker News
by goodplay 3633 days ago
I would be very happy if Niantic released an in-depth write-up on how the backend was implemented and what measures they took to make it scale to handle this kind of traffic.

I don't care much for the game, but the details of why the servers weren't (aren't?) able to handle the traffic interest me immensely.

12 comments

I look forward to what they can further do to improve things.

For instance, they don't cache pokestop images: my phone keeps redownloading the same images from pokestops and gyms. If it just cached the top 10 most visited pokestops and gyms, that would be a decent reduction in data use.
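
A minimal sketch of that kind of client-side image cache, in Python for brevity. Everything here is hypothetical: `ImageCache`, `fetch` and the capacity of 10 are illustration names, not anything from the actual app, and it evicts the least recently used image as a simple stand-in for "most visited".

```python
from collections import OrderedDict


class ImageCache:
    """Small LRU cache keyed by image URL (hypothetical, not Niantic's code)."""

    def __init__(self, fetch, capacity=10):
        self._fetch = fetch            # callable that downloads one image by URL
        self._capacity = capacity      # maximum number of images kept locally
        self._items = OrderedDict()    # URL -> image bytes, oldest use first
        self.downloads = 0             # number of real network fetches so far

    def get(self, url):
        if url in self._items:
            # Cache hit: mark as most recently used, no network traffic.
            self._items.move_to_end(url)
            return self._items[url]
        # Cache miss: download once and remember the result.
        data = self._fetch(url)
        self.downloads += 1
        self._items[url] = data
        if len(self._items) > self._capacity:
            # Drop the least recently used image.
            self._items.popitem(last=False)
        return data
```

Revisiting the same handful of pokestops then costs one download per image instead of one per visit.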

Also, it would help if they offloaded some of the GPS work onto the phone's local step/direction trackers.

I can understand if they don't want to send coordinates of "nearby pokemon" down to the clients because that would inevitably get hacked, but the phone has to repeatedly poll.

If the phone tracked more locally it could not poll GPS until it detects locally that it's moved enough, so if you weren't moving you wouldn't be updating your position very often.

Also I read something that reckons that the game servers are currently all hosted in the US, which would explain the laggy battle experience in Europe. (Freezing at 1hp left, never being quite sure what's hit your character or good times to attack).

Offloading the GPS work to the step/direction trackers in the phone would also be great for those of us that work in big old buildings that do a great job of blocking GPS.
More on the images, since the images are just the ones that were popular during Ingress (I mean the same images, like some of the ones I took a few years ago in Ingress are now in Pokemon), they probably could have done even more aggressive caching for the very popular spots.
FWIW I'm in the US and I get the laggy/soft locking up experience you mentioned all the time. I think they just didn't write the app to be very resilient to any RPC failures.
> If the phone tracked more locally it could not poll GPS until it detects locally that it's moved enough, so if you weren't moving you wouldn't be updating your position very often.

That's how the Android Location service works, unless you intentionally bypass it. You ask the location service to update you when it determines the user has moved X feet.
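
The "tell me when the user has moved X" idea can be sketched in plain Python. This is only an illustration of the filtering logic, not the Android API; `MovementGate` and `haversine_m` are made-up names, and the threshold is in metres rather than feet.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points (degrees)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


class MovementGate:
    """Reports a position only if it is far enough from the last reported one."""

    def __init__(self, min_distance_m):
        self.min_distance_m = min_distance_m
        self.last_reported = None  # (lat, lon) of the last position sent upstream

    def should_report(self, lat, lon):
        if (self.last_reported is None
                or haversine_m(*self.last_reported, lat, lon) >= self.min_distance_m):
            self.last_reported = (lat, lon)
            return True
        return False  # user has not moved enough, skip the server update
```

A stationary player then produces one update and nothing more until they actually walk somewhere.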

I'm pretty sure they had immense scaling capabilities, but were hitting upon bottlenecks which were unexpected. I'd be also really interested in what those were.
It's just not economically viable to scale your architecture to day-one or week-one traffic requirements. Look at Blizzard games for comparison, a new release every year or so and they're NEVER equipped to handle it. They know that by the end of the month, user count will drop sharply before stabilizing at a much more manageable level.
I have to disagree with regards to Overwatch, by far Blizzard's biggest IP in recent memory. The launch was perhaps the smoothest I've ever seen in a popular online game. A tiny minority of people had problems, in stark contrast with completely unprepared and utterly botched launches such as Battlefield 4. Blizzard did themselves a huge service with the open beta (which had over 9 million people play over the course of a week!), giving them an idea of where bottlenecks would crop up.
> a new release every year or so and they're NEVER equipped to handle it

Which is 100% untrue for Overwatch. They've clearly learned their lessons.

If this is true, your horizontal scaling is the problem. Not taking advantage of those users with auto scaling is leaving money on the table.
For every Blizzard there's a Netflix, who never has any problems with a big release nowadays.
Blizzard had very few issues with their largest launch in recent history.
Right, but it's odd that "very few issues" for a game (Overwatch in this case) includes multiple hours of outage in like the first week or two. (Though like you point out, it is way less than other popular releases.)

There's just a staggering difference between the number of 9s offered by game companies vs more reliable parts of the industry. Shit, most online games still don't have login servers that scale easily to meet dynamic load.

Another good example of this is the Steam sales, especially the big ones (the summer and Christmas sales render the store completely unusable for the first day or so).

It's "good enough" for the week of the sale; upgrading all the servers for a week of super heavy traffic just to see them idle the rest of the year isn't a great business plan.

There are several firms that offer computing capacity online, at much more granular terms than "the rest of the year". It seems these "cloud" computing services are common topics of discussion here on HN.
The calculus has to be how much it would cost to temporarily bring in, run, maintain e.g. AWS boxes versus how much revenue is left on the table by the slow store. I presume Blizzard has run those numbers and decided it wasn't worth it.
Someone posted a list of protobuf definitions [1] used in Pokemon Go a few days ago. At the time I didn't think the submission was very interesting in its own right, but it seems relevant to your question now -- you can examine the names of the definitions to guesstimate what logic is performed on the server.

[1] https://news.ycombinator.com/item?id=12081447

I suspect one reason for heavy traffic is that a lot of the logic you would expect to implement on the client actually needs to execute on the server, in order to avoid client-based hacks. For example, the logic of whether the poke ball successfully catches the Pokemon. Sure, you could implement that on the client, but then you'd have third party apps cheating at the game without any verification from the server.
You can have both: if it's just randomness, then you can have shared seeds which the server will verify but the local client can sample too. If they de-sync then the server wins, but the client can give feedback quicker, before the final response from the server.
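
A rough sketch of that shared-seed scheme, assuming the server hands the client a per-encounter seed. `catch_roll`, `server_verify`, the seed format and the catch rate are all invented for illustration; nothing here is known to match what the game does.

```python
import random


def catch_roll(seed, throw_index, catch_rate):
    """Deterministic roll: same seed and throw index give the same outcome."""
    rng = random.Random(f"{seed}:{throw_index}")
    return rng.random() < catch_rate


def server_verify(seed, throw_index, catch_rate, client_claim):
    """Server recomputes the roll; on a mismatch the server's result wins.

    Returns (authoritative_result, in_sync).
    """
    authoritative = catch_roll(seed, throw_index, catch_rate)
    return authoritative, authoritative == client_claim


# Client side: show the predicted result immediately, then reconcile.
predicted = catch_roll("encounter-123", 0, 0.4)
result, in_sync = server_verify("encounter-123", 0, 0.4, predicted)
```

Because the client can run `catch_roll` for any future `throw_index`, the sketch also makes the look-ahead problem easy to see.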

I suppose that would enable further cheats such as not wasting pokeballs on ones you know will fail by peeking into the future.

The throw itself is client-side though, so I suspect it's already possible to cheat with "perfect throws" every time.

From my experience with the game, I believe they have a lot of room for improvement.

For example, they don't cache anything client side (even Pokestop assets), nor do they timeout/retry/prefetch after a network disconnect during a Pokemon capture (which is very annoying on spotty 3G networks). Last but not least, battery consumption is insane. Don't go out without a spare battery! Backend side, I also noticed a few cases of "blinking pokemon" in the nearby grid. Surely symptomatic of synchronisation issues between instances of distributed systems (dirty reads?).
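
For the missing retry behaviour, a minimal sketch of retry with exponential backoff might look like this. `call_with_retry` and its parameters are hypothetical, and it assumes the request is safe to repeat (a capture request would need to be idempotent on the server for this to be correct).

```python
import time


def call_with_retry(request, attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call request(); on ConnectionError wait 0.5s, 1s, 2s, ... and try again."""
    last_error = None
    for attempt in range(attempts):
        try:
            return request()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off before the next attempt, doubling the delay each time.
                sleep(base_delay_s * 2 ** attempt)
    # Every attempt failed: surface the last network error to the caller.
    raise last_error
```

The `sleep` parameter is only there so the delay can be swapped out in tests.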

I enjoy the game a lot, but given the number of issues, it's still a beta.

One of the reasons is probably that people are using Pokemon Go in countries where it hasn't been released yet. For example, here in The Netherlands there are 1.3 million+ players, even though the game is not even available yet in the Android or iOS app stores in The Netherlands.
I'm currently traveling in Hungary and was able to download the game on the first day of release. Is that because my Apple account has a US billing address?
I use Android myself, but I've heard from colleagues that if you use a US account you can download it here. For Android it's just a matter of downloading the executable through any channel that isn't Google's Play Store.
They also had a tonne of markets come online when they weren't expecting it - the soft launch in Oz/NZ would have been to work through those, but once the world came online they couldn't hold it back, so might have had to accelerate the roll-out faster than they'd have liked? Technicalities of it will be interesting.
I think Niantic learned a very valuable lesson about soft-launching an incredibly anticipated game.
They seem to be using Google Cloud and Java (https://www.nianticlabs.com/jobs/). My best guess is that they are using Google App Engine.
That's client side; they're using the Unity game engine there.
I may be naive, but Firebase may be the perfect fit for this kind of task since it scales pretty well. Niantic would probably get massive discounts as well, as an ex-Google company...
Very or even Ingress on a website like highscalability.com
What's more interesting is the way Ingress checks for link intersections at that scale (and it checks it for each key in inventory).
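
How Ingress actually does this isn't public as far as I know, but the basic test is a segment intersection check. A naive sketch, treating coordinates as flat x/y points (real links live on a sphere) and scanning every existing link; `links_cross` and `can_link` are made-up names:

```python
def orientation(p, q, r):
    """Cross product sign: >0 if r is left of p->q, <0 if right, 0 if collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def links_cross(a, b, c, d):
    """True if segment a-b properly crosses c-d (touching at a shared portal doesn't count)."""
    d1 = orientation(c, d, a)
    d2 = orientation(c, d, b)
    d3 = orientation(a, b, c)
    d4 = orientation(a, b, d)
    # Each segment's endpoints must lie strictly on opposite sides of the other.
    return d1 * d2 < 0 and d3 * d4 < 0


def can_link(new_link, existing_links):
    """A new link is allowed only if it crosses none of the existing ones."""
    return not any(links_cross(*new_link, *link) for link in existing_links)
```

Doing that linear scan for every key in an inventory is exactly the part that gets expensive, so presumably the real thing uses some spatial index to narrow down the candidates.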