Hacker News new | ask | show | jobs
by 1996 2905 days ago
Amazon performance is generally bad when using other services.

We are rolling out a CDN, with a goal of 20 ms latency in most countries. We want more granularity that AWS - and some zones are just not well served (No AWS in Africa, incomplete offer in Brasil, etc)

Still, we figured we would use Route 53 as you can do Latency Based Routing even with non-AWS servers. Computing latency or using EDNS0 as a proxy is not rocket science, so we thought the DNS would not be a limiting point.

Oh boy, how wrong we were! After wrongly blaming the bad performance on Cloudflare caching, further tests revealed Route 53 takes as much as 0.7s to reply to some DNS queries - and even worse when fronted by Cloudflare, as for some reason the DNS TTL seems to be ignored by Cloudflare. The latency only drops down after about 4 queries, which makes me thing they have some kind of Round-Robin that does not share the DNS queries (I could be wrong)

In the article, the author says: "Most of that delay is DNS however (Route53?). Just showing the time spent waiting for a response (ignoring DNS and connection time)". No you should not ignore the DNS delays! Route53 performance is very bad - 2 full seconds for you!!

We are fortunate it did not take 2s for us. Still, having servers all over the world that reply in 20 ms is useless when the first DNS query takes 700ms.

We ended up leaving for Azure: Traffic Manager outperforms Route 53 by a factor of 2.

Eventually, we will roll our own GeoIP with DNS resolvers on a anycast subnet.

I do not understand how this level of "performance" can be tolerated. At 2 seconds for a DNS query, you are better off using the registrar free DNS service!!

1 comments

Saying Route53 takes “2 seconds” to resolve is pretty meaningless without a distribution or at least percentiles. Route53 obviously doesn’t take 2 seconds for all or most queries.
I am quoting the author, and their analysis of the initial query. This observation from whoever wrote the article matches my own experimental results: initial queries are very very slow on Route 53 LBR. A distribution of queries is useless and misleading, as later queries are cached if you have a sufficient TTL - so only the first few really matters in the performance results.

Later queries are fast of course, as the results are cached (TTL).

Even if the DNS is very pooly configured, all queries after the first one will benefit from the cache!! So the first few queries matter much more, and this is what we should be talking about instead of distributions and percentiless.

Said differently: If each of your visitor has to way a second or two until the site comes up the first time, then the site works normally, it may still give them a bad impression.

I measured the DNS delay on first Route53 reply to be over 700 ms personally. For the author it is 2000 ms. These results are in the same order of magnitude, and make Route53 unsuitable for many applications. Of coutse, you could start hacking, like keeping Amazon cache warm by issuing queries through chron, or by setting extremely long TTLs and hoping your visitors DSL modem will keep your A records in cache as long as you asked for - but these are just hacks trying to compensate that the first DNS query takes SECONDS to process.

Route53 LBR DNS is not as a "slow and requiring hacks". It's supposed to be fast, simple to run, and to ingrate with different ecosystems. To me, it seems to be none of that.

After assessing Route53 as fubar, I switched from AWS to Azure: TrafficManager offers the same features, and the first request takes less than 350ms. There must still be some cruft in there, but at least it is manageable.