| At Dream11 we love GraphQL :) It has made the lives of both service owners and clients (Android, iOS, Web) much easier! Over the years we have packed the server with almost every GraphQL optimization we could find on the internet. The blog outlines some of the key optimizations we put in to improve the performance of our application code (which doesn't have a lot to do with GraphQL, as most people have already commented). I still want to give a bit of an "insider's perspective" as much as I can, so here it goes:

1. The GraphQL team that did the optimizations had two engineers actively working on it. It seemed like a futile project at first. The goal was to find low-hanging fruit (if any) and prepare for our peak season (IPL 2021), but eventually to find other long-term alternatives. Killing GraphQL altogether and moving that logic to the clients was still on the table. Fortunately, the team did such a fantastic job of optimizing it that we are now committed to supporting it long-term.

2. We try to keep our microservices as discrete, pointed, and unopinionated as possible. We also indulge the clients by letting them query huge amounts of data at once. All this makes our GraphQL layer seriously complex. There is a huge amount of computation that happens on this layer. To get some perspective, our /health call to the server is 10x faster than the most requested GraphQL query. Needless to say, it's not a fair comparison because, unlike the query, /health doesn't make any network calls or carry any practical CPU load.

3. We have caching implemented on our GraphQL clients; however, the reason we get such a high request rate is that our concurrency is also very high. A typical user barely makes 10 requests in a minute, but overall we reach millions of requests per second.

4. As a part of the long-term strategy, we did consider Rust as our stack of choice. We had heard a lot of noise about how Rust was beating all the benchmarks.
So we did some POCs internally and implemented a part of our GraphQL service in Rust. What we learned was that the Rust implementation was ~2.5x faster than our Node.js implementation and also consumed relatively less memory. This was fine, but it wasn't good enough to justify migrating our large Node.js codebase and learning a completely new stack. Building a team with domain expertise in Rust in India is particularly hard.

5. It might seem like we are not pushing the production servers hard enough, and you'd be surprised to know that it's true! Because our traffic is very unpredictable, we like to maintain a comfortable CPU utilization for every possible extreme scenario that our Data Science team can predict. Our edge layer going down would seriously hit revenue. So even when our benchmarks say we can push the systems 5x more, the final call remains with the Site Reliability teams and the risk appetite we have for that particular game.

6. The blog also briefly talks about using multiple ELBs, to which we distribute traffic using DNS. The problem with DNS is that it doesn't guarantee a truly uniform distribution of the traffic. Even with a very low TTL, we sometimes observe a difference of more than 20% in requests/sec between two ELBs at a given instant. This and other infrastructure-specific nuances have to be considered by the SRE teams when estimating capacity in production.

7. Lastly, the servers we use in production are small machines: 8 cores for the majority of our stack. This lies in the Goldilocks zone where we get the best cost-to-performance ratio. Scaling the machine type down or up has a significant impact on cost.

It's been a journey of love and hate with GraphQL, and we continue to invest in making our edge robust and even faster.
Feel free to connect with us at https://twitter.com/D11Engg |
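Points 3 and 6 in the comment above rest on some back-of-the-envelope arithmetic, which the following minimal sketch spells out. The figures are illustrative assumptions, not numbers from the comment: "millions of requests per second" is taken as exactly 1,000,000, and "a difference of more than 20%" is read as the busier of two ELBs receiving 1.2x the requests/sec of the quieter one. The function names are hypothetical.

```python
def concurrent_users(total_rps: float, requests_per_user_per_min: float) -> float:
    """Active users needed to produce total_rps at the given per-user rate."""
    # Each user contributes requests_per_user_per_min / 60 requests per second.
    return total_rps * 60.0 / requests_per_user_per_min


def hot_elb_share(relative_gap: float) -> float:
    """Traffic share of the busier of two ELBs.

    Assumes the busier ELB receives (1 + relative_gap) times the
    requests/sec of the quieter one, so the shares are
    (1 + gap) / (2 + gap) and 1 / (2 + gap).
    """
    return (1.0 + relative_gap) / (2.0 + relative_gap)


if __name__ == "__main__":
    # Assumed figures: 1,000,000 req/s in total, 10 requests per user per minute.
    users = concurrent_users(total_rps=1_000_000, requests_per_user_per_min=10)
    print(f"{users:,.0f} active users")  # 6,000,000 active users

    # Assumed reading of "20% difference": the busier ELB sees 1.2x the quieter one.
    share = hot_elb_share(0.20)
    print(f"busier ELB carries {share:.1%} of traffic")  # 54.5%
```

Under these assumptions, a modest per-user rate still implies millions of simultaneously active users, and a 20% gap between two ELBs means capacity has to be sized for roughly 55% of traffic on one of them rather than an even 50%.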
This is the new link. Somehow the old link doesn't work if you have the app installed.