Lessons learned from running GraphQL (blog.dream11engineering.com)
62 points by arbobmehmood 1734 days ago
13 comments

The big items in that list of performance issues don't seem to have anything to do with GraphQL. They seem to be related to a heavily functional style using lots of immutability and the Ramda library. I'd also suspect that these choices are responsible for the GC issues due to lots of allocations for the immutable objects.

I know, premature optimization and all that. But I think at the point where you're going for microservices because of scaling you really should also look into the lower level issues like that from the start. You should notice that the shiny library you're using is 100x slower than just writing plain code. And you should be aware of excessive allocations in hot paths.

If you're using microservices, it may be a good opportunity to try a language that supports immutability and the heavy functional style without the loss of performance. Something like Elixir, Haskell, OCaml, Rust.
At around one million requests per second, imperative programming becomes affordable; and more importantly, necessary.
Do I read this right? 1,000,000 requests/sec across 7,500 instances is only 133 requests/second per instance, and graphql wouldn’t typically represent the business logic or data layer.

I love me some graphql, but that seems to be a very low figure. I’m curious how complex the queries are and what else these servers are doing.
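
A quick back-of-the-envelope check of that per-instance figure, as a minimal sketch. It uses only the two numbers quoted from the article; the variable names are just illustrative.

```python
# Figures quoted from the article: ~1,000,000 requests/sec over ~7,500 instances.
total_rps = 1_000_000
instances = 7_500

# Average load each instance would carry if traffic were spread evenly.
per_instance_rps = total_rps / instances
print(f"{per_instance_rps:.0f} requests/sec per instance")  # prints "133 requests/sec per instance"
```

This is an average; it says nothing about how heavy each request is or how much headroom is kept on purpose.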

I've seen similar situations where each host was achieving just 16 requests/sec.

Engineers are expensive, and growing more so every year. It's hard to justify time spent to optimise rather than throwing more instances at it. The cloud has made this worse in a way, since provisioning more hosts can be done so easily.

Not many engineers even have the skill to identify and resolve performance problems, so again, people just keep adding more machines. Long term, the problem slowly builds all over the system and the bill becomes mind-boggling.

I do think that we (the software folks) don't help ourselves here. We build frameworks and tools that are still far too hard to inspect. What to watch (and how to optimise) in production, is often never considered deeply when building or documenting the hot new thing.

Well, adding machines allows you to postpone fixing the problem.

It's perfectly reasonable, once the bill starts to catch up to your budget you spend time on optimizations :)

EDIT: the only solid argument against throwing machines at the problem is that scaling something across multiple servers is hard. If you had spent energy on performance, maybe you wouldn't have had to.

Not sure why this is downvoted, there are def some issues here. I find the same, there are many things like GraphQL but even SQL libraries or entire frameworks that are not easy to performance-test, or perhaps they are easy to performance test but hard to resolve.

I also agree that most of the devs I have ever worked with, in the UK, have little to no idea about how to actually test performance effectively.

Even though I am personally really interested in performance, even using a cool tool like Resharper profiler takes some time to get your head round.

I get it, but at this scale I think we are talking in the order of a million bucks a year. I don’t know the situation in India, but I imagine that buys a lot of engineers.
From my own experiences building/scaling graphql, this is embarrassingly low. The poor performance is definitely coming from poor code/lib selection.
Ok, we've taken "at scale" out of the title above.
I wonder if they feel like GraphQL was worth it, vs. normal API servers. Maybe they saved some dev time on the front end, but did that outweigh the dev time spent on building, optimizing, etc? Somehow I doubt it.
I’ve been using graphql for years. In my experience, it dramatically simplifies microservice API architecture vs “normal” API servers.

Graphql is super easy to understand, easy to deploy, easy to scale and easy to grow.

It’s not perfect - the lack of namespaces can be a pain, a few more standard types would be good, and mutations feel a bit under baked - but there’s much to love, and very little to dislike.

Downsides afaik are: (1) No way to do queries which return recursive JSON objects of arbitrary depth. (2) Not using standard JSON as a format for writing your query, instead unnecessarily making up a new querying lang, a design flaw basically. (3) More dependencies in frontend as well as backend. (4) More difficult to determine what exactly is going on in processing 1 query, ergo more difficult to fix performance problems.
Can’t say I agree.. (1) you can easily create a json data type to emit arbitrary json in your response if that floats your boat (2) graphql queries can be far more expressive than straight json (3) graphql is just a REST call that takes a string and some optional JSON and returns JSON, no need for client side libraries unless you have complex use cases that are enabled by Graphql, (4) this has not been my experience, queries are run by the backend services, not by graphql itself, so the complexity of an individual backend query does not change.

Just my experience, we can agree to disagree!
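
To illustrate point (3), here is a minimal sketch of a GraphQL call built with nothing but the Python standard library. The endpoint URL, the `match` field and its arguments are hypothetical; the only thing assumed about the server is the common convention of a JSON POST body with `query` and optional `variables` keys.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical endpoint, for illustration only.
GRAPHQL_URL = "https://api.example.com/graphql"


def build_graphql_request(query: str, variables: Optional[dict] = None) -> urllib.request.Request:
    """Build a plain HTTP POST carrying a GraphQL query; no client library involved."""
    payload = {"query": query}
    if variables is not None:
        payload["variables"] = variables
    return urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical query against a hypothetical schema.
request = build_graphql_request(
    "query Match($id: ID!) { match(id: $id) { name startsAt } }",
    {"id": "42"},
)
# Sending it would be: json.load(urllib.request.urlopen(request))
```

The request is only built here, not sent, so the sketch runs without a server.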

OK, it's true that you can't do recursive queries to arbitrary depth with standard GraphQL queries, but I think most people who are familiar with GraphQL would consider that a feature: the query and the returned object are the same shape, and this correlation has useful properties that I've exploited in the past.

Nevertheless, there are ways to include arbitrary JSON in GraphQL responses and since the JSON really is arbitrary, you can include any JSON you want, to any arbitrary depth, and the field parameters that drive the JSON query can themselves also be arbitrarily complex.

I'd say that's also a feature of GraphQL: easy things are easy and hard things are possible. But still, if you have a use case that requires this functionality then GraphQL might not be for you, and that's OK. Nobody's forcing you to use it.

I mean, GraphQL is also poor at serving binary data. You can just use a different endpoint for it. It's not an all or nothing thing.

I think it would depend mostly on the diversity of combinations of data the frontend needs. Their GraphQL implementation is essentially automating the process of frontend teams asking the backend for new endpoints or configurations of existing endpoints to deliver new combinations of data for use by frontend clients. It’s pretty easy to see that for certain frontend requirements the backend GraphQL will be worth it, and for other frontend requirements the backend work would not be worth it. In this case I’m relatively confident that they’re coming out ahead.
I assume that to prevent DoS attacks, the backend would need to know the list of allowed queries. This change would be easier than changing a rest api, for example.
I bet REST API + HTTP caching is going to outperform the GraphQL APIs. And maybe most importantly, it’s going to be cheaper.
I would place a bet along with you on that one. The question the graphql folk need to think about is who's paying for the value add of large one-hit requests. Cycles add up! There's a price to pay for having your API also parse and understand a request beyond just fetching the result.
I'd like to bet against you two. ;) I think GraphQL + a REST (JSON RPC) interface will be an optimal balance between developer experience, performance and security. I'm the founder of WunderGraph and we're doing exactly that: https://wundergraph.com/docs/overview/features/json_rpc
I like this approach, a lot.

I strongly believe developers should spend extra labor to ensure users and runtimes don’t waste cycles. I also believe labor or toil that repeats and can be automated, should. This marries both those ideas.

Finally, while thankfully you don’t say “we take security seriously” on your home page (buzzy but in practice usually a security red flag), your docs and baked-in approach show you genuinely do.

Pinged you via the TypeForm.

You’ll still very likely end up with dozens of http REST calls that could be one GraphQL query, hence one http call. Bias alert (I’m the founder), but I believe you should have a look at GraphQL edge caching: https://graphcdn.io
Most probably a non-issue with HTTP2.
I wonder if they feel like Node was worth it. (seriously)
Exactly, the problems they found were mostly Node problems, not GraphQL problems.
I wonder why we haven't yet seen more optimisation for JS. The only thing I'm aware of is the Google Closure Compiler. With everyone using TypeScript, there must be a way to use all that type information to make JS run more efficiently, or compile parts of your TypeScript code.
Seems to me Elixir + Absinthe would’ve been a better tech stack.
I adore Elixir and have made a good career around it. I'll love it until my grave.

But with that many servers I can't help but wonder if Rust wouldn't be a better choice. I don't think they'll even need 100 which, as OP pointed out, will drastically reduce their operational complexity.

Furthermore, the strong static typing would have avoided the megamorphic functions right from the start.

Yeah, I agree with Rust. I have some simple database-fetching servers that run crypto functions as well (i.e. login), and for simpler operations they can quickly get to 100K+ requests per second on a single host.
> Somehow I doubt it

I don't know. They seem to be satisfied customers, and were simply optimizing an already working pipeline in anticipation of saving money.

As a side note, I haven't seen a single "We tried GraphQL and it failed us" story on HN. Not that they don't exist, of course. It's just that there doesn't seem to be much debate about its promise.

If you read between the lines you'll see failures. This article is an example. Spawning 7500 servers to handle this traffic is a facepalm failure.
To be fair, looking at what kind of optimizations they did to improve the situation, it looks like GQL is not to blame but rather a pretty big disconnect between their implementation practices and understanding about what is costly and what is not costly (i.e. lack of mechanical sympathy). People may get by without that writing client-side JavaScript, but backend code is not as generous.
Yes, their optimisation team did a great job, including writing a post about it. They had 5 months to go from zero to a solution deployed on prod. And they shaved the numbers. Great work.

However, the ratio of servers to traffic is still very poor. And if you read between the lines, they maxed out their optimisation effort on it. They cache at multiple layers. To be more precise, the statement that they haven't found any low-hanging fruit for optimisation should raise some serious questions, and their analysis from "first principles" should probably be more thorough.

Yep, they seem to have settled at about 2100 servers which is bonkers insane. Go or Rust will likely max at no more than 100. If not 20-30 even.
>"As a side note, I haven't seen a single "We tried GraphQL and it failed us" story on HN"

Why would it "fail"? It is just one of many possible protocols to query data. It is like arguing about using this computer language vs that computer language. Bar differences in performance, they would all work.

Since it wasn’t mentioned, I’m curious if they have ever investigated the performance of GraphQL server implementations in other programming languages.
medium.com is terrible with GQL... I can have 50 tabs open in Chrome. Then I open 1 tab on medium.com and the CPU spikes like crazy. Turns out it's their GraphQL queries at 100 miles a second. One of the API (GQL) queries fails (I'm guessing because of my adblocker?) and then the retry seems to be in a tight never-ending loop!
This doesn’t sound like a problem due to GraphQL, though?
Found the Dream11 CTO, Amit Sharma, giving a tech talk about scaling, https://youtu.be/WifL4SWGJQw
I was considering a similar architecture for something else. It’s nice to see that it basically works to a point because I was worried of exactly the costs they ran into.
>"We provision approximately 7,500 instances for 1 million requests per second."

Looking at these numbers makes me think that a single instance of a properly written server running on a single dedicated piece of hardware can handle this without breaking a sweat. My servers for example handle thousands of requests per second. It looks to me like one giant waste of human and hardware resources. Not a very "green" approach I would say.

The benefit you are getting from GraphQL is massive flexibility of the query in return for probably sub-par performance.

The fact that you can get X 100K requests/per second best-case is not really the point. The point is if I don't want to write hand-cranked code for every kind of possible query, I take a performance hit as a result.

Not sure how easy it would be for them to identify poorly performing queries and split them out into their own optimised code.

>"The point is if I don't want to write hand-cranked code for every kind of possible query, I take a performance hit as a result."

This is how we end up with the architectures consuming orders of magnitude more computing resources and giant management overhead. Just because someone wants to be spared from a bit of thinking.

I can see how GraphQL would work for orgs with massive scale like FB/Google/insert your favorite. For most of the rest of the world it is nothing but unneeded overhead on resources, both human and computing.

And of course cloudy people like Amazon would love you to use all this tech. The more you slow down your application, the more resources you will be leasing from them, so they get more money.

The default implementation of GraphQL has a lot of overhead in query parsing and validation alone. You can try this yourself with complex queries and simulating some load.

But it's an issue that can be solved.

>"You can try this yourself with complex queries and simulating some load."

I do not need to try it. I know what it takes to parse/validate these kinds of queries and then manage to get and assemble the results from numerous sources.

>"But it's an issue that can be solved."

No. This issue will not be solved, as in general it is a problem of mapping one storage / functionality format to the end client format. It can be easily solved for particular situations by writing custom servers (this is for example one of the things I do) but doing it generically introduces overhead / costs that are very unhealthy for normal businesses.

And it is of course bad as it wastes energy.

Wouldn’t using a lambda for this be a good choice? You’re just parsing an input document into a set of backend requests and then executing them - there doesn’t have to be anything stateful here that would require an actual running instance.

If you combine this with API-gateway you’ve got caching (and potentially token auth) for free.

What you describe already exists. I'm the founder of https://wundergraph.com and we're doing exactly what you describe, combining GraphQL with Auth and Caching, plus some more extras...
I was referring to deploying the graphql routers onto an AWS lambda within your own account, which looks entirely different from your product?
I totally agree. My startup uses graphql inside a lambda function. I'm not running at any scale yet but I just can't see the point of not leveraging all that autoscaling infra. Perhaps lambda at real scale costs more but I've not seen any data on that.
10 million executions with 1gb of memory at ~300ms per invocation costs like 40 dollars.
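
A rough sketch of where a number like that comes from. The two rates below are assumptions, not from this thread (commonly published on-demand x86 rates; check current AWS pricing for your region), and the free tier is ignored.

```python
# Assumed on-demand rates (not from the thread); verify against current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_MILLION_REQUESTS = 0.20

# Workload from the comment above: 10 million invocations, 1 GB, ~300 ms each.
invocations = 10_000_000
memory_gb = 1.0
duration_s = 0.3

gb_seconds = invocations * memory_gb * duration_s            # billed compute
compute_cost = gb_seconds * PRICE_PER_GB_SECOND
request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
total = compute_cost + request_cost
print(f"compute ${compute_cost:.2f} + requests ${request_cost:.2f} = ${total:.2f}")
```

Under these assumed rates the total lands a little above 50 dollars, so the same ballpark as the figure above; the free tier or a cheaper architecture would pull it down.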
I've load-tested Absinthe on a single but beefy EC2 instance and got ~10K/s dummy GraphQL queries (not involving a database, just a resolver returning the value directly).
Either optimising the main app or caching with Redis/Memcached can seriously reduce the number of instances & improve the 133 req/sec per server metric as well.
It sounds like they are already caching but the problem is what happens when the number of possible queries is so high, you cannot cache them all and also on "live" apps like the football ones, the result of the query might change relatively quickly so can't be cached.

What might be possible is double-level caching, so you cache underlying data and then query from that, the results of which are also cached.
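
A minimal sketch of that double-level idea, assuming a simple in-memory TTL cache. `fetch_match`, the field names and the TTL values are hypothetical, and this is not how the article's servers are implemented.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh
        value = compute()
        self._store[key] = (now, value)
        return value


backend_calls = 0


def fetch_match(match_id: str) -> dict:
    """Hypothetical slow backend call; counts how often it is hit."""
    global backend_calls
    backend_calls += 1
    return {"id": match_id, "home": "A", "away": "B", "score": "1-0"}


data_cache = TTLCache(ttl=5.0)    # level 1: underlying data, shared by every query shape
result_cache = TTLCache(ttl=1.0)  # level 2: fully shaped results, one entry per query shape


def run_query(match_id: str, fields: Tuple[str, ...]) -> dict:
    key = f"{match_id}:{','.join(fields)}"

    def shape() -> dict:
        match = data_cache.get_or_compute(match_id, lambda: fetch_match(match_id))
        return {field: match[field] for field in fields}

    return result_cache.get_or_compute(key, shape)


run_query("m1", ("score",))
run_query("m1", ("home", "away"))  # different query shape, same underlying data
print(backend_calls)  # 1
```

The point of the split is that an unbounded number of query shapes only costs cheap reshaping, while the expensive backend call is keyed on the much smaller set of underlying objects.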

Depends how fast your data changes. SWR can be very effective for that. I suggest having a look at GraphCDN (bias alert, I’m one of the founders, so take it with some salt ;)
You could do that or directly cache at the edge [0] - makes it much faster for the user and is cheaper. https://graphcdn.io does that.
At Dream11 we love GraphQL :) It has made the lives of both — service owners and clients (Android, iOS, Web) much easier!

Over the years we had packed the server with almost all the graphQL optimizations we could find on the internet. The blog outlines some of the key optimizations we had put in to improve the performance of our application code (Which doesn't have a lot to do with GraphQL, as most people have already commented). I want to still give a bit of an "insider's perspective" as much as I can, so here it goes —

1. The graphQL team that did the optimizations had two engineers who were actively working on it. It seemed like a futile project at first. The goal was to find low-hanging fruits (if any) and prepare for our peak season (IPL 2021) but eventually, find other long-term alternatives. Killing graphQL altogether and moving that logic on the clients was still on the table. Fortunately, the team did a fantastic job of optimizing it so much that we are now committed to supporting it long-term.

2. We try to keep our microservices as discrete, pointed, and as unopinionated as possible. We also indulge the clients by letting them query huge amounts of data at once. All this makes our graphQL layer seriously complex. There is a huge amount of computation that happens on this layer. To get some perspective, our /health call to the server is 10x faster than the most requested graphQL query. Needless to say, it's not a fair comparison because, unlike the query, /health doesn't make any network calls or carry any practical CPU load.

3. We have caching implemented on our graphQL clients, however, the reason we get such a high request rate, is because our concurrency is also very high. A typical user is barely making 10 requests in a minute but overall we achieve millions of requests in a second.

4. As a part of the long-term strategy, we did consider using Rust as our stack of choice. We had heard a lot of noise about how Rust was beating all the benchmarks. So we did some POCs internally and implemented a part of our graphQL service in Rust. What we learned was that the Rust implementation was ~2.5x faster than our node.js implementation and also consumed relatively less memory. This was fine but wasn't good enough for us to migrate our large node.js codebase, and learn a completely new stack. Building a team with domain expertise in Rust in India is particularly hard.

5. It might seem like we are not pushing the production servers hard enough, and you'd be surprised to know that it's true! Because our traffic is very unpredictable we like to maintain a comfortable CPU utilization for every possible extreme scenario that our Data Science team can predict. The risk of our edge layer going down is seriously revenue hitting. So even when our benchmarks say we can push the systems 5x more, the final call remains with Site Reliability Teams and the risk appetite we have for that particular game.

6. The blog briefly also talks about using multiple ELBs, to which we distribute traffic using DNS. The problem with DNS is that it doesn't guarantee a truly uniform distribution of the traffic. Even with a very low TTL, sometimes we observe a difference of more than 20% in requests/sec between two ELBs at an instant. This and other infrastructure-specific nuances have to be considered by the SRE teams to estimate capacity on production.

7. Lastly, the servers we use on production are small machines — 8 cores for the majority of our stack. This lies in the goldilocks area where we get the best cost to performance ratios. Scaling down or up the machine type has a significant impact on the cost.
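
To put point 3 in numbers, a rough sketch: it combines the "10 requests in a minute" upper bound with the 1 million requests per second quoted from the blog, and assumes every user sits right at that bound.

```python
# From the blog: ~1,000,000 requests/sec overall.
total_rps = 1_000_000
# From point 3: a typical user makes at most ~10 requests per minute.
requests_per_user_per_minute = 10

per_user_rps = requests_per_user_per_minute / 60
# If every user sent the full 10 requests/minute, this many users would be
# active at once; lower per-user rates imply even more concurrent users.
active_users = total_rps / per_user_rps
print(f"{active_users:,.0f} concurrently active users")  # roughly 6 million
```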

It's been a journey of love and hate with graphQL and we continue to invest in making our edge robust and even faster. Feel free to connect with us on — https://twitter.com/D11Engg

https://twitter.com/Dream11Engg?s=09

This is the new link. Somehow the old link doesn't work if you have the app installed.

It was obvious to me from the beginning that GraphQL would add overhead and complexity on the backend; especially related to caching all possible permutations/views of the data. In some cases I imagine it would consume a lot of memory; wouldn't it cause a memory leak vulnerability if you allow infinite permutations to be cached by the server? On the other hand, if you only cache responses to popular requests, doesn't that expose your servers to DDoS? An attacker could just generate a ton of unique GraphQL queries to make the servers bypass the cache and consume a ton of CPU. The fact that GraphQL allows all these permutations in the queries is the root of the problem. It's not something which can be solved or optimized within GraphQL.

I think it's regrettable that all the big money got behind GraphQL instead of aiming for solutions which provide resource granularity and shift decision-making to the client side. Who is better placed to know what resources they want than the client? A big advantage of HTTP/REST is that it either serves individual resources or a limited number of different collections of resources and it lets clients do the heavy lifting of figuring out which resources they need and how they want to combine them. Caching REST endpoints is straightforward and resilient to DDoS attacks because the variation in responses is strictly limited.

Also, it makes sense to move processing to clients when those processing costs are imperceptible to users.

> It was obvious to me from the beginning that GraphQL would add overhead and complexity on the backend;

Did you read the article? Most of the issues weren't related to GraphQL, they were just Node issues/optimizations.

> I think it's regrettable that all the big money got behind GraphQL instead of aiming for solutions which provide resource granularity and shift decision-making to the client side.

This is the stated intent of GraphQL. Literally the reason it exists.

> The fact that GraphQL allows all these permutations in the queries is the root of the problem. It's not something which can be solved or optimized within GraphQL.

Common ways to solve that are to whitelist the allowed queries or to cache at the resolver level instead of the query level.
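
A minimal sketch of the whitelisting approach, assuming queries are identified by a hash of their text. The example queries, the names and the naive whitespace normalisation are all illustrative; production setups usually hash a properly parsed and printed query instead.

```python
import hashlib
from typing import Dict


def query_id(query: str) -> str:
    """Stable ID for a query: SHA-256 of its whitespace-normalised text.

    Naive normalisation: it would also collapse whitespace inside string literals.
    """
    normalised = " ".join(query.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


# Built at deploy time from the queries the clients actually ship (hypothetical examples).
ALLOWED_QUERIES: Dict[str, str] = {
    query_id(q): q
    for q in (
        "query Match($id: ID!) { match(id: $id) { name score } }",
        "query Me { me { id name } }",
    )
}


def is_allowed(query: str) -> bool:
    """Reject anything that is not on the allowlist before parsing or executing it."""
    return query_id(query) in ALLOWED_QUERIES


print(is_allowed("query Me {  me { id name } }"))                          # True
print(is_allowed("query Evil { users { friends { friends { id } } } }"))  # False
```

Because the set of accepted queries is finite and known up front, an attacker cannot generate fresh query shapes to dodge the cache.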

I'd like to argue against that. Yes, whitelisting is a solution. But Caching at the Query level can be extremely efficient. I'm the founder of WunderGraph and we're doing it like this. We turn GraphQL Operations into REST/JSON-RPC Endpoints, allowing them to be cached by CDNs, Browsers, etc... https://wundergraph.com/docs/overview/features/edge_caching
> I think it's regrettable that all the big money got behind GraphQL instead of aiming for solutions which provide resource granularity and shift decision-making to the client side. Who is better placed to know what resources they want than the client?

With GraphQL the client specifies exactly what it needs. It's as granular as you can imagine, unlike REST.