Hacker News new | ask | show | jobs
by mambodog 3274 days ago
If you have low latency queries (eg. because most things hit cache rather than db) sequential queries aren't as much of a problem. Being able to join everything into one query, on the other hand, is a luxury which can be hard to maintain at scale.

In the 'join in SQL' case you could identify particular cases which are doing sequential queries and implement a different loader which just does one. It's not automatic but perhaps in most cases it's not necessary to do this step anyway. In the worst case you're back to doing as much work as you would for a bespoke API endpoint, but that's not the typical case. How much of a problem this ends up being in practice very much depends on the type of app you're building and how you intend to scale it.

1 comments

Saying join is a luxury is a dangerous thing :) (for impressional devs :)) 99% of projects are not "at FB scale" :) so join is exactly the right thing to use, it's been tuned over decades so until your scale/dataset does not outgrow one box (and there are big boxes now), you are not going to do a better job then the query optimiser (cause after all that is what you are trying to do).
Fair enough. What I'm trying to get at is that you need to pay attention to the cost of the real world queries which are actually being executed against your API, and then optimize. I'm not sure that SQL is a magic bullet there either (you can still request too much data at once, for example).

In practice this probably means adding some logging of how long requests take and graphing it (say, 95th percentile request time) from time to time to spot pathological queries. Even better if you can automate it. I think this is stuff that everyone should be doing (after a certain stage), regardless of whether you are using GraphQL or a bespoke JSON API.

I'd say build out your GraphQL server under the assumption that joins aren't always an option, but allow them as an optimization where possible.
why would i "optimise" for the thing that might never happen (join not an option) while taking the hit right now? i've tested this (3 level query) and the throughput is 10 times slower with dataloader, as in i need 10 servers instead of one to do the same job.
I mean to structure your server in such a way where you're not trying to automatically convert your entire query AST into a single massively complicated SQL query. It's true that in simple cases, one complex query can be the fastest way to get the data you need, but you leave this territory pretty quickly.

It's perfectly valid to identify subtrees of your query that would benefit from being executed as a database query, but to do that to your entire query just sounds like you're asking for trouble, I'd even go so far as to call it a premature optimization.

I kind of understand why you'd think it's a massively complicated query. You are probably thinking the types of joins on two tables on random columns with weird conditions which go into full table scan.

I am talking about queries/joins between tables that have foreign keys between them, like client/project/task/comment. I bet 90% of graphql schemas expose those kinds of relations between types.

For those type of relations (with FK) i can generate a single query that is as fast as it can be (certainly faster then dataloader) and as far as i've tested (a few millions of rows in tables, 3-7 levels in a query) i didn't leave the fast territory :) Of course there might be edge cases ...

About premature optimisations. Everyone likes to quote that, but never the full one which is "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." https://shreevatsa.wordpress.com/2008/05/16/premature-optimi...

The paper where it was published was about using goto to optimize loops and such small things, it was never "targeted" at algorithms and architecture.

As i think i mentioned here above, i was getting 10X throughput with joins compared to dataloader and i would not call that premature.

I use a programming language that is 1-2 orders of magnitude slower than C, should I switch today?

I get my data for a full UI within the budget i've allowed for uncached scenarios (100ms, but I want to go to 50ms). The approach you're suggesting will (in some circumstances) give me some short-terms win in terms of throughput and response times, but you've not said anything to suggest that I won't lose these benefits as I gradually transition into a domain-driven or micro-services (yuck) architecture.

I like to build my GraphQL servers under the assumption of a domain-driven architecture (because that's where all the projects I've worked on seem to end up, your mileage may vary), and then shoe-horn in some short-term performance tricks when I can.

I'm possibly a special snowflake here, but it's been a long time since i've had the opportunity to work on a project where I can go straight to the DB. Be it Elastic Search, a 3rd party, ill-advised micro-services, or complex logic in-between storage and presentation; nothing has quite been a pure DB project in the last 6-7 years.

Of course, you could argue this is premature architecture ;) but many of these complexities are from day one, or at least pretty early in a project's life.