by eyelidlessness 1998 days ago
A certain level of parallelism is generally within the realm of good API citizenship. Even naive rate limiting schemes tend to permit a certain number of concurrent requests (as they well should, since even browsers may perform concurrent requests without any developer intervention).

Rate limiting and pagination aren’t (necessarily) about making full data consumption more difficult. They’re more often about optimizing common use cases and general quality of service.

Edit to add: in certain circles (e.g. those of us who take REST and HATEOAS as baseline HTTP API principles), parallelism is often not just expected but encouraged. A service can provide efficient, limited subsets of a full representation and allow clients to retrieve as little or as much of the full representation as they see fit.
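
To make the "certain number of concurrent requests" idea concrete, here is a minimal client-side sketch of bounded parallelism. The names `fetch_all` and `fake_fetch` and the default limit of 4 are made up for illustration; a real client would replace `fake_fetch` with an actual HTTP call and pick a limit the API's rate limiting tolerates.

```python
import asyncio


async def fetch_all(fetch, keys, max_concurrency=4):
    """Fetch every key with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(key):
        # The semaphore blocks here once max_concurrency fetches are running.
        async with semaphore:
            return await fetch(key)

    # gather returns results in the same order as keys.
    return await asyncio.gather(*(fetch_one(key) for key in keys))


async def fake_fetch(key):
    # Hypothetical stand-in for an HTTP GET of one sub-resource.
    await asyncio.sleep(0.01)
    return {"key": key}


if __name__ == "__main__":
    print(asyncio.run(fetch_all(fake_fetch, ["a", "b", "c"], max_concurrency=2)))
```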

One thing that frequently bugs me is APIs limiting the number of items per page for reasons of efficiency. I can perfectly understand low limits for other reasons, like not helping people scrape your data.

But limiting for efficiency is usually done in a way that I would call a cargo cult. First, the number of items per "page" is usually a number one would pick per displayed page, in the range of 10 to 20. This is inefficient for the general case: at that size, the payload is usually only about as large as the request plus response headers. So if the API isn't strictly for display purposes, pick a number of items per page that strikes a useful balance between not transmitting too much useless data and keeping query and response overhead low. Paginate in chunks of 100kB or more.
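
A rough back-of-the-envelope version of that argument, as a sketch. The 100 bytes per item and 1500 bytes of combined request and response headers are assumed figures for illustration, not measurements from any real API.

```python
def overhead_fraction(items_per_page, item_bytes, header_bytes):
    """Share of each request/response exchange that is headers, not payload."""
    payload = items_per_page * item_bytes
    return header_bytes / (payload + header_bytes)


def pages_needed(total_items, items_per_page):
    """Number of requests needed to read the whole collection (ceiling division)."""
    return -(-total_items // items_per_page)


if __name__ == "__main__":
    # Assumed sizes: 100-byte items, 1500 bytes of headers per exchange.
    for per_page in (15, 1000):
        share = overhead_fraction(per_page, 100, 1500)
        requests = pages_needed(10_000, per_page)
        print(f"{per_page} items/page: {share:.1%} overhead, {requests} requests")
```

Under these assumptions a 15-item page is half headers, while a 1000-item page (about 100kB) spends under 2% of the bytes on headers and needs far fewer round trips.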

In terms of computation and backend load, pagination can be as expensive for a one-page query as for a full query. Usually this occurs when the query doesn't directly hit an index or similar data structure, so a full sweep over all the data cannot be avoided. So think and benchmark before you paginate, and maybe add an index here and there.
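
One way to check this before paginating is to ask the database for its query plan. The sketch below uses SQLite from Python's standard library purely as a stand-in; the `items` table, its columns, and the index name are hypothetical, and other databases report plans differently.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, created INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO items (created, body) VALUES (?, ?)",
    [((i * 7919) % 10007, "x") for i in range(1000)],
)

# A typical "first page" query: sort, then take a small window.
page_sql = "SELECT id, body FROM items ORDER BY created LIMIT 20"

# Without an index on the sort column, every row is read and sorted
# even though only 20 rows are returned.
plan_before = query_plan(conn, page_sql)

conn.execute("CREATE INDEX idx_items_created ON items (created)")

# With the index, rows can be read in sorted order and the scan can stop early.
plan_after = query_plan(conn, page_sql)

if __name__ == "__main__":
    print("before:", plan_before)
    print("after: ", plan_after)
```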

Strongly agree. I have an API I work on where if you ask for a couple of gigabytes of data, it'll send it to you, because if you're asking for it, that's what you want. It gets streamed out, and the docs warn you that you will get exactly what you asked for, so you may want to chunk up on your side (for this particular API there is a trivial way for clients to do that), or if you can handle a full stream, go nuts.
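
As a rough illustration of "stream everything, let the client chunk", here is a sketch that assumes newline-delimited JSON as the wire format. That format and the names `stream_records` and `chunked` are assumptions for the example, not a description of the API mentioned above.

```python
import json
from itertools import islice


def stream_records(records):
    """Server side: yield one newline-delimited JSON line per record."""
    for record in records:
        yield json.dumps(record) + "\n"


def chunked(lines, size):
    """Client side: group an incoming stream into lists of at most `size` records."""
    iterator = iter(lines)
    while True:
        batch = [json.loads(line) for line in islice(iterator, size)]
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    source = ({"n": i} for i in range(5))
    for batch in chunked(stream_records(source), 2):
        print(batch)
```

Because both sides are generators, neither has to hold the full result in memory; the client decides how much to process at a time.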

Pagination would just complicate things. I think with most APIs, intended as APIs (i.e., not just an endpoint primarily meant to feed a front-end page), you're better off thinking of your default as "I'm going to just stream everything they ask for", and look for reasons why that won't work, rather than start from the presumption that everything must be paginated from the beginning.

Don't get me wrong; there are plenty of solid reasons to paginate. You may discover one applies to your API. But if you can leave it out, it's often simpler for both the producer and the consumer. Wait until you find the need for it. Plus, if that happens, you'll have a better understanding of the actual problem you need to solve and better solutions may reveal themselves.

In my experience, a one-page query rarely costs the same as a full query in real-world scenarios.

In very simple cases, like a single-table SQL query, absolutely - databases effectively have to compute the full result, sort it, and return a window. There's almost no reason to paginate here, at an API level, unless the consumer wants only a subset (say, because of bandwidth limitations). Sending it all at once can be a huge benefit for those that will use it all: it's both simpler and faster for all parties.

But in most real-world cases, there are at least two additional details that can add significant response time: joins (when not involved in sorting) and additional data-gathering needed to fully build the response (e.g. getting data from other systems, internal or external). Joined data is not typically loaded prior to computing limit/offset since it may be a massive waste, and external data is effectively the same issue but with far higher latency.
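
A small sketch of why that matters: if the window is computed first, the expensive per-row work only runs for the rows on the page. `page_then_enrich` and the `enrich` callback are hypothetical names; `enrich` stands in for a join lookup or a call to another system.

```python
def page_then_enrich(rows, sort_key, offset, limit, enrich):
    """Sort cheaply, cut the window, then do expensive work only on that window."""
    window = sorted(rows, key=sort_key)[offset:offset + limit]
    return [enrich(row) for row in window]


if __name__ == "__main__":
    calls = []

    def enrich(row):
        # Stand-in for a join or a remote lookup; records that it was called.
        calls.append(row["id"])
        return {**row, "extra": f"details-for-{row['id']}"}

    rows = [{"id": i, "rank": 100 - i} for i in range(100)]
    page = page_then_enrich(rows, lambda r: r["rank"], offset=0, limit=10, enrich=enrich)
    print(len(page), "rows returned,", len(calls), "enrich calls")
```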

And that's before getting into other practical issues, e.g. systems that can't process the response stream as it comes in - a subset will load-and-return faster than the whole content in all cases, so e.g. a website loading some json can show initial UI faster while loading more in the background. Streaming is often possible and that'll negate a lot of the downsides, but it's far less common than processing a request only after it completes.

Strongly disagree. I've seen too many cases of API users overfetching for no reason. I don't mind providing a bulk API, but that is a very different use case that regular endpoints shouldn't have to support.
My intent with pagination is always to prevent problems with an open-ended data set size. 1000 items at once is usually not a problem, but 10x, 100x, etc. is a big problem for transferring over the wire.