Hacker News new | ask | show | jobs
by thomasmg 1133 days ago
O(n^2) algorithms often cause performance issues. The main cases I have seen in business logic are: (A) offset pagination (select ... offset n) and then paginate over all entries, and (B) read a text value, append something, store, repeat.
3 comments

Ummm.. Sure, but what's happening here is that FreeBSD stores a list in a file not in the desired order, then sorts it before use.

It seems to be any prolonged discussion about which sorting algorithm should be used is sort of skipping the elephant. Why is the list not sorted to begin with?

Without that basic knowledge it isn't very productive to focus on implementation details, no matter how fun the sorting exercise is. Deleted code is fast code.

I was talking about "exponential algorithms" in general, and about business logic. I know FreeBSD isn't business logic, but low-level kernel code. I don't know the details of the FreeBSD problem.
How select with offset leads to O(n²)?
For each page, you have a select statement. The first without offset, then with offset 10 limit 10, then offset 20 limit 10, offset 30 limit 10 and so on. The database engine will, for each query, read all entries up to the offset + limit. This is a staircase pattern: first it reads 10, then 20, then 30, and so on. So the sum of read entries is n * n / 2, which is O(n^2).

One could argue that the database engine should be "smarter", but it's not. Note that data could be added or removed in the meantime, so the database engine can't really cache the result. See also https://stackoverflow.com/questions/70519518/is-there-any-be...

Sorry, I've missed the "and then paginate over all entries" part. Anyway thank you for the detailed explanation.
So, how to do offset w/o using OFFSET?
Include the last id (which should be indexed) of the previous page in the next where filter.

https://use-the-index-luke.com/sql/partial-results/fetch-nex...

That's pagination, not indexed page scanning. Both have their place but they're not the same. Pagination is way better to handle updates between page loads and generally more complicated to implement. As you're now doing head tail index cursor tracking. Flat boring offset/limit is amazingly simple for the happy lazy path which is probably fine for most apps.
Make a query whose parameters exclude the previous page results altogether. I learned about this from here: https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin...
If you need to iterate over all records, just do it? Why do you need offset.

Otherwise using offset usually is OK idea. Because users very rarely will inspect page #2153. They're interested with page 1, sometimes page 2. limit/offset works fine for those cases and it'll work for page 2153 for those who visit it once in a decade. Using ids makes logic to trac prev/next/page number incredibly complex and generally you don't need it.

> If you need to iterate over all records, just do it?

Who is "you" here?

Usually what happens is that party A builds a REST API (or other connectionless protocol) for fetching a list of some entity-type; and they limit the number of items that can be fetched in one request (because they have a billion such entities, and they don't want to even try to imagine what kind of system would be necessary to generate and stream back a 5TB JSON response); which implies pagination, to get at "the rest of" the entities.

Then party B, a user of this API, decides that they want to suck party A's whole billion-entity database out through the straw of that REST API, by scraping their way through each page.

> it'll work for page 2153 for those who visit it once in a decade

To be looking at page 2153, the user probably first visited every page before 2153. Which means they didn't do one O(N) query (which would by itself be fine); but rather, they made O(N) requests that each did an O(N) query.

I regularly change 3 to 2 in /new/345689 links when bored with today’s content.

Using ids makes logic to trac prev/next/page number incredibly complex and generally you don't need it.

When it’s a public site, users may post so fast that this “next” can show some previous page. Paging via ids is a must there.

> “next” can show some previous page

That is usually a non-issue. The cost in DB operations is usually much more relevant than it.

When people do actually care about fully enumeration and unicity of the items they they are querying, "pagination" itself tends to be a too messy concept.

The cost in DB operations is usually much more relevant than it.

As a result, a user (all of them) hits “next” again until a page looks like containing new posts. It’s multiple requests wasted.

Anyway, what exactly becomes messy?

What a "page" means when enough things appear between them that they push stuff a page away? Are those things reordered? Do people expect to see the new things before they get to the old?

A "page" is a very ill-defined thing that can only exist if your stuff is mostly static. Queries are very different on dynamic content.

Use a sorted index (e.g. Btree), and use those values to quickly find the start of the next page of results.

For good performance this also requires that your search criteria (if any) be compatible with this index.