by ako 1998 days ago
This doesn't sound like a "temporal consistency" problem, but rather an inconsistent and opaque ordering issue.

Huh? It’s absolutely a temporal consistency problem. The ordering was perfectly clear and consistent (most->least popular at time of request). But the popularity scores were changing so rapidly that “at time of request” made the ordering unstable. If the ordering had been determined by popularity at the time I accessed page 1, it would have been stable.

Sure, that popularity score would be stale. But who cares?

Think of it this way. Suppose you’re viewing your Twitter timeline in recent order, and suppose the pagination (lazy loaded on scroll) worked this way, and suppose you have new Tweets arriving at the same rate you scroll. What you would see is the same page of Tweets repeat forever (well, until you hit the pagination cap).
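To make the repeating-page effect concrete, here is a toy simulation. It is only a sketch: the integer tweet ids, `PAGE_SIZE`, and `fetch_page` are made up for illustration and have nothing to do with how Twitter actually paginates.

```python
def fetch_page(timeline, offset, limit):
    """Offset pagination over a newest-first list, evaluated at request time."""
    return timeline[offset:offset + limit]


PAGE_SIZE = 3                  # hypothetical page size
timeline = [5, 4, 3, 2, 1, 0]  # hypothetical tweet ids, newest first
next_id = 6

seen_pages = []
for page_number in range(3):
    seen_pages.append(fetch_page(timeline, page_number * PAGE_SIZE, PAGE_SIZE))
    # While the reader scrolls, one page worth of new tweets lands on top,
    # pushing every older tweet down by exactly PAGE_SIZE positions.
    new_tweets = list(range(next_id + PAGE_SIZE - 1, next_id - 1, -1))
    timeline = new_tweets + timeline
    next_id += PAGE_SIZE

print(seen_pages)  # [[5, 4, 3], [5, 4, 3], [5, 4, 3]]
```

Each request asks for a deeper offset, but the list has grown by the same amount in the meantime, so the offset keeps landing on the same tweets.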

This is why people come up with solutions like cursors. But what I was suggesting is that you can instead continue to use offsets (for the benefits discussed in the article like parallelism and predictability) if you paginate on the state of the data at the time you began (edit: or on the state of your sorting criteria at the time you began, which allows for the mitigations I described upthread).

That’s not to suggest that once you begin a pagination, you’ll forever access stale data. It’s to suggest that a set of pagination requests can be treated as a session accessing a stable snapshot.

This can also be totally transparent to the client, and entirely optional (eg pagination is performed with an offset and an optional validAt token).
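A minimal sketch of what that could look like, assuming an in-memory server that freezes the ordering when a pagination session starts. `SnapshotPaginator`, `list_page`, and the integer form of the `valid_at` token are all hypothetical names and choices, not anything from the article.

```python
import itertools


class SnapshotPaginator:
    """Offset pagination with an optional valid_at token (hypothetical API)."""

    def __init__(self, scores):
        self.scores = dict(scores)        # live, mutable popularity scores
        self._snapshots = {}              # valid_at token -> frozen ordering
        self._tokens = itertools.count(1)

    def _current_order(self):
        # Most popular first; ties broken by name so the sort is deterministic.
        return sorted(self.scores, key=lambda name: (-self.scores[name], name))

    def list_page(self, offset, limit, valid_at=None):
        if valid_at is None:
            # No token: freeze the ordering as of now and hand back a token
            # the client may reuse on later requests, or ignore.
            valid_at = next(self._tokens)
            self._snapshots[valid_at] = self._current_order()
        order = self._snapshots[valid_at]  # an unknown token raises KeyError
        return order[offset:offset + limit], valid_at


pager = SnapshotPaginator({"a": 50, "b": 40, "c": 30, "d": 20})
page1, token = pager.list_page(0, 2)

pager.scores["d"] = 99                     # popularity shifts mid-pagination

page2_live, _ = pager.list_page(2, 2)                     # no token
page2_stable, _ = pager.list_page(2, 2, valid_at=token)   # same session

print(page1, page2_live, page2_stable)
```

Without the token, page 2 repeats "b" and never shows "d". With the token, pages 1 and 2 together cover every item exactly once, and the offsets stay independent of each other, so they can still be fetched in parallel.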

Ok, thanks for the explanation. I hadn't expected the popularity score to be so unstable. That means a lot of users are concurrently scoring the fonts, and that the averages are continuously being recalculated. Unexpected.

But you're right, if the dataset is continuously changing at high frequency, pagination makes no sense.

I think we’re almost on the same page (heh). Pagination may still make sense for a variety of reasons. Even if there’s no meaningful server optimization gain, clients may benefit from a reduced response size (mobile, expensive data plans, low power devices). That sort of thing is where ensuring consistency across multiple requests is useful, even if the underlying data is changing much faster than the client is consuming it. (For the sake of brevity I’ll keep saying consistency at a point in time, but there are other ways to let clients negotiate this.)

It’s worth noting here that this isn’t just applicable to paginated lists. It can also be used where you want to let the client optionally, concurrently access related resources. It can be used for client-side concurrent transactional writes. It’s a barrel of monkeys.

For what it’s worth, I wouldn’t assume their volume of traffic was necessarily the reason the data was in such flux. It could be (and I strongly suspect) that their popularity algorithm just stinks (eg weighting 100% of 1 view over 90% of 100 views). Even so, a snapshot in time is probably a much easier lift/gain for a flailing algorithm than really digging into the data and analytics to get an optimal stable sort without it:

1. Take a snapshot of the query and throw it in a dumb cache.

2. Rate limit cache key creation rather than cache hit requests.

3. Throw the caches out (and repopulate if reasonable) every [vaguely not gonna break the bank age duration].

4. Forget about any further optimization unless you reallllly need it.

5. Document the problems with the suboptimal solution, have a retro so your junior devs can develop better instincts, get them onto the next user story.

6. Put a “should we improve this” on the backlog of backlogs.

7. Profit.
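For what it's worth, steps 1 to 3 can be sketched in a few dozen lines. Everything here is an assumption for illustration: the `SnapshotCache` name, the in-memory dict, the injectable `clock`, and especially the choice to answer a rate-limited creation request by handing back the newest existing snapshot rather than rejecting it.

```python
import time


class SnapshotCache:
    """Dumb snapshot cache: rate-limited creation, unthrottled reads, max age."""

    def __init__(self, run_query, max_age, min_create_interval,
                 clock=time.monotonic):
        self._run_query = run_query
        self._max_age = max_age
        self._min_create_interval = min_create_interval
        self._clock = clock
        self._entries = {}       # key -> (created_at, rows)
        self._latest_key = None
        self._next_key = 1

    def _evict_expired(self, now):
        # Step 3: throw out anything older than max_age.
        expired = [key for key, (created, _) in self._entries.items()
                   if now - created >= self._max_age]
        for key in expired:
            del self._entries[key]

    def create_snapshot(self):
        now = self._clock()
        self._evict_expired(now)
        latest = self._entries.get(self._latest_key)
        if latest is not None and now - latest[0] < self._min_create_interval:
            # Step 2: too soon for a new key, so reuse the newest snapshot.
            return self._latest_key
        key = self._next_key
        self._next_key += 1
        # Step 1: run the query once and keep a copy of the result.
        self._entries[key] = (now, list(self._run_query()))
        self._latest_key = key
        return key

    def read(self, key, offset, limit):
        # Reads are never rate limited; an expired or unknown key gives None.
        self._evict_expired(self._clock())
        entry = self._entries.get(key)
        if entry is None:
            return None
        return entry[1][offset:offset + limit]


fake_now = [0.0]                  # fake clock so the example is deterministic
live_rows = ["a", "b", "c", "d"]
cache = SnapshotCache(lambda: live_rows, max_age=60, min_create_interval=5,
                      clock=lambda: fake_now[0])

first_key = cache.create_snapshot()
live_rows.reverse()               # the live ordering changes
fake_now[0] = 1.0
second_key = cache.create_snapshot()
page = cache.read(first_key, 2, 2)
fake_now[0] = 61.0
expired_page = cache.read(first_key, 0, 2)
```

In this run `second_key` equals `first_key` because the second creation request arrives inside the rate-limit window, `page` still reflects the original ordering, and the read after the max age comes back as `None`, at which point a client would start a fresh pagination session.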