Hacker News new | ask | show | jobs
by simonw 1398 days ago
The only approach I can think of which might be able to handle mutability of the rows used for sorting would be to support paginating through a snapshot: provide a mechanism whereby a full snapshot of the query at a specific point in time is captured such that the user can then paginate through that snapshot.

Expensive to implement, so this would only work for situations where you can afford to spend significant storage and computation resources to keep a specific user happy.

3 comments

That might be even worse (or just as bad) for users, as now they won't see any updates to the underlying data set even if they want to, and will presumbaly need to perform some explicit action to get a new snapshot.

Personally, if you care about users not missing any item in a query, you just can't use pagination at all, and you have to give them every item in the query in a single huge dump (running the query again would be the "explicit action" mentioned above that gets the user new data). Conversely, if you use pagination, users are free to assume that they might miss some items unless they already expect the underlying data to be immutable.

Not quite. The API action and the previous responses pagination token is what denotes the consistent list “snapshot.” They get a current view of the datastore every time they start a new paginated API operation, eg without passing an existing token.

Edit: you might think of the list results like a materialized view. It works really well with natural ordering on something like item create time and you can pass that in as a query param.

This theoretically should be possible with MVCC, right? It's not an area I've explored and I could immediately see some issues with resource clean-up, but I could imagine it being possible with most modern DBs.
Yep, keep transaction open with necessary isolation. But it requires very thorough design of queries, as you'll run into locks pretty quickly. MVCC is not magic.
You could also store timestamps as part of the records. Then your query will always be consistent if you add an extra clause of tstamp < query_tstamp. So long as you store both the query_tstamp and the last record, you’ll get consistent results without needing to store individual cursors/snapshots for each user. (You’ll still have more CPU time per query, but that’s kind of a given here).

You probably also need to switch from deleting records to adding an archive bit (or timestamp).

This gets complicated fast.

This also presumes that your records always are processed in order, right?

We struggled with this at a prior job because records within the most recent ~10 minutes could be delayed for any reason. We'd be collecting data from thousands of IoT devices and any of them could have momentarily lost a network connection for a few minutes, been rebooted, etc. So it'd be totally normal for end users to see records from 30 seconds ago, but then a few minutes later, see older records appear in between those records and older records because there were minor delays in some of them getting sent and processed.

We were leaning toward just putting a bound on the range that the end user could see (for example, only show records that are at least ten minutes old) to reduce the variability, but that gave the appearance that our system took ten minutes to process all data. It was a tricky one to solve from a user expectation perspective.

I don't see how that solves the issue. A record can be updated for multiple reasons unrelated to the current database query, and any update at all to the record would hide it using the timestamp approach, regardless if the update would actually affect which page the data is on.
I didn’t say it was a great approach, but if you used the timestamp as part of the PK and treated every record as immutable (any update resulted in a new row), it would work.

I’m not actually a fan of this approach as I think it causes too many scaling issues. If I were to do this, I’d either keep a database cursor in my session or store the full List of PKs in my session for the current query. This way you’d have the order of results as they existed at the time or the query. You still have to deal with the UPDATE/DELETE issue, but o guess this just depends on how much you care about consistent query pagination.

I’m not that much of a stickler and most datasets don’t change fast enough to make it an issue.