Hacker News new | ask | show | jobs
by mbreese 1398 days ago
You could also store timestamps as part of the records. Then your query will always be consistent if you add an extra clause of tstamp < query_tstamp. So long as you store both the query_tstamp and the last record, you’ll get consistent results without needing to store individual cursors/snapshots for each user. (You’ll still have more CPU time per query, but that’s kind of a given here).

You probably also need to switch from deleting records to adding an archive bit (or timestamp).

This gets complicated fast.

2 comments

This also presumes that your records always are processed in order, right?

We struggled with this at a prior job because records within the most recent ~10 minutes could be delayed for any reason. We'd be collecting data from thousands of IoT devices and any of them could have momentarily lost a network connection for a few minutes, been rebooted, etc. So it'd be totally normal for end users to see records from 30 seconds ago, but then a few minutes later, see older records appear in between those records and older records because there were minor delays in some of them getting sent and processed.

We were leaning toward just putting a bound on the range that the end user could see (for example, only show records that are at least ten minutes old) to reduce the variability, but that gave the appearance that our system took ten minutes to process all data. It was a tricky one to solve from a user expectation perspective.

I don't see how that solves the issue. A record can be updated for multiple reasons unrelated to the current database query, and any update at all to the record would hide it using the timestamp approach, regardless if the update would actually affect which page the data is on.
I didn’t say it was a great approach, but if you used the timestamp as part of the PK and treated every record as immutable (any update resulted in a new row), it would work.

I’m not actually a fan of this approach as I think it causes too many scaling issues. If I were to do this, I’d either keep a database cursor in my session or store the full List of PKs in my session for the current query. This way you’d have the order of results as they existed at the time or the query. You still have to deal with the UPDATE/DELETE issue, but o guess this just depends on how much you care about consistent query pagination.

I’m not that much of a stickler and most datasets don’t change fast enough to make it an issue.