| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by EGreg 348 days ago

Simple. Use the principles I said above, for distributed and asynchronous systems.

The SQL server should have a way to do pubsub. It can then notify your PHP webhook via HTTP, or run a script on the command line, when a row changes. It should have an interface to subscribe to these changes.

If your SQL data store lacks the ability to notify others that a row changed, then there’s your main problem. Add that functionality. Make a TRIGGER in SQL for instance and use an extension.

If MySQL really lacks this ability then just put node.js middleware in front of it that will do this for you. And make sure that all mysql transactions from all clients go through that middleware. Now your combined MySQL+Node server is adequate to help clients avoid stale data.

As I already said, if you for some reason refuse to handle push updates, then you have to store the latest row state (etags) in your PHP cache (apcu). And then when you access a bunch of cached items on a request, at the end collect all their ids and etags and simply bulk query node.js for any cached items that changed. You can use bloom filters or merkle trees or prolly trees to optimize the query.

Joel Gustafson had a great article on this and now uses it in gossiplog: https://joelgustafson.com/posts/2023-05-04/merklizing-the-ke...

2 comments

tsimionescu 348 days ago

You're just pushing much of the complexity to that notification layer.

What if an SQL update changes a million rows, and you had cached the sum of those million rows? Should it send a million notifications? Should it re-run the sum for you? What if it's a complex operation and it takes a while to re-compute, and another update arrives before the re-compute finishes?

And of course, you will always have some partitions, so you will occasionally need to re-query the data anyway and re-establish any kind of pubsub system. And the complexity in efficiently reconciling two views of a large amount of data is a major issue, that you are just brushing off as "you can do various things to maybe optimize the query".

link

EGreg 348 days ago

Yes, if you really want to show something as ambitious as a sum of a million rows, then you should absolutely move to an event-based, push-based architecture. Because once you’ve calculated the sum, the least work you would need to do is to periodically look at batches of UPDATES to rows, and update your sum based on that (rather than, say, running SUM again on the whole entire table). You can either pull a digest periodically, or — you can handle incremental updates by listening for pushes.

All the difficulties you cite, including the “optimizations” I mentioned, stem from not using push, and insisting on periodically polling something.

If you step back, your whole problem is that you are using a system that only has pull and not push architecture. You’re trying to put a bandaid on that core fact. You chose a relational database system that you refuse to build extensions for (as opposed to pretty much any other system) in order to insist “nyeh nyeh you havent solved cache invalidation”.

Caching is just one part of sync. The part that stores a non-authoritative, slightly out of date copy locally. And sync, or replication, is just one part of an eventually consistent system. I mean, even MySQL supports replication, so just hook your client up to that protocol (encrypted in transit) and boom, now you can update your cache.

Here’s the thing. If you use any network protocol that is open, eg HTTP, then yea it’s solved. Because you can have webhooks and just spin up a server to handle them, and update your cache. A row is removed? Decrement your sum. A row is added? Increment it.

You are just insisting that “no, our system has to be clunky and we refuse to implement webhooks / websockets / sockets / push / mysql replication client / updates of any kind, now solve it for me to be as good as a system which has push capabilities.”

link

logical42 348 days ago

Adding change data capture to a database isn't exactly trivial.

link

EGreg 348 days ago

Well then maybe consider making a sql replication client that would ingest the changes (like most databases, mysql writes to a straming append-only log, before it is compacted). Just parse the log and act on them.

Not trivial, really? Here, enjoy: https://github.com/krowinski/php-mysql-replication

link

logical42 346 days ago

That means they did implement change data capture, and only exposed it for a very specific use case.

link

EGreg 345 days ago

Well, good. Your problem is artificial, choosing tools and scenarios to avoid "push". Your problem isn't with invalidating caches being hard in 2025. I've explained how to do it in both pull and push scenarios. Certainly not "the only hard problems in computer science." And neither is naming!

link

logical42 346 days ago

You didn't implement change data capture. You're using theirs.

link

EGreg 346 days ago

I’m also not making my own programming language, I’m using PHP. And I’m using MySQL. What is your point?

Are you saying that “invalidating caches is hard… if you insist on using only systems with no way to push updates, and oh if they have a way to push updates then you’re using someone’s library to consume updates so it doesn’t count?”

link