| One of the guys behind Convex explained the rough idea; I hope I do it justice. The strategy is to break the subscription up into listens based on the read-set ranges of the query. Then you put the individual read-set ranges into a system table that you index. Finally when writes happen, you notify all queries who's read-set intersects with the write-set. For example, say I have a query `SELECT * from block WHERE parent_id = X AND last_modified_at > Y`. This query might create two subscriptions for its read sets: { query_id: Q, table: block, column: parent_id, min: X, max: X }
{ query_id: Q, table: block, column: last_modified_at, min: Y, max: Infinity }
Now a write happens: `INSERT INTO block { id: Z, parent_id: X, last_modified_at: Y + 10, title: "hi" }`We can find our subscriptions to notify by doing queries like these: SELECT query_id FROM subscription WHERE
table = block
AND (
(column = id
AND min <= new_block.id
AND max >= new_block.id
)
OR
(column = parent_id
AND min <= new_block.parent_id
AND max >= new_block.parent_id)
OR
(column = last_modified_at
AND min <= new_block.last_modified_at
AND max >= new_block.last_modified_at)
)
Then you notify all those query_ids.I'm sure there's a lot of details and optimizations you can do on top of this; finding the right read sets seems pretty tricky for complex queries. Plus stuff like batching/debouncing, etc. |
There’s several challenges with this approach that come up for me (which unless I’m mistaken is the naive approach of checking each write against a notification set).
The first is that maintaining the read set seems very expensive since it scales with the number of live queries installed. In a multi tenant DB, that seems tricky to manage in terms of memory used.
The second is that the computation of intersection with a read range seems potentially very expensive. Imagine you have a string column. You now have to do a string comparison for every insertion that might hypothetically match the query.
Finally, computing that read set correctly seems challenging (as you mention) and it’s not immediately clear to me it’s always tractable (eg could you do this with arbitrary complicated joins?).
Additionally, in your description, each write has an implicit table scan to do the select to find the intersection. That will tank write throughput even for small total intersection sets (eg there’s a reason LSM databases do deletes by inserting a tombstone instead of checking whether the data exists first, same with the merge operator in RocksDB - a db read in the write path significantly kills performance)