Supporting `MAXSIZE` as well as `MAXLEN` on `XADD` would also handle a nice Kafka feature (the ability to define your log size in either number of messages or size on disk).
Something like: `XADD MAXSIZE ~ 2147484000 * foo bar` to cap the stream at 2GB + 1 node.
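To make the "2GB + 1 node" remark concrete, here is a toy sketch of approximate size-based trimming. `MAXSIZE` is only the proposal above, not an existing option, and `SizeCappedLog` and `entries_per_node` are hypothetical names. The sketch assumes `~` means whole nodes are dropped only while the log would still hold at least the cap afterwards.

```python
from collections import deque


class SizeCappedLog:
    """Toy append-only log trimmed by approximate byte size, one whole node at a time."""

    def __init__(self, max_bytes: int, entries_per_node: int = 4):
        self.max_bytes = max_bytes
        self.entries_per_node = entries_per_node
        self.nodes = deque()  # each node is a list of payloads (bytes)
        self.total_bytes = 0

    def add(self, payload: bytes) -> None:
        # Start a new node when there is none yet or the newest one is full.
        if not self.nodes or len(self.nodes[-1]) >= self.entries_per_node:
            self.nodes.append([])
        self.nodes[-1].append(payload)
        self.total_bytes += len(payload)
        self._trim()

    def _trim(self) -> None:
        # Drop the oldest node only if the log still holds >= max_bytes afterwards,
        # so the log may exceed the cap by up to one node's worth of data.
        while len(self.nodes) > 1:
            oldest = sum(len(p) for p in self.nodes[0])
            if self.total_bytes - oldest < self.max_bytes:
                break
            self.nodes.popleft()
            self.total_bytes -= oldest
```

With a 100-byte cap and 40-byte nodes, the log in this sketch settles between 100 and 139 bytes, which is the small-scale version of "cap + 1 node".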
I'll probably eventually hate myself for even bringing this up but I can't help but notice the similarity between this ID structure and Version 1 (aka timestamp) UUIDs. While I wouldn't go as far as recommending that you fully adopt that form, it might be worth considering if you could make these IDs compatible with UUIDs by defining a canonical transform. The critical differences are:
- UUIDs use a different epoch (15 Oct 1582 vs 1 Jan 1970)
- UUIDs count 100 ns blocks instead of ms
- UUIDs include a 6 byte "node id"
- UUIDs allow only up to 15 bits of "sequence"
I think that last one is the biggest deal, since as currently specced redis allows 64 bits of sequence, which is obviously much bigger than 15. The options I see are to increase the time resolution used by redis, to encode some of redis's sequence bits into the UUID's time bits, or to just live with it as a limitation. In practice 2^15 is a lot of messages to get in a single millisecond (though if the clock jumps back it might not be that many).
You'd also need to come up with something for the node id, perhaps the first 6 bytes of a cluster node ID or similar.
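The first two differences (epoch and resolution) are plain arithmetic. A minimal sketch, with hypothetical function names, that derives the offset between the two epochs and converts in both directions:

```python
from datetime import datetime, timezone

GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)  # UUID v1 epoch
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

TICKS_PER_MS = 10_000  # one tick = 100 ns

# Whole days between the two epochs, expressed in 100 ns ticks.
EPOCH_OFFSET_TICKS = (UNIX_EPOCH - GREGORIAN_EPOCH).days * 86_400 * 10_000_000


def unix_ms_to_uuid_ticks(unix_ms: int) -> int:
    """Milliseconds since 1 Jan 1970 -> 100 ns ticks since 15 Oct 1582."""
    return unix_ms * TICKS_PER_MS + EPOCH_OFFSET_TICKS


def uuid_ticks_to_unix_ms(ticks: int) -> int:
    """Inverse of the above; sub-millisecond ticks are discarded."""
    return (ticks - EPOCH_OFFSET_TICKS) // TICKS_PER_MS
```

The derived offset matches the constant `0x01B21DD213814000` that UUID implementations commonly hard-code.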
Thinking about this some more I realized you could also encode sequence into the low order bits of the timestamp, and rereading the RFC showed it actually makes this recommendation[1]. There are 10000 100-nanosecond periods per millisecond which gives about 13 more bits. Between that and the 13-15 bits available in the clock sequence you've got 26+ bits of sequence or ~67MM values per millisecond.
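A quick check of that bit budget, taking the conservative 13 bits for the clock sequence (variable names are just for illustration):

```python
import math

TICKS_PER_MS = 10_000  # 100 ns periods in one millisecond

# Whole bits that fit below one millisecond of 100 ns ticks.
sub_ms_bits = math.floor(math.log2(TICKS_PER_MS))

clock_seq_bits = 13  # low end of the 13-15 bit range mentioned above
total_sequence_bits = sub_ms_bits + clock_seq_bits
values_per_ms = 2 ** total_sequence_bits

print(sub_ms_bits, total_sequence_bits, values_per_ms)
```

Note that 2^13 = 8192 is below 10000, so a 13-bit counter added to the millisecond's tick count never spills into the next millisecond.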
Since 64 bits is overkill for milliseconds (45 bits covers the next 1000 years or so) I was thinking you could put 2 bytes of the node id in the high order bytes there (perhaps could call this the "clock id"?) and the remaining 4 bytes of the node id could go in the high order bytes of the sequence, which would still leave 32 bits for actual sequence values (but we should only use 26 or so). This means we'd get a translation roughly as follows (numbering bytes and bits from high to low significance):
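| Redis field | Redis bytes / bits | Redis meaning | Version 1 UUID |
|---|---|---|---|
| Timestamp | Bytes 0-1 | "clock id" | Bytes 4&5 of node id |
| Timestamp | Bytes 2-7 | millis since 1 Jan 1970 | * 10000 => ~45 high order timestamp bits |
| Sequence | Bytes 0-3 | "node id" | Bytes 0-3 of node id |
| Sequence | Bytes 4-7, first 6 bits | wasted space | ignored |
| Sequence | Bytes 4-7, remaining 26 bits | actual sequence value | 13 high order bits => clock sequence; 13 low order bits => low order timestamp bits |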
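Here is a rough sketch of that translation using Python's standard `uuid` module. The function name is hypothetical, and adding the 1582 epoch offset is my reading of the "different epoch" point rather than something the layout above spells out. Both Redis ID halves are taken as 64-bit unsigned integers.

```python
import uuid

# 100 ns ticks between 15 Oct 1582 (UUID epoch) and 1 Jan 1970 (Unix epoch).
UUID_EPOCH_OFFSET_TICKS = 0x01B21DD213814000


def redis_id_to_uuid1(ts_field: int, seq_field: int) -> uuid.UUID:
    """Map a (timestamp, sequence) pair laid out as proposed onto a version 1 UUID."""
    clock_id = (ts_field >> 48) & 0xFFFF       # timestamp bytes 0-1 -> node bytes 4-5
    millis = ts_field & 0xFFFF_FFFF_FFFF       # timestamp bytes 2-7
    node_hi = (seq_field >> 32) & 0xFFFF_FFFF  # sequence bytes 0-3 -> node bytes 0-3
    seq = seq_field & 0x3FF_FFFF               # low 26 bits; the 6 bits above are ignored

    clock_seq = seq >> 13                      # 13 high order bits
    sub_ms = seq & 0x1FFF                      # 13 low order bits

    ticks = millis * 10_000 + UUID_EPOCH_OFFSET_TICKS + sub_ms
    if ticks >= 1 << 60:
        raise ValueError("timestamp does not fit in the 60-bit UUID time field")

    node = (node_hi << 16) | clock_id
    fields = (
        ticks & 0xFFFF_FFFF,       # time_low
        (ticks >> 32) & 0xFFFF,    # time_mid
        (ticks >> 48) & 0x0FFF,    # time_hi (version bits are set by version=1)
        clock_seq >> 8,            # clock_seq_hi (variant bits are set by version=1)
        clock_seq & 0xFF,          # clock_seq_low
        node,
    )
    return uuid.UUID(fields=fields, version=1)
```

Reading the result back through `u.time`, `u.clock_seq` and `u.node` recovers the pieces, so the transform is reversible as long as the 6 ignored bits are zero.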
Another implication of this scheme is that if redis has access to a clock that offers higher than millisecond resolution it could store everything more precise than millisecond into the sequence portion of the id.
On a side note it seems that the clock sequence in the UUID is intended to be reset to a random value at startup and every time a clock jump is detected, rather than just incremented. Redis could do something similar by incrementing some of the 13 high-order bits of the sequence every time a clock jump is detected (and/or if the 13 low-order bits overflow).
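One way this could look, purely as a sketch (the class and its behaviour are my assumptions, not anything redis does): keep a 13-bit "high" counter that is bumped on a backwards clock jump or when the 13-bit "low" counter overflows.

```python
LOW_BITS = 13
HIGH_BITS = 13


class JumpAwareSequence:
    """Hypothetical 26-bit sequence: high bits change on clock jumps or overflow."""

    def __init__(self):
        self.last_ms = -1
        self.high = 0
        self.low = 0

    def next(self, now_ms: int) -> int:
        if now_ms < self.last_ms:
            # Clock went backwards: move to a fresh high value.
            self._bump_high()
        elif now_ms == self.last_ms:
            self.low += 1
            if self.low >= 1 << LOW_BITS:
                # Low counter overflowed within one millisecond.
                self._bump_high()
        else:
            # New millisecond: restart the low counter, keep the high bits.
            self.low = 0
        self.last_ms = now_ms
        return (self.high << LOW_BITS) | self.low

    def _bump_high(self) -> None:
        self.high = (self.high + 1) % (1 << HIGH_BITS)
        self.low = 0
```

A real implementation would also have to decide what to do when the high bits wrap around, which this sketch ignores.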
Why keep the separation at all? Are clients expected to be able to query for a given timestamp precisely? Because then you get all the problems with clock synchronization, especially given that the Streams' clock is monotonic and I'd expect clients' clocks not to be.
It might be useful to be able to query by server time regardless of whether your client clock is in sync. You retrieve some set of data and the next time you can ask the server to give you everything newer than x, where x was the highest time stamp you got from the server previously.
Yes exactly, you want to ask what is newer than x, where x is the last event you're aware of, but you don't really care about the date and time in that case. If you just store the last id given by redis Streams naively then you don't even care that they're timestamps; at that point my question is, why even bother with the distinction. Just ask for everything after x and be done with it.
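For what it's worth, "everything after x" only needs the two halves of the ID compared as numbers. A small in-memory sketch (no server involved; `parse_id` and `entries_after` are hypothetical helpers, and entries are assumed to be sorted by ID):

```python
from bisect import bisect_right


def parse_id(stream_id: str) -> tuple[int, int]:
    """Split '<ms>-<seq>' into integers; a bare '<ms>' is treated as seq 0."""
    ms, _, seq = stream_id.partition("-")
    return int(ms), int(seq or 0)


def entries_after(entries, last_id: str):
    """Return the entries whose ID is strictly greater than last_id."""
    keys = [parse_id(entry_id) for entry_id, _ in entries]
    start = bisect_right(keys, parse_id(last_id))
    return entries[start:]


log = [
    ("9-0", {"foo": "a"}),
    ("10-0", {"foo": "b"}),
    ("10-1", {"foo": "c"}),
    ("11-0", {"foo": "d"}),
]
newer = entries_after(log, "10-0")  # the client stored "10-0" from its last read
```

The client never interprets the first half as a date here; it just has to avoid comparing IDs as strings, since "9-0" sorts after "10-0" lexicographically.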
Redis also has TIME to get the current server time with milliseconds and the unix time stamp. I'm reasonably sure that's what's being used to get the first part of the ID anyway.
Ah, I didn't know that. Although the post says that the timestamp of an id might also be the timestamp of the last message, since the clock can go backwards, so in the worst case a client might get some duplicate messages.
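My understanding of the behaviour described in the post, as a sketch (hypothetical class, not the actual redis code): if the clock has not moved forward, the generator reuses the last timestamp and bumps the sequence, so IDs keep increasing even when the clock jumps back.

```python
class MonotonicIdGenerator:
    """Generates '<ms>-<seq>' IDs that never decrease, whatever the clock does."""

    def __init__(self):
        self.last_ms = -1
        self.last_seq = 0

    def next_id(self, now_ms: int) -> str:
        if now_ms > self.last_ms:
            # Clock moved forward: new timestamp, sequence restarts at 0.
            self.last_ms = now_ms
            self.last_seq = 0
        else:
            # Same millisecond, or the clock went backwards: keep the old
            # timestamp and only advance the sequence.
            self.last_seq += 1
        return f"{self.last_ms}-{self.last_seq}"
```

So after a backwards jump the timestamp half of an ID can be later than the server's current clock reading, which is the case the post mentions.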
You can also have a look at the technique used here to create collision-free sequential unique IDs across a cluster, even if it is just for inspiration: https://www.npmjs.com/package/cuid
Example:
c - h72gsb32 - 0000 - udoc - l363eofy
The groups, in order, are:
1. 'c' - identifies this as a cuid, and allows you to use it in html entity ids. The fixed value helps keep the ids sequential.
2. Timestamp
3. Counter - a single process might generate the same random string. The weaker the pseudo-random source, the higher the probability. That problem gets worse as processors get faster. The counter will roll over if the value gets too big.
4. Client fingerprint. For example, the first two chars are extracted from the process.pid. The next two chars are extracted from the hostname.
5. Random - the last group in the example is a random block.
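To show how the groups fit together, here is a loose imitation in Python. This is not the cuid library's actual algorithm: the hostname hashing, the block widths and all names are assumptions made for illustration.

```python
import os
import secrets
import socket
import time

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Encode a non-negative integer in base 36."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def pad(text: str, width: int) -> str:
    """Left-pad with zeros, then keep only the last `width` characters."""
    return text.rjust(width, "0")[-width:]


class CuidLikeGenerator:
    def __init__(self):
        self.counter = 0

    def next_parts(self):
        timestamp = to_base36(int(time.time() * 1000))
        counter = pad(to_base36(self.counter), 4)
        self.counter = (self.counter + 1) % (36 ** 4)  # rolls over when too big
        host = socket.gethostname()
        # Assumed fingerprint: two chars from the pid, two from a hostname checksum.
        fingerprint = pad(to_base36(os.getpid()), 2) + pad(to_base36(sum(map(ord, host))), 2)
        random_block = pad(to_base36(secrets.randbits(64)), 8)
        return ("c", timestamp, counter, fingerprint, random_block)

    def next_id(self) -> str:
        return "".join(self.next_parts())
```

Because the timestamp comes right after the fixed prefix, IDs from one process sort roughly by creation time, which is the property that makes the scheme interesting for stream IDs.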