

	by shikhar 189 days ago
	This is a fair question. A stream here == a log. Every write with S2 implementations is durable before it is acknowledged, and it can be consumed in real time or replayed from any position by multiple readers. The stream is at the granularity of discrete records, rather than a byte stream (although you can certainly layer either over the other). Edit: no k8s required for s2-lite, it is just a single binary. That was an architectural note about our cloud service.


csense 189 days ago

Your documentation needs improvement. It proudly mentions the alphabet soup of technologies you use, but it leaves me completely baffled about what s2 does, what problem s2 is trying to solve, or who the intended audience of s2 is.

So you frame the data into records, save the frame somehow (maybe with fsync if you're doing it locally, or maybe you outsource it to S3 or S3-compatible storage?), then ack and start sending it to clients. Therefore every frame that's acked or sent to clients has already been saved.
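To make the "save, then ack" ordering concrete, here is a minimal sketch of the local-file variant guessed at above. It is not S2's implementation (the thread says S2 itself relies on object storage for durability); `LocalLog`, its methods, and the 4-byte length-prefix framing are all hypothetical names and choices made for illustration.

```python
import os
import struct

# Assumed framing: each record is a 4-byte big-endian length, then the payload.
HEADER = struct.Struct(">I")


class LocalLog:
    """Toy record log in a single local file (illustrative only)."""

    def __init__(self, path):
        self._path = path
        self._f = open(path, "ab")  # append-only; creates the file if missing
        # Count complete records already on disk so numbering continues.
        self._count = sum(1 for _ in self.read_from(0))

    def append(self, payload: bytes) -> int:
        """Persist one record; returning the sequence number is the ack."""
        self._f.write(HEADER.pack(len(payload)) + payload)
        self._f.flush()             # move Python's buffer into the OS
        os.fsync(self._f.fileno())  # ask the OS to flush to stable storage
        seq = self._count
        self._count += 1
        return seq                  # only reached after fsync returned

    def read_from(self, seq: int):
        """Yield (sequence number, payload) for every record at or after seq."""
        with open(self._path, "rb") as f:
            index = 0
            while True:
                header = f.read(HEADER.size)
                if len(header) < HEADER.size:
                    return          # end of log
                (length,) = HEADER.unpack(header)
                payload = f.read(length)
                if len(payload) < length:
                    return          # incomplete tail record, never acked
                if index >= seq:
                    yield index, payload
                index += 1
```

Because `append` returns only after `os.fsync`, any record a reader can obtain through `read_from` has already been written out, which is the property described in this paragraph. Each call to `read_from` opens its own file handle, so several readers can replay from different positions independently.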

Personally I'd add an application level hash to protect the integrity of the records but that's just me.

At first glance I wondered if a hash chain or Merkle tree might be useful but I think it's overkill. What exactly is the trust model? I get the sense this is a traditional client-server protocol (i.e., not p2p). Does it stream the streams over HTTP / HTTPS, or some custom protocol? Are s2 clients expected to be end-user web browsers, other instances of s2 or something else?


sensodine 189 days ago

> it leaves me completely baffled about what s2 does, what problem s2 is trying to solve, or who the intended audience of s2 is

Regarding S2 generally (not just s2-lite), the intent behind it is to turn the core data structure from streaming platforms (like Kafka) into a serverless primitive -- kinda similar to what object storage did for file storage.

So if you are already in the world of working with streaming platforms, S2 gives you a simpler API, bottomless storage (S2 itself uses object storage for durability), and no limits on the quantity of streams you can create and work with. Streams also all have URIs and are directly accessible over REST with granular access controls.

This enables new patterns for working with streams, beyond the traditional ones where people typically reach for streaming platforms (CDC, ETL pipelines, etc.). For instance, an agent can have its own stream to serialize state onto. You can use a stream as a durable transport layer, e.g., to reliably deliver a flow of data (tokens from a model, financial ticker data, etc.) to a user and let them resume from exactly where they left off if they are disconnected. Or you could use streams as a durable ingest buffer for collecting data that will eventually reside in an OLAP database like ClickHouse.


shikhar 189 days ago

> Personally I'd add an application level hash to protect the integrity of the records but that's just me.

The durability is for being able to replay the stream; a hash will not let you reconstruct the original message(s).

If you just need ephemeral comms, making it persistent is indeed overkill. But reliability challenges often come up with seemingly ephemeral comms too – think streaming responses from an LLM. The last mile can be pretty flaky, e.g., iOS will cancel connections when users background an app. Using a durable stream to persist the tokens means a client can ask to resume from where it left off, or from the beginning of the stream, and the data would be available without having to re-run inference.
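A rough sketch of that resume pattern, under stated assumptions: `TokenStream` and `Client` are hypothetical names rather than the S2 API, and the stream here is an in-memory list standing in for a durable one. The point is only that the client remembers the next sequence number it needs and asks for records from there after a reconnect.

```python
class TokenStream:
    """Stand-in for a durable stream: records addressed by sequence number."""

    def __init__(self):
        self._records = []

    def append(self, token: str) -> int:
        self._records.append(token)
        return len(self._records) - 1

    def read(self, start_seq: int = 0):
        """Yield (sequence number, token) starting at start_seq."""
        for seq in range(start_seq, len(self._records)):
            yield seq, self._records[seq]


class Client:
    """Tracks its own position so it can resume after a dropped connection."""

    def __init__(self):
        self.next_seq = 0
        self.text = []

    def consume(self, stream: TokenStream, limit=None):
        # `limit` simulates the connection dying after that many records.
        for n, (seq, token) in enumerate(stream.read(self.next_seq)):
            if limit is not None and n >= limit:
                break
            self.text.append(token)
            self.next_seq = seq + 1
```

For example, if the model has appended four tokens and the client calls `consume(stream, limit=2)` before being backgrounded, a later `consume(stream)` picks up at sequence number 2, with no need to regenerate the first two tokens or run the model again.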
