by theikkila 1044 days ago
Related to 1: if I understood correctly, the agent generates a single object per flush interval containing all the data across all the topics it has received. Does this mean that a consumer has to read data for multiple partitions simultaneously just to access a single partition? And what about scaling consumers horizontally: how does the WarpStream Agent handle horizontal partitioning of the stream on the consuming side?

[WarpStream co-founder here]

That is correct about flushing. RE: consuming, the TL;DR is that the agents in an availability zone cluster with each other to form a distributed file cache, such that no matter how many consumers you attach to a topic, you will almost never pay for more than 1 GET request per 4MiB of data, per zone. Basically, when a consumer fetches a block of data for a single partition, that triggers an "over read" of up to 4MiB of data, which is then cached for subsequent requests. This cache is "smart" and will deduplicate all concurrent requests for the same 4MiB blocks across all agents within an AZ.
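To make the over-read and dedup idea concrete, here is a rough single-process sketch in Python. It is not WarpStream's actual implementation: the `BlockCache` class, the `fetch_range` callback (a stand-in for an object store ranged GET), and the `get_count` counter are all hypothetical names for illustration, and the real cache is distributed across the agents in an AZ rather than living in one process.

```python
import threading

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB over-read size from the comment above


class BlockCache:
    """Read-through cache: fetches whole aligned blocks, dedups concurrent misses."""

    def __init__(self, fetch_range):
        # fetch_range(key, start, length) -> bytes is a hypothetical stand-in
        # for a ranged GET against the object store.
        self._fetch_range = fetch_range
        self._lock = threading.Lock()
        self._blocks = {}    # (key, block_index) -> cached bytes
        self._inflight = {}  # (key, block_index) -> Event set when fetch ends
        self.get_count = 0   # number of GETs actually issued

    def _load_block(self, key, index):
        block_id = (key, index)
        while True:
            with self._lock:
                if block_id in self._blocks:
                    return self._blocks[block_id]
                event = self._inflight.get(block_id)
                leader = event is None
                if leader:
                    # First caller for this block issues the one GET.
                    event = threading.Event()
                    self._inflight[block_id] = event
                    self.get_count += 1
            if leader:
                try:
                    data = self._fetch_range(key, index * BLOCK_SIZE, BLOCK_SIZE)
                    with self._lock:
                        self._blocks[block_id] = data
                finally:
                    with self._lock:
                        del self._inflight[block_id]
                    event.set()
                return data
            # Another caller is already fetching this block: wait, then re-check.
            event.wait()

    def read(self, key, offset, length):
        """Return `length` bytes at `offset`, served from aligned 4 MiB blocks."""
        if length <= 0:
            return b""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        data = b"".join(self._load_block(key, i) for i in range(first, last + 1))
        start = offset - first * BLOCK_SIZE
        return data[start:start + length]
```

A small read for one partition pulls in the whole aligned block around it, so later reads that land in the same block (from the same or other consumers) are served from memory, and callers that arrive while the fetch is in flight wait for it instead of issuing their own GET.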

It's a bit difficult to explain succinctly in an HN comment, but hopefully that helps.

Is there a reason you built that cache layer yourself (rather than each node "just" running its own sidecar MinIO instance that writes through to the origin object store)?
(WarpStream co-founder)

The cache is for reads, not writes. There is no cache for writes.

We built our own because it needed to behave in a very specific way to meet our cost/latency goals. Running a MinIO sidecar instance means that every agent would effectively have to download every file in its entirety, which would not scale well and would be expensive. We also have a pretty hard-and-fast rule that deploying WarpStream should stay as simple as rolling out a single stateless binary.