by amluto 683 days ago
From reading the docs, this has an IMO surprising design decision: the “journal” is a stream of bytes, where each append (of a byte string) is atomic and occurs in a global order. The bytes are grouped into fragments, and no write spans a fragment boundary.

This seems sort of okay if writes are self-delimiting and never corrupt, and synchronization can always be recovered at a fragment boundary.

I suppose it’s neat that one can write JSONL and get actual JSONL in the blobs. But this seems quite brittle if multiple writers write to one journal and one malfunctions (aside from possibly failing to write a delimiter, there’s no way to tell who wrote a record, and using only a single writer per journal seems to defeat the purpose). And getting, say, Parquet output doesn’t seem like it will happen in any sensible way.
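A minimal sketch of the failure mode described above, assuming newline-delimited JSON and a reader that simply splits on `\n`. The `parse_jsonl` helper and the sample bytes are hypothetical and are not taken from Gazette:

```python
import json


def parse_jsonl(data: bytes):
    """Split a byte stream on newlines and parse each line as JSON.

    Returns (records, bad_lines); lines that fail to parse are kept raw.
    """
    records, bad_lines = [], []
    for line in data.split(b"\n"):
        if not line:
            continue  # skip the empty segment after a trailing newline
        try:
            records.append(json.loads(line))
        except ValueError:  # includes json.JSONDecodeError and bad UTF-8
            bad_lines.append(line)
    return records, bad_lines


# Three appends in order; the second writer forgot its trailing newline,
# so its record fuses with the third writer's otherwise valid record.
journal = b'{"a": 1}\n' + b'{"b": 2}' + b'{"c": 3}\n'
records, bad_lines = parse_jsonl(journal)
```

Here only the first record survives: the missing delimiter costs the well-behaved third writer its record as well.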

2 comments

:wave: Hi, I'm the creator of Gazette.

> But this seems quite brittle if multiple writers write to one journal and one malfunctions (aside from possibly failing to write a delimiter, there’s no way to tell who wrote a record, and using only a single writer per journal seems to defeat the purpose).

Yes, writers are responsible for only ever writing complete delimited blocks of messages, in whatever framing the application wants to use.
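As a rough illustration of that responsibility, a writer can build the whole delimited block in memory and hand it over as a single byte string. This is only a sketch: `frame_jsonl` and `append` are hypothetical names, and `append` is a stand-in for a real broker call, not Gazette's client API.

```python
import json


def frame_jsonl(messages) -> bytes:
    """Serialize messages into one newline-terminated JSONL byte string."""
    # json.dumps escapes newlines inside strings, so every message
    # occupies exactly one line of output.
    lines = [json.dumps(msg, separators=(",", ":")) for msg in messages]
    if not lines:
        return b""
    return ("\n".join(lines) + "\n").encode("utf-8")


def append(journal: bytearray, payload: bytes) -> None:
    """Hypothetical stand-in for an atomic broker append."""
    journal.extend(payload)


journal = bytearray()
append(journal, frame_jsonl([{"k": "a\nb"}, {"n": 1}]))
```

Because the block always ends with a newline and is appended in one call, another writer's append cannot land in the middle of a record.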

Gazette promises to provide a consistent total order over a bunch of raced writes, to roll back broken writes (partial content followed by a connection reset, for example), to checksum content, and a host of other things. There's also a low-level "registers" concept which can be used to cooperatively fence off the capability to write to a journal from other writers.

But garbage in => garbage out, and if an application correctly writes bad data, then you'll have bad data in your journal. This is no different from any other file format under the sun.

> there’s no way to tell who wrote a record

To address this comment specifically: while brokers are byte-oriented, applications and consumers are typically message oriented, and the responsibility for carrying metadata like "who wrote this message?" shifts to the application's chosen data representation instead of being a core broker concern.

Gazette has a consumer framework that layers atop the broker, and it uses UUIDs which carry producer and sequencing metadata in order to provide exactly-once message semantics atop an at-least-once byte stream: https://gazette.readthedocs.io/en/latest/architecture-exactl...
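The general idea of layering exactly-once processing over at-least-once delivery can be sketched as follows. This assumes each message carries a producer ID and a per-producer, monotonically increasing sequence number. The `Deduplicator` class is hypothetical and does not reflect Gazette's actual UUID layout or algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Deduplicator:
    """Tracks the highest sequence number seen from each producer."""

    last_seen: dict = field(default_factory=dict)

    def accept(self, producer: str, seq: int) -> bool:
        """Return True if the message is new, False if it is a replay."""
        if seq <= self.last_seen.get(producer, -1):
            return False  # duplicate delivery or a retried append
        self.last_seen[producer] = seq
        return True


dedup = Deduplicator()
deliveries = [("p1", 0), ("p1", 1), ("p1", 1), ("p2", 0), ("p1", 0), ("p1", 2)]
accepted = [dedup.accept(producer, seq) for producer, seq in deliveries]
```

A consumer that persists `last_seen` together with its own output can skip replays after a restart, which is where the exactly-once behavior comes from.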

> :wave: Hi, I'm the creator of Gazette.

Hi!

> if an application correctly writes bad data, then you'll have bad data in your journal. This is no different from any other file format under the sun.

In a journal that does its own delimiting, a bad write corrupts only that write (and anything depending on it); it doesn’t make the next message unreadable. I’m not sure how I feel about this.

I maintain a journal-ish thing for internal use, and it’s old and crufty and has all manner of design decisions that, in retrospect, are wrong. But it does strictly separate writes from different sources, and each message has a well defined length.
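For illustration, a framing with those two properties could look like the sketch below. The header layout (a 4-byte length plus a 2-byte source ID) and the function names are made up for this example and are not the format of the internal system described above.

```python
import struct

# Hypothetical header: 4-byte big-endian payload length, 2-byte source ID.
HEADER = struct.Struct(">IH")


def encode_frame(source_id: int, payload: bytes) -> bytes:
    """Prefix a payload with its length and the ID of its source."""
    return HEADER.pack(len(payload), source_id) + payload


def decode_frames(data: bytes):
    """Return a list of (source_id, payload) for every complete frame."""
    frames, offset = [], 0
    while offset + HEADER.size <= len(data):
        length, source_id = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        if offset + length > len(data):
            break  # truncated final frame; stop rather than guess
        frames.append((source_id, data[offset:offset + length]))
        offset += length
    return frames


# Source 1 writes a garbage payload; source 2's message is still readable.
stream = encode_frame(1, b"not json {") + encode_frame(2, b'{"ok":true}')
frames = decode_frames(stream)
```

The reader never inspects payload bytes to find message boundaries, so a malformed payload from one source cannot swallow the next message.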

Also, mine supports compressed files as its source of truth, which is critical for my use case. It looks like Gazette has a way to post process data before it turns into a final fragment — nifty. I wonder whether anyone has rigged it up to produce compressed Parquet files.

To my knowledge, nobody's implemented Parquet fragment files. But it supports compression of JSONL out of the box. JSON compresses very well, and compression ratios approaching 10:1 are not uncommon.

But more to the point, journals are meant for things that are written _and read_ sequentially. Parquet wasn't really designed for sequential reads, so it's unclear to me whether there would be much benefit. IMHO it's better to use journals for sequential data (think change events) and other systems (e.g. an RDBMS or Parquet + pick-your-compute-flavor) for querying it. I don't think there's yet a storage format that works equally well for both.

I don't think it's correct to say that JSONL is any more vulnerable to invalid data than other message framings. There's literally no system out there that can fully protect you from bugs in your own application. But the client libraries do validate the framing for you automatically, so in practice the risk is low. I've been running decently large Gazette clusters for years now using the JSONL framing, and have never seen a consumer write invalid JSON to a journal.

The choice of message framing is left to the writers/consumers, so there's also nothing preventing you from using a message framing that you like better. Similarly, there's nothing preventing you from adding metadata that identifies the writer. Having this flexibility can be seen as either a benefit or a pain. If you see it as a pain and want something that's more high-level but less flexible, then you can check out Estuary Flow, which builds on Gazette journals to provide higher-level "Collections" that support many more features.