Mistakes are always easy to recognize in retrospect, so hopefully this comment isn't too unfair, but one thing that caught me about this is that logically it makes no sense. You would never use a bloom filter for just 10 entries. If you have only 10 entries, it is almost certainly faster to skip the bloom filter. So I feel like that is the part that should have instantly stood out.
[Obviously, I've made my own silly mistakes over the years, many much sillier than this; it's just weird to describe this one as only detectable by profiling.]
I don't know why you're trying to analyze the meaningfulness of sentences that are not the result of a human thought process but are clearly rhetorical flourishes from an LLM that "feels" compelled to fill its prose with them.
Comments that explicitly call out an article as slop tend to get downvoted (or disagreed with); it's best to guide the reader towards their own conclusions.
Do you think the author is somehow capable of writing the entire codebase, but not able to reason about code???
I'm sure you've never made a silly mistake where you passed the wrong integer parameter to a function, stared at your screen, and failed to notice it. Or, forgot the order of arguments to calloc().
If you're saying that profiling is for those too lazy to reason about their code, you're distorting the whole lesson: profiling is more powerful than guessing.
So the author is doing a self-learning exercise about profiling pre-production code, and you're disagreeing with them by comparing it to a commercial contract. I'm sure you've never, ever made a dumb mistake while getting paid.
No, that's not the point. This isn't a situation where you need to "guess"; bloom filters should be sized according to their capacity. This is akin to having a fixed 10-arg buffer for your program, getting a crash when someone passes 11, and saying "this is the kind of bug you only find by building the thing and measuring it". Yeah it happens and we all make silly mistakes, but it's just not true that this couldn't have been foreseen.
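To make the "sized according to their capacity" point concrete, here's a rough sketch using the standard textbook sizing formulas for a bloom filter. The function name `bloom_parameters` and the 1% target rate are my own assumptions for illustration, not anything taken from the article's code.

```python
import math
from typing import Tuple


def bloom_parameters(expected_keys: int, false_positive_rate: float) -> Tuple[int, int]:
    """Return (bits, hash_count) for a standard bloom filter.

    Uses the usual formulas m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    The name and signature are hypothetical, not from the article.
    """
    if expected_keys <= 0:
        raise ValueError("expected_keys must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    bits = math.ceil(-expected_keys * math.log(false_positive_rate) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / expected_keys * math.log(2)))
    return bits, hash_count


if __name__ == "__main__":
    # Compare a filter sized for 10 keys with one sized for 100,000 keys.
    for n in (10, 100_000):
        bits, hashes = bloom_parameters(n, 0.01)
        print(f"{n} keys -> {bits} bits, {hashes} hash functions")
```

At a 1% target this works out to roughly 10 bits per key, so a filter of about 100 bits is only appropriate for about 10 keys. That is the arithmetic you can do on paper before running anything.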
There are a lot of use cases where you only truly need consistency, and durability can take a back seat. RocksDB for example does not fsync its WAL writes in the default configuration.
If you can't at least guarantee write ordering you don't even have consistency.
Fsync is often used even when the data doesn't truly need to be on disk yet, because operating systems don't expose very good write-ordering APIs, even though ordering is all you truly need.
Well, the thing about reliability is that you can't really guarantee it by testing one particular scenario.
It seems to me that neither the old nor the new version of the code is really "durable" as I would understand the word. The old version made a write syscall per batch, but the article doesn't say it also did an fsync per batch. The new version writes data to an mmap'ed file, and calls fsync in the background.
So both versions are "durable" in the sense that written data is preserved even if the process gets killed, because it's in the OS page cache. But in both versions, a write can be completed before the data actually makes it to disk, so a power failure will lose acknowledged writes.
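For anyone unsure about the distinction, here's a minimal sketch of the two durability levels being discussed. The helper `append_batch` and its `durable` flag are hypothetical names of mine; this is not the article's code, just the general write-then-fsync pattern on a POSIX-style file descriptor.

```python
import os


def append_batch(fd: int, payload: bytes, durable: bool) -> None:
    """Append one batch to an open file descriptor.

    With durable=False the data only reaches the OS page cache: it survives
    a process kill, but may be lost on power failure.
    With durable=True we fsync before returning, so the caller can treat the
    write as acknowledged only once the OS reports it flushed to the device.
    """
    view = memoryview(payload)
    while view:
        # os.write may write fewer bytes than requested, so loop until done.
        written = os.write(fd, view)
        view = view[written:]
    if durable:
        os.fsync(fd)
```

The trade-off is the usual one: an fsync per batch is slow, and batching or backgrounding it is faster but means an acknowledged write can still be lost if the power goes out.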
> A 100-bit bloom filter holding 100,000 keys is saturated instantly. Every bit is set. It returns “maybe present” for every key you ask about — which means it filters nothing, and every read falls through to a full file scan.
Hahaha. (Seems like the bloom filter library isn't configured with a maximum false positive rate and/or set to autoexpand.)
Edit: Actually there's a BloomFalsePositive setting, maybe it never gets used? Also maybe it's not a library and it's a custom implementation.
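A back-of-the-envelope check of the saturation claim, using the standard approximation for a bloom filter's false positive rate. The hash count of 7 is an assumption on my part (the quoted text doesn't say how many hash functions are used), and `expected_false_positive_rate` is a hypothetical helper name.

```python
import math


def expected_false_positive_rate(bits: int, keys: int, hash_count: int) -> float:
    """Approximate false positive rate: (1 - e^(-k*n/m))^k.

    bits = m, keys = n, hash_count = k. This is the usual textbook
    approximation, not a measurement of any particular implementation.
    """
    fill_ratio = 1.0 - math.exp(-hash_count * keys / bits)
    return fill_ratio ** hash_count


if __name__ == "__main__":
    # hash_count=7 is assumed for illustration only.
    print(expected_false_positive_rate(bits=100, keys=100_000, hash_count=7))
    print(expected_false_positive_rate(bits=958_506, keys=100_000, hash_count=7))
```

With 100 bits and 100,000 keys the estimate comes out at essentially 1.0, i.e. "maybe present" for everything, while a filter of roughly 960,000 bits lands near 1%.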
> This is the kind of bug you only find by building the thing and measuring it.
No? I mean, maybe if you're vibecoding it's the only way, but in the prehistoric days you could reason about what code would do before you ran it.