
by tzury 452 days ago

I’ve found that starting with a plain old filesystem often outperforms fancy services - just as the Unix philosophy (“everything is a file” [1]) has preached for decades [2].

When BigQuery was still in alpha I had to ingest ~15 billion HTTP requests a day (headers, bodies, and metadata). None of the official tooling was ready, so I wrote a tiny bash script that:

    1. uploaded the raw logs to Cloud Storage, and
    2. tracked state with three folders: `pending/`, `processing/`, `done/`.

A cron job cycled through those directories and quietly pushed petabytes every week without dropping a byte. Later, Google’s own pipelines—and third-party stacks like Logstash—never matched that script’s throughput or reliability.
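A minimal local-filesystem sketch of the three-folder pattern described above. This is not the original bash script: the function name `run_once`, the `handle` callback, and the use of local directories in place of Cloud Storage prefixes are all assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path

STATES = ("pending", "processing", "done")


def run_once(root: Path, handle) -> int:
    """Move every file in pending/ through processing/ to done/.

    `handle` is a hypothetical callback that does the real work
    (for example, an upload). Returns the number of files completed.
    """
    for state in STATES:
        (root / state).mkdir(parents=True, exist_ok=True)

    completed = 0
    for src in sorted((root / "pending").iterdir()):
        claimed = root / "processing" / src.name
        try:
            # On POSIX, rename within one filesystem is atomic, so two
            # workers cannot both claim the same file.
            os.rename(src, claimed)
        except FileNotFoundError:
            continue  # another worker claimed it first
        handle(claimed)
        os.rename(claimed, root / "done" / src.name)
        completed += 1
    return completed


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "pending").mkdir()
    (demo_root / "pending" / "a.log").write_text("GET /\n")
    print(run_once(demo_root, handle=lambda path: None))
```

The directory a file sits in is its state, so a crashed run leaves its in-flight files visible in `processing/` for inspection or retry. Calling `run_once` from cron gives the loop described above.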

Lesson: reach for the filesystem first; add services only once you’ve proven you actually need them.

[1] https://en.wikipedia.org/wiki/Everything_is_a_file

[2] https://en.wikipedia.org/wiki/Unix_philosophy

4 comments

sunshine-o 452 days ago

Absolutely.

I would add that filesystems are superior to data formats (XML, JSON, YAML, TOML) for many use cases such as configuration or just storing data.

- Hierarchy is directories,

- Keys are file names,

- Values are the file contents,

- Other metadata lives in hidden files.

It will work forever, and you can leverage ZFS, Git, rsync, and syncthing much better. If you want, a fancy shell like Nushell will bring the experience pretty close to a database.

Most importantly, you don't need fancy editor plugins or to learn XPath, jq, or yq.
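As a rough sketch of that mapping (directories as hierarchy, file names as keys, file contents as values), the tree can be read into a nested dict in a few lines. The function name `load_tree` and the example layout (`db/host`, `db/port`, `debug`) are hypothetical, and skipping hidden files is just one possible way to treat the metadata convention.

```python
import tempfile
from pathlib import Path


def load_tree(root: Path) -> dict:
    """Read a directory tree into a nested dict.

    Directories become nested dicts, file names become keys, and file
    contents (stripped of surrounding whitespace) become values.
    Hidden entries (names starting with ".") are skipped in this sketch.
    """
    result = {}
    for entry in sorted(root.iterdir()):
        if entry.name.startswith("."):
            continue
        if entry.is_dir():
            result[entry.name] = load_tree(entry)
        else:
            result[entry.name] = entry.read_text().strip()
    return result


# Demo with an assumed layout in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "db").mkdir()
    (demo_root / "db" / "host").write_text("localhost\n")
    (demo_root / "db" / "port").write_text("5432\n")
    (demo_root / "debug").write_text("false\n")
    print(load_tree(demo_root))
```

All values come back as strings, so any typing (integers, booleans) would have to be layered on by the caller.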


drob518 452 days ago

Yes, but there are a few downsides:

1. For config, it spreads the config across a bunch of nested directories, making it hard to read and write it without some sort of special tool that shows it all to you at once. Sure, you can easily edit 50 files from all sorts of directories in your text editor, but that’s pretty painful.

2. For data storage, lots of small files will waste partial storage blocks in many file systems. Some do coalesce small files, but many don’t.

3. For both, it’s often going to be higher performance to read a single file from start to finish than a bunch of files. Most file systems will try to keep file blocks in mostly sequential order (defrag’d), whereas they don’t typically do that for multiple files in different directories. SSDs make this mostly a non-issue these days, though you still pay for the extra opens, closes, and read calls.
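To put a rough number on point 2, here is a back-of-the-envelope sketch. It assumes a simple file system that rounds every file up to whole blocks, with an assumed 4096-byte block size and no tail packing, inline data, or compression; real file systems vary, and inode overhead is ignored.

```python
import math


def allocated_bytes(file_sizes, block_size=4096):
    """Bytes occupied on disk if each file is rounded up to whole blocks.

    block_size=4096 is an assumed, common default, not a universal one.
    """
    return sum(math.ceil(size / block_size) * block_size for size in file_sizes)


# 10,000 values of 20 bytes each, stored one value per file...
many_small = allocated_bytes([20] * 10_000)
# ...versus the same 200,000 bytes stored in a single file.
one_big = allocated_bytes([20 * 10_000])

print(many_small, one_big)
```

Under those assumptions the one-file-per-value layout occupies about 41 MB against about 0.2 MB for the single file, which is the waste being described. Whether that matters depends on how many values there are.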


sunshine-o 452 days ago

> 1. For config, it spreads the config across a bunch of nested directories, making it hard to read and write it without some sort of special tool that shows it all to you at once. Sure, you can easily edit 50 files from all sorts of directories in your text editor, but that’s pretty painful.

It really depends how comfortable you are using the shell and which one you use.

cat, tree, sed, grep, etc. will get you quite far, and one might argue that they are simpler to master than vim and the various formats. Actually, mastering VSCode also takes a lot of effort.

> 2. For data storage, lots of small files will waste partial storage blocks in many file systems. Some do coalesce small files, but many don’t.

> 3. For both, it’s often going to be higher performance to read a single file from start to finish than a bunch of files. Most file systems will try to keep file blocks in mostly sequential order (defrag’d), whereas they don’t typically do that for multiple files in different directories. SSDs make this mostly a non-issue these days, though you still pay for the extra opens, closes, and read calls.

Agreed, but for most use cases here it really doesn't matter, and if I need to optimise storage I will need a database anyway.

And I sincerely believe that most micro-optimisations at the filesystem level are cancelled out by running most editors with data format support enabled.


cryptonector 452 days ago

Except that when you do need a tool like XSLT/XPath, jq, or yq, you now need bash. I use bash a lot, but I'd still rather use a better language, like the ones you listed.

I'm being slightly hypocritical because I've made plenty of use of the filesystem as a configuration store. In code it's quite easy to stat one path relative to a directory, or open it and read it, so it's very tempting.
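A small sketch of why that is tempting: one lookup is a path join plus a read, and a boolean flag is a single existence check. The helper names `get_setting` and `is_enabled` and the example keys are hypothetical.

```python
import tempfile
from pathlib import Path


def get_setting(config_dir: Path, key: str, default=None):
    """Return the value stored at config_dir/key, or `default` if absent.

    `key` is a relative path such as "db/port" (an assumed layout).
    """
    try:
        return (config_dir / key).read_text().strip()
    except FileNotFoundError:
        return default


def is_enabled(config_dir: Path, flag: str) -> bool:
    """Treat a flag as on if a regular file with that name exists."""
    return (config_dir / flag).is_file()


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "db").mkdir()
    (demo_dir / "db" / "port").write_text("5432\n")
    (demo_dir / "maintenance").touch()
    print(get_setting(demo_dir, "db/port"))
    print(get_setting(demo_dir, "db/user", default="postgres"))
    print(is_enabled(demo_dir, "maintenance"))
```

Anything beyond single lookups (queries across keys, transforms) is where a dedicated tool or a real language becomes more attractive.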


user3939382 452 days ago

You don’t need bash to traverse a file system. Are you saying something else?


cryptonector 451 days ago

You don't need bash itself. Substitute any shell, Python, whatever.


ryanianian 452 days ago

Not sure if it's still in use, but for a very long time, AWS billing relied on getting usage data via rsync.


cratermoon 452 days ago

Command-line tools can be 235x faster than a Hadoop cluster. https://news.ycombinator.com/item?id=17135841


dominicq 452 days ago

Can you say more about the use case? What problem were you solving? How did it work exactly? Sounds interesting so I'd like to learn more.


tzury 452 days ago

Sure.

We were building Reblaze (started 2011), a cloud WAF / DDoS-mitigation platform. Every HTTP request—good, bad, or ugly—had to be stored for offline anomaly-detection and clustering.

   Traffic profile

     - Baseline: ≈ 15 B requests/day
     - Under attack: the same 15 B can arrive in 2-3 hours

Why BigQuery (even in alpha)?

It was the only thing that could swallow that firehose and stay query-able minutes later — crucial when you’re under attack and your data source must not melt down.

Pipeline (all shell + cron)

Edge nodes → write JSON logs locally; a local cron job pushes them to Cloud Storage

Tiny VM with a cron loop

   - Scans `pending/`, composes many small blobs into one “max-size” blob in `processing/`.
   - Executes `bq load …` into the customer’s isolated dataset.
   - On success, moves the blob to `done/`; on failure, drops it back to `pending/`.

Downstream ML/alerting pulls straight from BigQuery

That handful of `gsutil`, `bq`, and `mv` commands moved multiple petabytes a week without losing a byte. Later pipelines—Dataflow, Logstash, etc.—never matched its throughput or reliability.
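The cron-loop step (compose small blobs into one batch, load it, then file it under `done/` or back under `pending/`) could look roughly like this on a local filesystem. This is a sketch under assumptions, not the original script: `compose_batch`, `load_batch`, and the `load` callable (standing in for the `bq load …` invocation) are hypothetical names, and batch names are assumed to be unique per run.

```python
import os
import tempfile
from pathlib import Path


def compose_batch(root: Path, batch_name: str, max_bytes: int):
    """Concatenate files from pending/ into one file in processing/.

    Stops before the batch would exceed max_bytes (a single oversized
    file is still taken alone). Returns the batch path, or None if
    pending/ holds no files.
    """
    pending = sorted(p for p in (root / "pending").iterdir() if p.is_file())
    if not pending:
        return None
    batch = root / "processing" / batch_name
    taken, written = [], 0
    with open(batch, "wb") as out:
        for src in pending:
            size = src.stat().st_size
            if written and written + size > max_bytes:
                break
            out.write(src.read_bytes())
            written += size
            taken.append(src)
    # Remove the sources only after the batch file has been closed.
    for src in taken:
        src.unlink()
    return batch


def load_batch(root: Path, batch: Path, load) -> bool:
    """Run the hypothetical `load` callable and file the batch by outcome."""
    try:
        load(batch)
    except Exception:
        # Failure: drop the batch back into pending/ for the next run.
        os.rename(batch, root / "pending" / batch.name)
        return False
    os.rename(batch, root / "done" / batch.name)
    return True


# Demo in a throwaway directory with a loader that always succeeds.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    for state in ("pending", "processing", "done"):
        (demo_root / state).mkdir()
    (demo_root / "pending" / "edge-1.json").write_bytes(b'{"path": "/"}\n')
    demo_batch = compose_batch(demo_root, "batch-0001", max_bytes=1024)
    print(load_batch(demo_root, demo_batch, load=lambda path: None))
```

A failed batch goes back to `pending/` whole, so the next run picks it up again like any other input.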
