Hacker News
by MontyCarloHall 1143 days ago
I’ve experimented with using gcsfuse and its AWS equivalent, s3fs-fuse, in production. At best, they are suited to niche applications; at worst, they are merely nice toys. The issue is that every file system operation is fundamentally an HTTP request, so the latency is several orders of magnitude higher than the equivalent disk operation.

For certain applications that consistently read limited subsets of the filesystem, this can be mitigated somewhat by the disk cache, but for applications that would thrash the cache, cloud buckets are simply not a good storage backend if you desire disk-like access.

What I would really like to see is a two-tier cache system: most recently accessed files are cached to RAM, with less recently accessed files spilling over to a disk-backed cache. That would open up a world of additional applications whose useful cache size exceeds practical RAM amounts.
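To make the two-tier idea concrete, here is a minimal sketch in Python, not anything gcsfuse or s3fs-fuse actually ships. The class name `TwoTierCache`, its parameters, and the `fetch` callback (standing in for the slow request to the bucket) are all hypothetical. Hot entries live in an in-memory LRU; when RAM is over budget the least recently used entry spills to a file on disk, and when the disk tier is over budget its oldest entry is dropped.

```python
import os
import tempfile
from collections import OrderedDict


class TwoTierCache:
    """Hypothetical LRU cache: hot entries in RAM, colder ones spilled to disk."""

    def __init__(self, ram_bytes, disk_bytes, spill_dir=None):
        self.ram_bytes = ram_bytes
        self.disk_bytes = disk_bytes
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="tiercache-")
        self.ram = OrderedDict()   # key -> bytes, most recently used last
        self.disk = OrderedDict()  # key -> (path, size), most recently spilled last
        self.ram_used = 0
        self.disk_used = 0
        self._counter = 0          # used to build unique spill file names

    def get(self, key, fetch):
        """Return the bytes for key, calling fetch(key) only on a full miss."""
        if key in self.ram:
            self.ram.move_to_end(key)
            return self.ram[key]
        if key in self.disk:
            # Disk hit: promote the entry back into RAM.
            path, size = self.disk.pop(key)
            self.disk_used -= size
            with open(path, "rb") as f:
                data = f.read()
            os.remove(path)
        else:
            data = fetch(key)  # slow path: one round trip to the bucket
        self._put_ram(key, data)
        return data

    def _put_ram(self, key, data):
        self.ram[key] = data
        self.ram_used += len(data)
        # Spill least recently used entries until RAM is within budget.
        while self.ram_used > self.ram_bytes and self.ram:
            old_key, old_data = self.ram.popitem(last=False)
            self.ram_used -= len(old_data)
            self._put_disk(old_key, old_data)

    def _put_disk(self, key, data):
        self._counter += 1
        path = os.path.join(self.spill_dir, f"{self._counter}.blob")
        with open(path, "wb") as f:
            f.write(data)
        self.disk[key] = (path, len(data))
        self.disk_used += len(data)
        # Drop the oldest spilled entries once the disk tier is over budget.
        while self.disk_used > self.disk_bytes and self.disk:
            _, (old_path, old_size) = self.disk.popitem(last=False)
            self.disk_used -= old_size
            os.remove(old_path)
```

Usage would look like `cache.get("path/to/object", fetch)`, where `fetch` wraps whatever client call downloads the object. The sketch caches whole objects and ignores concurrency and staleness, which is where most of the real difficulty lies.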

15 comments

This seems overly pessimistic to me.

Sure you're not going to use this as a consumer in place of a local disk, nor are you going to use this as part of your web app.

But there are lots of situations in reporting, batch/cron jobs, data processing, and general file administration where it's far easier to use the file system interface than an HTTP API via a cloud storage library, and for those FUSE is a godsend. The latency doesn't matter in these cases: one-off tasks, or scripts that already take seconds, minutes, or hours anyway.

So no this isn't niche or a toy. It's a fantastic production tool for a lot of different common uses. It's not for everything but nothing is. Use the right tool for the job.

In the old days, we had a system called NFS (Network File System) where, yes, you may decide to use only remote disks. There were several advantages apart from lowering the cost of disks, mainly that you could centrally manage boot images for a fleet of machines. Then we got the web and everyone seemed to assume you could do the same thing over the internet.

I agree with you: I would prefer a local disk to one with 100+ msec of latency, and local storage prices are at the point where the right answer is probably "just add local storage."

But I watch with some sympathy the small army of sys-admins (something like 15-20 people) responsible for managing the 3000+ Macs our company uses and remember the 2 person staff which supported the 1500+ diskless workstations from my years at a sadly defunct mini-super-computer manufacturer. It was quite nice... you could go to any machine and log in and your desktop would follow you. I'm told doing the same thing with MSFT requires 10-20 people just to manage the AD hardware (though as a unix-fan, I hang out with other unix-fans who are notoriously rude to MSFT, so maybe it's only 5-10 people needed to manage the AD instance.)

Not old days. NFS is still widely used in the industry. In fact, some NFS systems for high-end compute farms cost millions of dollars, e.g. Isilon.
I still use NFS in my home.
I do too. It just works. Though I boot off a local drive.
Applications for which filesystem-like access is important (i.e. requiring lots of POSIX file I/O system calls, e.g. read(2)/write(2)/lseek(2)) but latency is unimportant seem pretty niche to me. If you don't need any of the POSIX syscalls, it's not that much more difficult to work with bucket URLs vs. file paths — the general format is the same, i.e. slash-delimited file/directory hierarchies.
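As a rough illustration of how similar the two are when you only need names, here is a sketch that treats a bucket URL like a path using only the Python standard library. The helper names and the example URL are made up for illustration; no cloud SDK is involved.

```python
import posixpath
from urllib.parse import urlsplit


def split_bucket_url(url):
    """Split a gs:// or s3:// style URL into (bucket, object key)."""
    parts = urlsplit(url)
    return parts.netloc, parts.path.lstrip("/")


def sibling(url, name):
    """Return the URL of another object under the same prefix ("directory")."""
    parts = urlsplit(url)
    new_path = posixpath.join(posixpath.dirname(parts.path), name)
    return f"{parts.scheme}://{parts.netloc}{new_path}"


# Hypothetical example URL, not a real bucket.
url = "gs://example-bucket/reports/2020/q1.csv"
bucket, key = split_bucket_url(url)
next_quarter = sibling(url, "q2.csv")
```

The "directories" here are only key prefixes, so this covers naming and nothing else; anything like `lseek` or partial writes has no equivalent.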
Not everything is a webserver. There's a lot of software out there that wouldn't expect files to exist anywhere else besides on disk, and it's not worth fetching them all from cloud storage before you begin working on the data. It's easier just to GCSFuse a bucket to a VM and let the user do what they will. Works great for ad-hoc analysis of poorly or unstructured data.
And for your use case, the latency is not a concern? I suppose that would be true if you were mostly dealing with really big files and only cared about reading large contiguous chunks of them, but I would consider this a fairly niche application.

In my use case, taking ~1 second each time to `ls` a directory, `stat` a file, or `lseek` within a file was simply unacceptable. This was on a cloud VM, so the latency would be at its absolute minimum.
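For anyone who wants to check this on their own mount, here is a minimal timing sketch. It assumes nothing about gcsfuse itself: `time_metadata_ops` is a hypothetical helper that times one directory listing, one `stat` per entry, and one `lseek` on the first regular file, for whatever directory you point it at.

```python
import os
import time


def time_metadata_ops(directory):
    """Time listdir, a stat of every entry, and one lseek; return seconds."""
    timings = {}

    start = time.perf_counter()
    names = os.listdir(directory)
    timings["listdir"] = time.perf_counter() - start

    start = time.perf_counter()
    for name in names:
        # Do not follow symlinks, so a dangling link cannot raise here.
        os.stat(os.path.join(directory, name), follow_symlinks=False)
    timings["stat_all"] = time.perf_counter() - start

    files = [n for n in names if os.path.isfile(os.path.join(directory, n))]
    if files:
        fd = os.open(os.path.join(directory, files[0]), os.O_RDONLY)
        try:
            start = time.perf_counter()
            os.lseek(fd, 0, os.SEEK_END)
            timings["lseek"] = time.perf_counter() - start
        finally:
            os.close(fd)
    return timings
```

Running it once against a local directory and once against the mounted bucket gives a like-for-like comparison. Results will vary with kernel and FUSE attribute caching, so a second run on the same directory may look much faster than the first.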

In VFX, a single texture can run to terabytes.
The problem is that such systems have a habit of growing in scope until they reach a point where you really do need the more optimal access patterns of using the real HTTP APIs, and the inefficiencies of emulating the full filesystem API will gradually start to bite you. Maybe you’re lucky enough that that won’t happen, but it’s important to understand it for the quick hack job it is, IMO.
In most situations that time is years, decades, or ‘never’. Which is fine.

Not everyone or everything scales faster than bandwidth and/or CPU does.

I agree. For example if you want to use Google's ASR (Automated Speech Recognition), if your file is longer than 1 minute in duration, you first need to upload it to a bucket, which is a lot of added complexity compared to a regular HTTP POST.

Just copying the file to a mounted bucket would make this a lot easier.

Then again, how does one get the metadata of the uploaded file?

Calling any software system "niche" is kind of hilarious, as if anything that isn't Postgres is a massive failure. It's not supposed to be a high-performance cache of data.

My company uses GCSFuse for ad-hoc analysis/visualization of large but poorly structured output from our lifesciences jobs and it works just fine for that.

Yep. I once inherited a system where the previous team had used GCSFuse to back the `/etc/letsencrypt` directory on a cluster of nginx webservers. It "worked" and may have been a reasonable approach at the time they built it, avoiding setting up a single "master" to handle HTTP-01 challenges (and it was before GCP's HTTPS LB could handle more than a handful of domains/certificates). The problem was that as the number of domains/certificates it handled increased, nginx startup or config reload time got slower and slower as it insists on stat-ing and reading every single file in that directory in the process. It got high enough that it started running into request throttling on the storage bucket. It's no fun when `nginx -s reload` takes two minutes and sometimes fails completely.
The biggest mistake that previous team made was storing private keys unencrypted in the cloud, not the performance.
I mean... literally every VM running nginx or apache that I've ever seen has had the SSL certs just sitting on the filesystem in /etc/ssl or /etc/letsencrypt or similar... All of letsencrypt's documentation points people in that direction.
My understanding is that everything is encrypted by default in GCP. Though you need to manually configure encryption keys if you want to prevent Google ever having access to your data.
This I don't understand. Even if you configure KMS, those are still keys stored on Google infra.
You can use your own KMS outside the Google infrastructure. https://cloud.google.com/storage/docs/encryption/customer-su...
>What I would really like to see is a two-tier cache system

Is there any sort of Linux HSM (Hierarchical Storage Manager)? I haven't seen any and have been a bit surprised nothing has really developed there. They can manage putting hot data in RAM or on SSDs, colder or larger data on spinning rust, and deep-freezing onto a tape silo or cloud storage...

Some of the NAS devices and RAID cards can support a two-tier caching or data migration using SSDs, where hot or highly-random data (usually identified by smaller write sizes) go to the SSDs, and then can migrate to the spinning discs.

I've done a "poor man's" version of this using LVM, where I can "pvmove" blocks of a logical volume between spinning discs and SSDs, which is pretty slick, but a very crude tool.

CASTOR comes to mind for a start.

Take a look at the CERN paper https://iopscience.iop.org/article/10.1088/1742-6596/331/5/0... as they have a large use case.

Not a general kernel facility that I know of. I use nfscache every day though; my Steam data directory lives on NFS, and I set up nfscache with a 100GB LRU storage. This way I can avoid the "backup/restore" dance and have all my games installed, at the cost of waiting up to a few minutes to warm the cache for a new game.
I don't know about a manager per se but `bcachefs` for Linux seems to do a good chunk of what you're after.
I once evaluated using s3fuse for managing about 36 million images. The old storage model was on a filesystem so it was supposed to make a smooth transition to the cloud.

AWS Premium Support wisely advised me against it, not just because of latency but also because the abstraction makes /far/ more API calls than a native solution would.

After a bit of testing to confirm, I switched to using native API calls. That code was easy to write and the performance was great. I've been wary of cloud FUSE adapters ever since.

FUSE adapters in general are not for product/production use in my experience. They’re great for one-off convenience use, or basic admin scripts.
I'm working on optimizing FUSE using eBPF (ExtFUSE [1]) and adding a caching layer exactly as you mentioned. Will post publicly when ready.

1. https://github.com/extfuse/extfuse

Is work on this continuing (or restarting)? I had heard of this a few years ago, but thought the project was shelved.
The project is active (just not merged in the kernel yet). Please DM me for questions.
> What I would really like to see is a two-tier cache system: most recently accessed files are cached to RAM, with less recently accessed files spilling over to a disk-backed cache. That would open up a world of additional applications whose useful cache size exceeds practical RAM amounts

This is really hard to get right if the origin cloud storage is anything other than immutable. Otherwise you're in for a world of cache invalidation and consistency pain.
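A minimal sketch of why mutability hurts, under stated assumptions: `head` and `fetch` are hypothetical callbacks standing in for a cheap metadata request (returning something like a generation number or ETag) and a full download. The cache can only serve a hit safely after asking the origin whether the version changed, so every read still pays one round trip.

```python
class ValidatingCache:
    """Hypothetical cache that revalidates each entry against the origin."""

    def __init__(self, head, fetch):
        self.head = head    # assumed: key -> current version tag (cheap request)
        self.fetch = fetch  # assumed: key -> (version tag, bytes) (full download)
        self.entries = {}   # key -> (version tag, bytes)

    def get(self, key):
        current = self.head(key)  # still one round trip on every read
        cached = self.entries.get(key)
        if cached is not None and cached[0] == current:
            return cached[1]      # version unchanged: serve the cached bytes
        version, data = self.fetch(key)
        self.entries[key] = (version, data)
        return data
```

Skipping the `head` call and trusting entries for a fixed time removes the round trip but allows stale reads, which is exactly the trade-off that disappears when the origin is immutable.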

I've gradually come round to the other opinion: there should be devices that sit on the PCIe/NVMe bus and provide a blob storage API rather than a block one, and there should be an operating system blob API that is similar to but not identical to the filesystem one.

Same experience. I remember opening a .docx in Word and watching it hang or stutter at different operations. I think you'd need very reliable and low-latency networking for this to be anything but a painful-to-use toy.

I'd be curious to see how it works running on EC2, especially with an S3 endpoint in the VPC. Although I still think you'd be better served by using S3 as an object store, given the option to build it right.

Catfs is not super production-ready (there are some small changes you need to make in inode handling), but you can do this. We have it on top of goofys. They both need a few changes to work under load but what we do is quite standard:

1. Goofys for S3 FUSE

2. Catfs for local disk caching

3. Linux caches in memory

4. Mmap file means processes share it

5. One device then exports this over the network to other machines, each of which has an application layer disk cache.

6. Machines are linked via 10 GigE (we use SFP+).

Overall the goofys and catfs guy (kahing) wrote very performant software. Big fan.
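Steps 3 and 4 above can be sketched roughly as follows. This is generic Python, not code from goofys or catfs, and `read_slice_via_mmap` is a hypothetical helper. On Linux, read-only mappings of the same file in different processes are normally backed by the same page-cache pages, so a second process mapping the file does not need its own copy in RAM.

```python
import mmap


def read_slice_via_mmap(path, offset, length):
    """Map a file read-only and return one slice of it as bytes."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this raises ValueError for empty files.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]
```

Only the pages actually touched by the slice need to be read in, which is what makes this pleasant once the lower layers have the file cached locally.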

> most recently accessed files are cached to RAM, with less recently accessed files spilling over to a disk-backed cache

Isn't this how most servers run normally? (parts of) files which are accessed are in page cache, the rest is on "disk"

That page shows a `mkdir` is 3 JSON commands. I wonder if it's that many HTTP requests.
>>> every file system operation is fundamentally an HTTP request, so the latency is several orders of magnitude higher than the equivalent disk operation

gcsfuse latency is ok as it embodies "infinite sync & persistence" ;)

Well, and there's no such thing as opening a file and modifying some small part of it. That's emulated with a full rewrite of the whole object.
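A toy sketch of that emulation, using an in-memory stand-in rather than any real client library. `FakeBucket` and `patch_object` are hypothetical names; the only assumption is the one stated above, that the store supports whole-object reads and writes only.

```python
class FakeBucket:
    """In-memory stand-in for a store with whole-object get/put only."""

    def __init__(self):
        self.objects = {}
        self.bytes_transferred = 0

    def get(self, key):
        data = self.objects[key]
        self.bytes_transferred += len(data)
        return data

    def put(self, key, data):
        self.objects[key] = bytes(data)
        self.bytes_transferred += len(data)


def patch_object(bucket, key, offset, new_bytes):
    """Emulate an in-place write: download everything, patch, upload everything."""
    buf = bytearray(bucket.get(key))
    buf[offset:offset + len(new_bytes)] = new_bytes
    bucket.put(key, buf)
```

Changing two bytes of a 1000-byte object this way moves 2000 bytes in total, and the cost grows with the object size rather than with the size of the edit.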
Uh, how does it perform from a Google Compute Engine Virtual Machine?

If it performs well there, I could imagine that being pretty useful.

That is exactly where I tested it, and the latency was still abysmally poor (~1 second per file operation).

I don’t even want to know how bad the latency would be outside of a cloud VM.

Moreover, there is no SLA on those FUSE adapters so putting it into any part of production is too risky.
My personal conspiracy theory: most "cloud services" are just... bad.

VMs and disk space I understand completely; having machines on-prem is too much of a hassle and the price isn't that bad. But for stuff like this, managed services, databases especially, you're just getting scammed.