Hacker News
by csdvrx 875 days ago
> One counter-intuitive thing we found is that it's slow to save and restore caches, but the machines have good CPU, so for us it's been faster to disable cache entirely and just redo everything on each build.

The link to SPDK was very interesting: https://www.ubicloud.com/blog/building-block-storage-for-clo.... I use filesystems for very high performance applications, and I've found ZFS to often be the limiting factor when compared to simpler solutions such as XFS, with or without mdadm and encryption.

It's a controversial point, but others have made similar findings: https://klarasystems.com/articles/virtualization-showdown-fr... : "Although I suspect this will surprise many readers, it didn’t surprise me personally—I’ve been testing guest storage performance for OpenZFS and Linux KVM for more than a decade, and zvols have performed poorly by comparison each time I’ve tested them"

OpenZFS seems to be starting to consider optimizations to perform better on modern drives (SSD, NVMe), which have very different performance profiles from what ZFS was built for (spinning rust).

In the SPDK summary they say "To make VM provisioning times go faster, we changed our host OS from ext4 to btrfs" (...) "Also, when we switched the host filesystem to btrfs, our disk performance degraded notably. Our disk throughput dropped to about one-third of what it was with ext4."

Ubicloud: the problem seems to be generic to CoW filesystems, and it's interesting that you came up with a slight variation (CoA), but have you considered the even simpler alternative of a journaling filesystem (XFS, ext4...) with overlays?

Or just UFS2 + snapshots: capture a given state (initialized, ready for each test), then restore to that state between tests?

I think the fact that customers find disabling the cache works better means CoA has issues similar to CoW.

Personally, I'd have just tried using SR-IOV with a namespace per customer and called it a day instead of bringing in extra complexity, but there must be good reasons for it. I'd love to know what these reasons are.


In this case, I don't think the issue is due to filesystem performance. Someone from Ubicloud can correct me, but my understanding is that for custom runners GitHub still stores the cache on their side. So Ubicloud (in Europe) needs to transfer the cache from GitHub (in the US) on every run.

Hi, I work for Ubicloud.

Yes, that is correct. We are also working on implementing our own caching, which should speed up cache downloads/uploads significantly.
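
To make the trade-off in this thread concrete (cache transfer time versus simply redoing the build), here is a rough break-even sketch. The function name, parameters, and all numbers are hypothetical illustrations, not measurements from Ubicloud or GitHub; it assumes the cache is downloaded and re-uploaded once per run at the same bandwidth.

```python
def cache_is_worth_it(cache_size_mb, bandwidth_mb_per_s, rebuild_s, cached_build_s):
    """Return True if a cached build is expected to beat a clean rebuild.

    Assumption: the cache is restored (download) and saved (upload) once
    per run, both at the same effective bandwidth.
    """
    transfer_s = 2 * cache_size_mb / bandwidth_mb_per_s  # restore + save
    return transfer_s + cached_build_s < rebuild_s


# Hypothetical numbers: a 2000 MB cache, a 300 s clean build,
# and a 60 s build when the cache is warm.
slow_link = cache_is_worth_it(2000, 10, rebuild_s=300, cached_build_s=60)
fast_link = cache_is_worth_it(2000, 200, rebuild_s=300, cached_build_s=60)
print(slow_link, fast_link)
```

With these made-up figures, a slow remote link makes the cache a net loss while a fast nearby store makes it a clear win, which is consistent with the point above that the bottleneck may be the transfer rather than the filesystem.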