by amluto 887 days ago
I wish someone would try to scale the nodes down. The system described here is ~300W/node for 10 disks/node, so 30W or so per disk. That’s a fair amount of overhead, and it also requires quite a lot of storage to get any redundancy at all.

I bet some engineering effort could divide the whole thing by 10. Build a tiny SBC with 4 PCIe lanes for NVMe, 2x10GbE (as two SFP+ sockets), and a just-fast-enough ARM or RISC-V CPU. Perhaps an eMMC chip or SD slot for boot.

This could scale down to just a few nodes, and it reduces the exposure to a single failure taking out 10 disks at a time.

I bet a lot of copies of this system could fit in a 4U enclosure. Optionally the same enclosure could contain two entirely independent switches to aggregate the internal nodes.

9 comments

I used to run a 5 node Ceph cluster on a bunch of ODROID-HC2's [0]. Was a royal pain to get installed (armhf processor). But once it was running it worked great. Just slow with the single 1Gb NIC.

Was just a learning experience at the time.

[0] https://www.hardkernel.com/shop/odroid-hc2-home-cloud-two/

Same here, but on Pi 4Bs: a 6-node cluster with a 2 TB HDD and a 512 GB SSD per node. Ceph made a huge impression on me, as in I hadn't realized how extensive the package was. I got up to 122 MB/s and thought that was too little for my hack-NAS replacement :)

The functionality (mixing various pool types on the same set of SSDs, different redundancy types such as erasure coded and replicated) was very impressive. Now I can't help but look down on a RAID NAS in comparison. Still, some extra packages like the NFS exporter were not ready for the ARM architecture.

Nvidia's SODIMM compute module interface can prove this concept already. I have two 7W ARM Turing RK1s arriving soon, each with PCIe 3.0 x4 at 4GB/s, and the Turing Pi 2 cluster board can fit four in an ITX form factor. I'm expecting over 3Gbps per watt at a total cost of 820 USD.

PCIe lanes are the bottleneck so far - even my $90 2TB SSDs are rated at 7GB/s on PCIe 4.0 x4. So I don't think SBCs are the optimal solution yet. Looks like Ampere's Altra line can do PCIe 4.0 x128 at 40W, so a 1U blade with 100G networking could be interesting. I've seen lots of bugs and missing optimisations with ARM though, even in a homelab, so this kind of solution might not be ready for datacenters yet.
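
For anyone who wants to check the lane arithmetic, here is a rough sketch. It assumes the standard per-lane signalling rates (8 GT/s for PCIe 3.0, 16 GT/s for PCIe 4.0, both with 128b/130b encoding) and ignores protocol overhead, so real drives land a bit below these figures. The function names are made up for illustration.

```python
# Raw link bandwidth only; packet/protocol overhead is ignored (assumption).
# Per-generation values: (transfer rate in GT/s, line-encoding efficiency).
PCIE_GEN = {
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
}


def pcie_gbytes_per_sec(gen: int, lanes: int) -> float:
    """Approximate one-direction PCIe bandwidth in GB/s."""
    gt_per_sec, efficiency = PCIE_GEN[gen]
    return gt_per_sec * efficiency * lanes / 8  # 8 bits per byte


def ethernet_gbytes_per_sec(gbit_per_sec: float, ports: int = 1) -> float:
    """Approximate Ethernet line rate in GB/s."""
    return gbit_per_sec * ports / 8


gen3_x4 = pcie_gbytes_per_sec(3, 4)
gen4_x4 = pcie_gbytes_per_sec(4, 4)
dual_10gbe = ethernet_gbytes_per_sec(10, ports=2)
single_100gbe = ethernet_gbytes_per_sec(100)

print(f"PCIe 3.0 x4: {gen3_x4:.2f} GB/s")
print(f"PCIe 4.0 x4: {gen4_x4:.2f} GB/s")
print(f"2x10GbE:     {dual_10gbe:.2f} GB/s")
print(f"1x100GbE:    {single_100gbe:.2f} GB/s")
```

By this estimate a single PCIe 4.0 x4 SSD already outruns 2x10GbE, while one 100GbE port is in the range of a PCIe 4.0 x4 link and a half.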

10 Gbps is increasingly obsolete with very low cost 100 Gbps switches and 100 Gbps interfaces. Something would have to be really tiny and low cost to justify doing a Ceph setup with 10 Gbps interfaces now... If you're at that scale of very small stuff you are probably better off doing local NVMe storage on each server instead.

here's a weird calculation:

this cluster does something vaguely like 0.8 gigabits per second per watt (1 terabyte/s * 8 bits per byte * 1024 GB per TB / 34 nodes / 300 watts)

a new mac mini (super efficient arm system) runs around 10 watts in interactive usage and can do 10 gigabits per second network, so maybe 1 gigabit per second per watt of data

so OP's cluster, back of the envelope, is basically the same bits per second per watt that a very efficient arm system can do
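
The back-of-the-envelope numbers above can be reproduced with a few lines. This is only a sketch using the figures quoted in this comment (1 TB/s, 34 nodes, 300 W per node, and a roughly 10 W Mac mini with 10 Gbit/s networking); the helper name is made up.

```python
def gbit_per_sec_per_watt(throughput_gbit_s: float, nodes: int, watts_per_node: float) -> float:
    """Network throughput divided by total power draw."""
    return throughput_gbit_s / (nodes * watts_per_node)


# 1 TB/s expressed in Gbit/s, using the same 1024 GB per TB as the comment.
cluster_gbit_s = 1 * 8 * 1024

cluster = gbit_per_sec_per_watt(cluster_gbit_s, nodes=34, watts_per_node=300)
mac_mini = gbit_per_sec_per_watt(10, nodes=1, watts_per_node=10)

print(f"cluster:  {cluster:.2f} Gbit/s per W")
print(f"mac mini: {mac_mini:.2f} Gbit/s per W")
```

Note that this only compares network bandwidth per watt; it says nothing about storage capacity or the work done per byte.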

I don't think running tiny nodes would actually get you any more efficiency, and would probably cost more! performance per watt is quite good on powerful servers now

anyway, this is all open source software running on off-the-shelf hardware, you can do it yourself for a few hundred bucks

You're comparing one machine with many machines.

You're comparing raw disks with shards and erasure coding.

Lastly, you're comparing only network bandwidth and not storage capacity.

I think the Mac Mini has massively more compute than needed for this kind of work. It also has a power supply, and computer power supplies are generally not amazing at low output.

I’m imagining something quite specialized. Use a low frequency CPU with either vector units or even DMA engines optimized for the specific workloads needed, or go all out and arrange for data to be DMAed directly between the disk and the NIC.

> or go all out and arrange for data to be DMAed directly between the disk and the NIC.

Ceph OSDs do a lot more work than you're imagining.

sounds like a DPU (Mellanox Bluefield for example), they're entire ARM systems with a high speed NIC all on a PCIe card, I think the Bluefield ones can even directly interface over the bus to NVMe drives without the host system involved

That Bluefield hardware looks neat, although it also sounds like a real project to program it :).

I can imagine two credible configurations for high efficiency:

1. A motherboard with a truly minimal CPU for bootstrapping but a fairly beefy PCIe root complex. 32 lanes to the DPU and a bunch of lanes for NVMe. The CPU doesn’t touch the data at all. I wonder if anyone makes a motherboard optimized like this — a 64-lane mobo with a Xeon in it would be quite wasteful but fine for prototyping I suppose.

2. Wire up the NVMe ports directly to the Bluefield DPU, letting the DPU be the root complex. At least 28 of the lanes are presumably usable for this or maybe even all 32. It’s not entirely clear to me that the Bluefield DPU can operate without a host computer, though.

I checked selling prices of those racks + top end SSDs: this 1 TB/s achievement runs on $4 million worth of cluster hardware. Or more, since I didn't check the network interface costs.

But yeah, it could run on commodity hardware. I'm not sure those highly efficient ARM chips, sold at a premium by Apple, would beat the Dell racks on throughput relative to hardware cost, though.

Dell’s list prices have essentially nothing to do with the prices that any competent buyer would actually pay, especially when storage is involved. Look at the prices of Dell disks, which are nothing special compared to name brand disks of equal or better spec and much lower list price.

I don’t know what discount large buyers get, but I wouldn’t be surprised if it’s around 75%.

Agreed, and the specs in the story in fact show they didn't provision add-ons such as specific SSDs from Dell.

Still well over $1M for the cluster: skeletons of racks with just CPUs and RAM.

Trusting your maths, damn, Apple did a great job on their M design.

Didn't ARM (the company that originally designed ARM processors) do most of that job, and Apple pushed perf to consumption even further?

I think the chief source of inefficiency in this architecture would be the NVMe controller. When the operating system and the NVMe device are at arm's length, there is natural inefficiency, as the controller needs to infer the intent of the request and do its best in terms of placement and wear leveling. The new FDP (flexible data placement) features try to address this by giving the operating system more control. The best thing would be to just hoist it all up into the host operating system and present the flash, as nearly as possible, as a giant field of dumb transistors that happens to be a PCIe device. With layers of abstraction removed, the hardware unit could be something like an Atom with integrated 100 Gbps NICs and a proportional amount of flash to achieve the desired system parallelism.

Is that a lot of overhead? The disk itself uses about 10W and high speed controllers use about 75W, which leaves pretty much 100W for the rest of the system, including overhead of about 10%. Scale up the system to 16 disks and there’s not a lot of room for improvement.

I have always wanted to set up a Ceph system with one drive per node. The ideal form factor would be a drive with a couple of network interfaces built in. Western Digital had a press release about an experiment they did that was exactly this, but it never ended up as a drive you could buy.

The Hardkernel HC2 SoC was a nearly ideal form factor for this, and I still have a stack of them lying around that I bought to make a Ceph cluster, but I ran out of steam when I figured out they were 32-bit. Not to say it would be impossible; I just never did it.

That would be perfect. Unfortunately, going by the data sheet, it would not run Ceph; you would have to work with Seagate's proprietary object store. I will note that as far as I can tell it is unobtainium. None of the usual vendors stock them; you probably have to prove to Seagate that you are a "serious enterprise customer" and commit to a thousand units before they will let you buy some.

I used to use Ceph Luminous (v12) on these, they worked fine. Unfortunately, a bug in Nautilus (v14) prevented 32-bit and 64-bit archs from talking to each other. Pacific (v16) allegedly solves this, but I didn't try it: https://ceph.com/en/news/blog/2021/v16-2-5-pacific-released/

If you want to try it with a more modern (and 64-bit) device, the Hardkernel HC4 might do it for you. It's conceptually similar to the HC2 but has two drives. Unfortunately it only has double the RAM (4GB), which is probably not enough anymore.

Looks so good, wish for a > 1 Gbit version, since HDDs alone can saturate that

Did you look at their H3? It's pricier but it has two 2.5 Gbit ports (along with an NVMe slot and an Intel CPU)

I have one and love it! It bravely holds together my intranet dev services :)

For a Ceph node I would still consider a version with 10 Gbit Ethernet

There probably is a sweet spot for power to speed, but I think it's possibly a bit larger than you suggest. There's overhead from the other components as well. For example, the Mellanox NIC seems to use about 20W itself, and while the reduced number of drives might allow for a single-port NIC, which seems to use about half the power, if we're going to increase the number of cables (3 per 12 disks instead of 2 per 5), we're not just increasing the power usage of the nodes themselves but also possibly increasing the power usage or changing the type of switch required to combine the nodes.

If looked at as a whole, it appears to be more about whether you're combining resources at a low level (on the PCI bus on nodes) or a high level (in the switching infrastructure), and we should be careful not to push power (or complexity, as is often a similar goal) to a separate part of the system that is out of our immediate thoughts but still very much part of the system. Then again, sometimes parts of the system are much better at handling the complexity for certain cases, so in those cases that can be a definite win.

IIRC, WD has experimented with placing Ethernet and some compute directly onto hard drives some time back.

sigh I used to do some small-scale Ceph back in 2017 or so...