by latchkey 1254 days ago
I created a system that booted 12k+ diskless blades via PXE, all running Ubuntu (it was built to scale to 30k+, but we never got there).

This generally works well, but I'd say anywhere from 0 to 20 blades crash per day due to some sort of memory corruption issue.
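
For a sense of scale, here is a rough sketch of the worst-case arithmetic. It assumes exactly 12,000 blades (the low end of "12k+"), 20 crashes on a bad day, and crashes spread evenly across the fleet; the function names are just for illustration.

```python
# Back-of-the-envelope crash rate using the figures quoted above.
# Assumptions: fleet size is taken as exactly 12,000 and crashes are
# spread evenly across blades, which may not match the real fleet.

FLEET_SIZE = 12_000              # "12k+" blades, lower bound
WORST_CASE_CRASHES_PER_DAY = 20  # upper end of the 0 to 20 range


def daily_crash_fraction(crashes_per_day: int, fleet_size: int) -> float:
    """Fraction of the fleet that crashes on a given day."""
    return crashes_per_day / fleet_size


def days_between_crashes_per_blade(crashes_per_day: int, fleet_size: int) -> float:
    """Average days one blade runs between crashes under the even-spread assumption."""
    if crashes_per_day == 0:
        return float("inf")  # no crashes that day, so no finite interval
    return fleet_size / crashes_per_day


fraction = daily_crash_fraction(WORST_CASE_CRASHES_PER_DAY, FLEET_SIZE)
interval = days_between_crashes_per_blade(WORST_CASE_CRASHES_PER_DAY, FLEET_SIZE)

print(f"{fraction:.3%} of blades per day")              # 0.167% of blades per day
print(f"~{interval:.0f} days between crashes per blade")  # ~600 days between crashes per blade
```

So even on a bad day, an individual blade would see a crash roughly once every 600 days under these assumptions.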

Because I was operating remotely from the hardware, I never really got a chance to resolve it. Also, a simple reboot would fix it (and the blades booted in ~60 seconds, so it wasn't a huge issue).

So, at a large enough scale, this can be an issue to consider.

1 comment

Is that caused or exacerbated by being diskless, though? Or is it just inevitable that 12k+ machines are going to have a certain rate of memory errors regardless?
That's the thing, I don't know. It could be a whole bunch of issues, but I thought it was interesting to note what I see at scale while doing this.