by iso1631 63 days ago
Been through this recently in a fairly large enterprise

We have some in-house software which runs in k8s. Total throughput peaks at about 1 Mbit/s of control traffic - it's controlling some other devices which are on dedicated hardware. Total of 24GB of RAM.

The software team say it needs to run across 3 different servers for resilience purposes.

The VM team want to use neutronix as their VM platform, so they can live migrate a VM from one host to another.

They insist on 25 Gbit networking, and for resilience purposes that needs to be MLAG'd.

The network team also have to have multiple switches and routers, again for resilience.

So rather than having 3 $1000 laptops running bare metal kubes hanging off a pair of $500 1G switches eating maybe 200W, we have a $140k BOM sucking up 2kW.

When something goes wrong all those layers of resilience will no doubt fight each other. The hardware drops, so the VM freezes as it's restored onto another host, so k8s moves the workloads, then the VM comes back, then k8s gets confused (maybe? I don't know how k8s works).

It's all needlessly overspecced, costing 30 times as much as it should.

But from each individual team it makes sense. They don't want to be blamed if it doesn't work, they don't have to find the money. It's different departments.
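As a rough sanity check on the figures above, here is the arithmetic spelled out. It uses only the numbers quoted in the comment; the variable names are just for illustration, and it assumes the 200W and 2kW figures each cover the whole setup.

```python
# Figures as quoted in the comment above; nothing here is measured.
laptops = 3 * 1000          # three $1000 laptops
switches = 2 * 500          # a pair of $500 1G switches
simple_cost = laptops + switches

enterprise_cost = 140_000   # the $140k BOM
cost_ratio = enterprise_cost / simple_cost

simple_power_w = 200        # "maybe 200W"
enterprise_power_w = 2000   # 2kW
power_ratio = enterprise_power_w / simple_power_w

print(f"simple setup: ${simple_cost}, cost ratio: {cost_ratio:.0f}x, "
      f"power ratio: {power_ratio:.0f}x")
```

So the "30 times" estimate is in the right ballpark on cost, and the power draw is about an order of magnitude higher.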


One of my favorite bits of hardware is a UPS. I’ve played with several over the years, from fancy server-grade rack-mount APC stuff to inexpensive edge stuff. Without exception, downtime is increased by use of a UPS. I used to plug a server with redundant PSUs into the UPS and the wall so it could ride out UPS glitches.

Even today, a UPS that turns itself back on after power goes out long enough to drain the battery and is then restored is somewhat exotic. Amusingly, even the new UniFi UPSes, which are clearly meant to be shoved in a closet somewhere, supposedly turn off and stay off when the battery drains, according to forum posts. There are no official docs, of course.

Sounds like crappy UPSes. Even the cheap old used eBay Eaton UPSes I have in my homelab have a setting for "Auto restart" and the factory default setting is "enabled".

But even rackmount UPSes are more of an "edge" sort of solution. A data center UPS takes up at least a room.

I assume that datacenter UPSes are better, but I’ve never used one except as a consumer of its output.

But I’ve had problems with UPSes that advertise auto-restart but don’t actually ship with it enabled. And that fancy APC unit was sold by fancy Dell sales people and supported directly by real humans at APC, and it would still regularly get mad, start beeping, and turn off its output despite its battery being fully charged and the upstream power being just fine (and APC’s techs were never able to figure it out either).

> I assume that datacenters UPSes are better [...]

I don't know about specific datacenter models, but in our colocation there are humans available 24/7. So the UPS might not start after failure, but there's a human to figure it out.

Most (all?) decent datacenters also have generators on site, and the intent is that the UPS will never run out of charge. So the fully-discharged case is an error and it might be intentional to require intervention to recover.

Yeah, some people treat UPSes as "backup power" but that's not really what they're intended for. Their intended purpose is to bridge the gap during interruptions... either to an alternative power source, or to a powered-off state.

The funniest thing about huge enterprises is that they often have processes so convoluted and restrictive for everything, that getting stuff done by the book is basically impossible, so people get creative with the limitations and we often end up with the sketchiest solutions in existence.

I hope the words 'web server hosted in Excel VBA' illustrate the magnitude of horrors that can emerge in these situations.

Raspberry Pi on a network-controlled power supply to rebroadcast UDP broadcast traffic across subnets.
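For anyone wondering what a box like that actually does, here is a minimal sketch of a UDP broadcast relay. This is not the setup described above, just an illustration: it assumes the Pi has an interface on each subnet, and the function names, port, and addresses are all hypothetical.

```python
import ipaddress
import socket


def targets_for(src_ip, networks, own_ips):
    """Return the broadcast addresses a packet from src_ip should be copied to.

    Packets from the relay's own addresses are dropped, otherwise the relay
    would hear its own rebroadcasts and loop forever. The subnet the packet
    came from is skipped so hosts there don't see it twice.
    """
    if src_ip in own_ips:
        return []
    src = ipaddress.ip_address(src_ip)
    return [str(net.broadcast_address) for net in networks if src not in net]


def make_listener(port):
    """UDP socket bound to all interfaces, so it receives broadcasts on port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    return sock


def make_sender():
    """UDP socket that is allowed to send to broadcast addresses."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return sock


def relay_forever(port, networks, own_ips):
    """Copy each broadcast received on port to every other listed subnet."""
    listener = make_listener(port)
    sender = make_sender()
    while True:
        data, (src_ip, _src_port) = listener.recvfrom(65535)
        for bcast in targets_for(src_ip, networks, own_ips):
            sender.sendto(data, (bcast, port))
```

You'd run it with something like `relay_forever(9999, [ipaddress.ip_network("192.168.1.0/24"), ipaddress.ip_network("192.168.2.0/24")], {"192.168.1.2", "192.168.2.2"})`, where the port and addresses are made up. Whether this is any less sketchy than fixing the routing properly is left as an exercise.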
I saw an entire physical switch configured for bridging VLANs. It was even labeled as such. 802.1q is hard and confusing if you don't know what you're doing.

which is exactly why this being different departments makes no sense

one infra team - provides the entire platform

any other approach and you’re dicking around