by torginus 63 days ago
I mean, we literally did this at one of my previous places. We took all the old laptops that IT was about to junk and used them as a Selenium test farm. We saved like $100k per month on the AWS bill at the cost of basically electricity.

If all the machines were running Windows, the difference would've been even more drastic.

What I don't get is that we have these autoscaling technologies that allow software to be fault tolerant to hardware failure, yet companies still insist on buying expensive server grade HW for everything.

Been through this recently in a fairly large enterprise

We have some in-house software which runs in k8s. Total throughput peaks at about 1 Mbit/s of control traffic; it's controlling some other devices which are on dedicated hardware. It uses a total of 24 GB of RAM.

The software team say it needs to run across 3 different servers for resilience purposes.

The VM team want to use Nutanix as their VM platform, so they can live migrate a VM from one host to another.

They insist on 25 Gbit networking, and for resilience purposes that needs to be MLAG'd.

The network team also have to have multiple switches and routers, again for resilience.

So rather than having 3 $1000 laptops running bare metal kubes hanging off a pair of $500 1G switches eating maybe 200W, we have a $140k BOM sucking up 2kW.

When something goes wrong, all those layers of resilience will no doubt fight each other. The hardware drops, so the VM freezes while it is restored onto another host, so k8s moves the workloads, then the VM comes back and k8s gets confused (maybe? I don't know how k8s works).

It's all needlessly overspecced, costing 30 times as much as it should.
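For anyone who wants to check the arithmetic, here is a quick sketch using only the figures quoted above (3 laptops at $1000, a pair of $500 switches, 200 W, versus a $140k BOM at 2 kW). The function name is made up for illustration, and it compares purchase cost and power draw only, not support contracts or staff time.

```python
# Back-of-the-envelope comparison using the figures from the comment above.
# compare_boms is a hypothetical helper name, not anything from a real tool.

def compare_boms(cheap_cost, cheap_watts, big_cost, big_watts):
    """Return (cost ratio, power ratio) of the big setup over the cheap one."""
    return big_cost / cheap_cost, big_watts / cheap_watts

# 3 laptops at $1000 each plus a pair of $500 1G switches, drawing ~200 W.
cheap_cost = 3 * 1000 + 2 * 500
cheap_watts = 200

# The enterprise build: $140k BOM drawing ~2 kW.
big_cost = 140_000
big_watts = 2000

cost_ratio, power_ratio = compare_boms(cheap_cost, cheap_watts, big_cost, big_watts)
print(f"cheap BOM: ${cheap_cost}")
print(f"cost ratio: {cost_ratio:.0f}x, power ratio: {power_ratio:.0f}x")
```

On these numbers the cost ratio comes out around 35x and the power ratio around 10x, which is in line with the "30 times" estimate.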

But from each individual team it makes sense. They don't want to be blamed if it doesn't work, they don't have to find the money. It's different departments.

One of my favorite bits of hardware is a UPS. I’ve played with several over the years, from fancy server-grade rack-mount APC stuff to inexpensive edge stuff. Without exception, downtime is increased by use of a UPS. I used to plug a server with redundant PSUs into the UPS and the wall so it could ride out UPS glitches.

Even today, a UPS that turns itself back on once power is restored after an outage long enough to drain the battery is somewhat exotic. Amusingly, even the new UniFi UPSes, which are clearly meant to be shoved in a closet somewhere, supposedly turn off and stay off when the battery drains, according to forum posts. There are no official docs, of course.

Sounds like crappy UPSes. Even the cheap old used eBay Eaton UPSes I have in my homelab have a setting for "Auto restart" and the factory default setting is "enabled".

But even rackmount UPSes are more of an "edge" sort of solution. A data center UPS takes up at least a room.

I assume that datacenter UPSes are better, but I’ve never used one except as a consumer of its output.

But I’ve had problems with UPSes that advertise auto-restart but don’t actually ship with it enabled. And that fancy APC unit was sold by fancy Dell sales people and supported directly by real humans at APC, and it would still regularly get mad, start beeping, and turn off its output despite its battery being fully charged and the upstream power being just fine (and APC’s techs were never able to figure it out either).

> I assume that datacenter UPSes are better [...]

I don't know about specific datacenter models, but in our colocation there are humans available 24/7. So the UPS might not start after failure, but there's a human to figure it out.

Most (all?) decent datacenters also have generators on site, and the intent is that the UPS will never run out of charge. So the fully-discharged case is an error and it might be intentional to require intervention to recover.

The funniest thing about huge enterprises is that they often have processes so convoluted and restrictive for everything that getting stuff done by the book is basically impossible, so people get creative with the limitations and we often end up with the sketchiest solutions in existence.

I hope the words 'web server hosted in Excel VBA' illustrate the magnitude of horrors that can emerge in these situations.

A Raspberry Pi on a network-controlled power supply, rebroadcasting UDP broadcast traffic across subnets.

I saw an entire physical switch configured for bridging VLANs. It was even labeled as such. 802.1Q is hard and confusing if you don't know what you're doing.

which is exactly why this being different departments makes no sense

one infra team - provides the entire platform

any other approach and you’re dicking around

Enterprise hardware has companies that your company can call to get support when things go sideways. If they're using a rack full of 5-year-old ThinkPads, then they're on their own if something breaks.

I believe they are referring to the dumpster support model. The hardware is so cheap that, if it fails, you toss it in a dumpster and buy more by the gross. Using Kubernetes to spread loads across your less reliable nodes ensures high availability. Sometimes this can be even more reliable because you are regularly testing your recovery and backup features and your hardware is more varied.
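A rough way to see why many cheap nodes can beat one good one is the sketch below. It assumes node failures are independent (which shared power, firmware, or a shared switch can easily break), and the per-node availability figures are invented for illustration, not measured.

```python
from math import comb

def service_availability(n_nodes: int, min_up: int, node_availability: float) -> float:
    """Probability that at least min_up of n_nodes are up at a given moment.

    Assumes every node fails independently with the same availability,
    which is a simplification.
    """
    p = node_availability
    return sum(
        comb(n_nodes, up) * p**up * (1 - p) ** (n_nodes - up)
        for up in range(min_up, n_nodes + 1)
    )

# Hypothetical numbers: one server that is up 99.9% of the time,
# versus five junk laptops that are each up only 95% of the time,
# where the workload needs any two of them.
one_server = service_availability(1, 1, 0.999)
five_laptops = service_availability(5, 2, 0.95)

print(f"one server:   {one_server:.6f}")
print(f"five laptops: {five_laptops:.6f}")
```

The point is only that redundancy across unreliable nodes compounds quickly. The independence assumption is the weak part, and it is exactly what a shared firmware bug would violate.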

The downside is that if some piece of firmware or hardware has a vulnerability, you have a larger attack surface.

There's a ton of out-of-support enterprise gear racked up in data centers. It can be done if you have a plan to handle failures.

But that's still a lot easier than managing laptops, which are unwieldy in a DC for a lot of other reasons.

We didn't have support, and we didn't need it, as the hardware was essentially EOL and probably would've been sold for like 20% of the new price. We just chucked Selenium Grid on them, locked them in the storage room, and if they died, they died (they didn't die a lot tho, which is surprising, as we had quite a few cheap sketchy ones in there as well).

I can deconstruct my workflow to the point where the benefits of plugging outdated hardware into the project are calculable. The cost of info, transformation, etc. that I don't need in near real time feels like it's trending towards the price of electricity.

Since I've been looking at this situation from a resource point of view for a bit, I see obvious savings in slowing down certain accepted processes. For example, an entity that continuously updates needs to be continuously scraped, while an entity that publishes once a day needs to be hit once a day.
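To put a number on that, here is a minimal sketch that matches the polling rate to how often each source publishes. The source names and intervals are hypothetical, and "continuously" is approximated as once a minute.

```python
import math
from datetime import timedelta

DAY = timedelta(days=1)

def polls_per_day(publish_interval: timedelta) -> int:
    """Polls needed per day if we hit a source once per publish interval.

    Sources that publish less often than daily are still polled once a day
    in this sketch.
    """
    return max(1, math.ceil(DAY / publish_interval))

# Hypothetical sources: one that updates "continuously" (once a minute here)
# and one that publishes once a day.
sources = {
    "live_feed": timedelta(minutes=1),
    "daily_report": timedelta(days=1),
}

# Polling each source at its own cadence.
matched = sum(polls_per_day(interval) for interval in sources.values())

# Polling everything at the fastest cadence, as a naive scraper might.
naive = len(sources) * polls_per_day(min(sources.values()))

print(f"matched cadence: {matched} requests/day")
print(f"naive cadence:   {naive} requests/day")
```

With only two sources the saving is about half. It grows with the share of sources that publish rarely.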

Seems like they'd have to find another 5-year-old ThinkPad.

> What I don't get is that we have these autoscaling technologies that allow software to be fault tolerant to hardware failure, yet companies still insist on buying expensive server grade HW for everything.

Simple: the cost of managing the hardware scales with its heterogeneity and unreliability. Even just dealing with the dozens of different form factors (air vent placement!) and power units of laptops would be a big headache.

> We saved like $100k per month on the AWS bill

Did you also compare the bill to places that are not AWS, not Azure, and not GCP?

I would agree with you about autoscaling if ECC was enabled in every consumer computer :'/