by beachstartup 4137 days ago
we don't exactly run the biggest operation, but in our experience the most common failure items across thousands of years of cumulative uptime are network interface cards (or on-motherboard network interfaces) and hdds.

RAID controllers fail left and right. we keep tons of spares around.

ssd failures are few and far between, cpus basically do not fail, and memory can go bad but it's exceedingly rare and easy to fix. psus fail but are also easy to fix in modern computers (slide-out, redundant, etc.)

having said all that, heat is the primary killer of hardware. if you run a lot of equipment in a dense environment, get a laser thermometer, find your hot spots, and fix them with some industrial fans or by moving your gear around. once your stuff gets hot, anything can fail in weird and mysterious ways.

3 comments

How do network cards fail? Simply as if you cut the link? Or can you see error counters increasing and sporadic frame loss that gets worse over time?

Depends on which bit fails, but increases in packet loss are a common early symptom of small components no longer acting within their specs.

Network cards are subject to lots of signal phenomena that are rare inside the chassis. Long cables are pretty good antennas for certain types of RF signals, so there are all kinds of electrical noises, induced power spikes and other miscellaneous garbage that the network card has to tolerate. Well-shielded cables can help protect the card, but it's definitely one interface that's subject to a bit more electrical abuse than the rest.

Components that have been stressed beyond their tolerances a few times can result in things like signal filters having a lower noise threshold, which makes it harder for the card to pick out the signal from the noise, which leads to more packet loss. After enough abuse, the threshold drops below the usable level and communications halt.

There are lots of factors involved, such as shielding, proximity to nearby radiators, bend radius in cables, cable length, temperature, etc, etc. Whenever I delve into this world, I'm often amazed that anything works at all.
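
For the error-counter question above, here is a rough sketch of how one might watch for loss that gets worse over time. It assumes a Linux host, where the kernel exposes cumulative per-interface counters under /sys/class/net/<iface>/statistics/. The function names and the 1% threshold are made up for illustration, and whether rx_packets already excludes errored frames depends on the driver, so treat the ratio as approximate.

```python
from pathlib import Path

# Assumption: Linux sysfs layout, /sys/class/net/<iface>/statistics/<counter>.
COUNTERS = ("rx_packets", "rx_errors")


def read_counters(iface, root="/sys/class/net"):
    """Read the cumulative receive counters for one interface."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def error_ratio(before, after):
    """Share of frames between two snapshots that were counted as errors."""
    good = after["rx_packets"] - before["rx_packets"]
    bad = after["rx_errors"] - before["rx_errors"]
    if good < 0 or bad < 0:
        # Counters went backwards (driver reload, reboot): skip this interval.
        return None
    total = good + bad
    return bad / total if total else 0.0


def is_worsening(ratios, threshold=0.01):
    """True if the latest ratio is above the threshold and above the first one."""
    usable = [r for r in ratios if r is not None]
    return len(usable) >= 2 and usable[-1] > threshold and usable[-1] > usable[0]
```

Calling read_counters("eth0") every minute or so and feeding consecutive snapshots to error_ratio gives the series that is_worsening looks at; "eth0" is only a placeholder interface name.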

> How do network cards fail?

Every way they can. I've found them with blown transistors, dead rats attached, no physical imperfections, etc.

Usually for me it's been some kind of hard failure, e.g. completely dead.

failure modes are all over the map. sometimes they just start dropping more and more packets, sometimes it "looks like it's working" but there's no layer 1 link light, sometimes it's incredibly high latency, sometimes the entire card just disappears from view.

this mostly happens with the on-board controllers. add-in nics don't fail as often, but we do use high-end nics (intel 10g and 4x 1g)
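
Two of the failure modes above (no link, and the card disappearing from view) can be told apart from software. A minimal sketch, again assuming Linux sysfs: the interface directory is absent when the device is gone, and the carrier file reads 1 or 0 for link state. The function name and the returned labels are hypothetical, not an existing tool.

```python
from pathlib import Path


def classify_link(iface, root="/sys/class/net"):
    """Return a coarse state for one interface: missing, admin-down, no-link, link-up."""
    dev = Path(root) / iface
    if not dev.exists():
        # The kernel no longer lists the device at all.
        return "missing"
    try:
        carrier = (dev / "carrier").read_text().strip()
    except OSError:
        # On Linux, reading carrier fails while the interface is administratively down.
        return "admin-down"
    return "link-up" if carrier == "1" else "no-link"
```

This says nothing about the "looks like it's working" or high-latency cases, which need traffic-level checks.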

High-end consumer motherboards often include 2 integrated NICs. Over the last decade I've owned four and had one of the NICs fail after 2-3 years on every single motherboard. Glad to know it's endemic, and Danpat's explanation is fascinating.

> RAID controllers fail left and right. we keep tons of spares around.

Kind of scary. I would guess the replacement should be perfectly identical, to the last firmware bit (... and giving thanks that no subtle circuit timing factors are involved).

Network cards? HDDs and power supplies are my most common deaths in the server room.

Same. RAM chasing them up too. Almost never had a network card failure.