by lgeek 110 days ago
I hate it when I can't get in touch with the right engineers at a large company. This (especially the highly targeted ad mentioned on the page) is a very creative way to try to solve that problem.

I'm not associated with Meta, but this piqued my interest. That said, I found some parts confusing and hard to follow. For example, what does uRPF (Unicast Reverse Path Forwarding) in the title of this submission have to do with the contents?

And is the packet loss supposedly happening only at specific times? It's not mentioned anywhere, but one screenshot highlights the time. I couldn't reproduce the packet loss using any of the looking glasses and destination IP addresses in the screenshots. At this point, if this were a report I had received about one of my services, I would probably have bumped the priority down to low and asked for a reproducible test, because in my experience even issues that affect a single path in an ECMP group are not this hard to reproduce. I think it's far more important to give the engineer who will process the report an easy way to check that there is indeed a problem than to explain how traceroute works.

TBF, there does seem to be an issue somewhere: putting 129.134.80.234, one of the Meta IP addresses from a screenshot, into ping.pe shows significant packet loss from more locations than you'd expect for an address with no connectivity issues.

2 comments

Addressing the question of what uRPF has to do with this: it's possible (unlikely, but possible) that Meta hasn't been able to find the issue because, while it looks like a faulty interface within a bundle (and it probably is), it could also be that an internal router has uRPF accidentally enabled and is receiving asymmetric traffic, causing it to drop packets on that path. Only Meta would know for sure. I included it in the title to give them a lead; it can really only be one of two things: uRPF on an interface participating in an ECMP group, or an interface dropping packets at the hardware level within a bundle.
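To make the uRPF scenario concrete, here is a minimal Python sketch of a strict-mode uRPF check. It is an illustration only: the routing table, interface names, and function names are all hypothetical, and nothing here reflects how Meta's routers are actually configured. The idea is that a packet is accepted only if the router's best route back to the packet's source points out of the same interface the packet arrived on, so asymmetric traffic gets dropped.

```python
import ipaddress

# Hypothetical routing table: prefix -> outgoing interface (illustrative values only).
ROUTES = {
    "10.0.0.0/8": "eth0",
    "10.1.0.0/16": "eth1",
    "0.0.0.0/0": "eth2",
}

def best_route_interface(src_ip, routes=ROUTES):
    """Longest-prefix match: the interface the router would use to reach src_ip."""
    addr = ipaddress.ip_address(src_ip)
    best = None
    for prefix, iface in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    return best[1] if best else None

def strict_urpf_accepts(src_ip, arrival_iface, routes=ROUTES):
    """Strict uRPF: accept only if the return route uses the arrival interface."""
    return best_route_interface(src_ip, routes) == arrival_iface

# Symmetric path: the packet arrives on the interface the return route uses.
print(strict_urpf_accepts("10.1.2.3", "eth1"))
# Asymmetric path: same source, but it arrives on a different interface and is dropped.
print(strict_urpf_accepts("10.1.2.3", "eth0"))
```

With ECMP in the picture, only the flows that happen to take the asymmetric path would be dropped, which from the outside would look much like a single bad member link.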
Hey man. I agree. This issue has been going on for nearly six months, and they've been closing my tickets; it's honestly a joke at this point. Back in 2023, the exact same thing happened, and I had to resort to social engineering just to get them to find the problem; they fixed it a day later. I'm not proud of doing that, but I have to emphasize it because Meta has built performance dashboards designed to delude themselves.

Packet loss is happening all the time, though it might be more noticeable during peak hours, since a faulty interface will show a higher error rate under heavy load. You can replicate it using looking glasses; maybe you didn't see it five days ago, but you may see it now. Since it's an ECMP issue, it depends heavily on which source and destination servers you're testing. It's just a matter of iterating.
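A rough Python sketch of why reproduction depends on the source and destination pair: ECMP typically picks a member link by hashing flow fields, so only flows that hash onto the faulty link see loss. Everything here is an assumption for illustration: the SHA-256 hash stands in for whatever hash the real hardware uses, the addresses are documentation-range placeholders, and the link count and faulty link index are made up.

```python
import hashlib

def ecmp_member(src_ip, dst_ip, src_port, dst_port, proto, n_links):
    """Pick a member link by hashing the 5-tuple (a stand-in for a vendor hash)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links

def affected_flows(flows, n_links, faulty_link):
    """Return the flows whose hash lands on the (hypothetical) faulty link."""
    return [f for f in flows if ecmp_member(*f, n_links) == faulty_link]

# Iterate over source ports for one source/destination pair (placeholder addresses).
flows = [("192.0.2.10", "198.51.100.7", sport, 443, "tcp")
         for sport in range(40000, 40064)]
bad = affected_flows(flows, n_links=4, faulty_link=2)
print(f"{len(bad)} of {len(flows)} flows hash onto the faulty link")
```

A test from a single looking glass to a single address exercises only a handful of these flows, so it can easily miss the bad link; varying the source, destination, or ports is what eventually lands a probe on it.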

I'm glad you were able to replicate it on ping.pe. Meta, however, still has no clue.