I think you're reading the title the way I first did, as "the external-facing portions of infrastructure." Reading the article, though, it seems clear that he's referring to the edges of a distribution curve (i.e., infrequent events that nonetheless impact user experience).
From the article: "It’s tempting to focus on the peak of the curve. That’s where most of the results are. But the edges are where the action is. Events out on the tails may happen less frequently, but they still happen. In digital systems, where billions of events take place in a matter of seconds, one-in-a-million occurrences happen all the time. And they have an outsize impact on user experience."
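To put rough numbers on that quote, here's a back-of-the-envelope sketch. The function name and the event counts are made up for illustration; they aren't from the article.

```python
def expected_rare_events(total_events: int, one_in: int) -> float:
    """Expected number of occurrences of a 1-in-`one_in` event
    across `total_events` independent events."""
    return total_events / one_in


# Hypothetical volume: one billion events, each with a one-in-a-million
# chance of hitting the rare case.
print(expected_rare_events(1_000_000_000, 1_000_000))  # 1000.0
```

So at that (assumed) volume, a "one-in-a-million" event isn't rare in absolute terms: you'd expect it about a thousand times.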
EDIT: IMO, the title is still a little annoying in this respect. I think everyone would agree that if a request to your site fails 5% of the time, that is unacceptable, even though it "usually works." Bringing in the distribution curve just to make the point that usage spikes cause backed-up queues, which hurt performance, isn't necessarily helpful as far as I can tell, and it seems to be done largely in service of the title. While reading, I kept thinking, "Okay, cool, but how does the fact that this interesting issue exists at the edge of the curve help me identify it?" Answer: It doesn't. If errors are occurring, you will investigate them once they are noticed. Being at the edge of the curve may mean it takes longer to notice, but what kind of alerting system are you using that discriminates against rare issues in favor of common ones?
The discussion of queues, over-provisioning, back pressure, etc. is all super interesting and helpful.
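For anyone who wants to see the queue point concretely, here's a toy sketch of a short usage spike backing up a queue. The arrival rates, the capacity, and the `simulate_backlog` helper are all assumptions I picked for illustration, not numbers from the article.

```python
def simulate_backlog(arrivals, capacity):
    """Return the queue backlog after each tick, given per-tick arrivals
    and a fixed number of requests the service can handle per tick."""
    backlog = 0
    history = []
    for arrived in arrivals:
        # Whatever can't be served this tick carries over to the next one.
        backlog = max(0, backlog + arrived - capacity)
        history.append(backlog)
    return history


# Steady load of 8 requests/tick against capacity 10, with a 3-tick spike of 30.
arrivals = [8] * 10 + [30] * 3 + [8] * 40
history = simulate_backlog(arrivals, capacity=10)

print("peak backlog:", max(history))
print("ticks with a non-empty queue:", sum(1 for b in history if b > 0))
```

With only 2 requests/tick of headroom, the 3-tick spike leaves the queue non-empty for far longer than the spike itself lasted, which is roughly the argument for over-provisioning or back pressure.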