
by toast0 10 days ago

I'm 22 years into real professional software development (I've semi-retired, but the world still needs debugging). Plus a few years of junior level IT/sysadmin stuff. Of course, my code is perfect by now. But my code runs on other people's code. And nobody else writes perfect code.

So I have to debug other people's libraries and operating systems. And other people's networks. Turns out other people often make similar mistakes. Some people say 'select isn't broken', but lots of things are[1]. Most of my debugging stories would tend to be centered around a problem that my team found/uncovered, not one that we created... although certainly I did make some bugs in my youth (definitely none lately!).

I put 5-10 years there because someone under 5 years of experience could maybe not have ever run into a troublesome issue, or they always had a senior to do the hard stuff. Between 5 and 10 years, maybe they find their first tricky bug. After 10+ years, you've got to have run into something.

[1] Here's some war stories:

I fixed an interop issue between OpenSSL and Microsoft Schannel where RSA DHE would fail if the generated public key had leading zeros; OpenSSL would encode it in fewer bytes and Schannel would return 'out of memory'. The RFC was vague. People had observed the failures for years, but I had to fix it. At the time, it was considered a reasonable optimization to generate a DHE keypair and reuse it for the lifetime of the server process... If we generated a problematic keypair on a given server, Windows clients couldn't connect at all. Now, if I run into an issue and a working trace has structures of nice power-of-2 lengths and a broken trace has one a little smaller, that's where I dig.
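To make the leading-zeros case concrete, here is a minimal Python sketch of the two encodings. It is not OpenSSL's or Schannel's actual code: the function names are made up for illustration, and `MODULUS` is a 2048-bit stand-in rather than a real DH prime. It only shows how a public value whose top bytes happen to be zero comes out shorter under a minimal-length encoding than under one padded to the size of the modulus.

```python
def byte_length(n: int) -> int:
    """Number of bytes needed to hold n as an unsigned big-endian integer."""
    return max(1, (n.bit_length() + 7) // 8)


def encode_minimal(value: int) -> bytes:
    """Big-endian encoding with leading zero bytes dropped."""
    return value.to_bytes(byte_length(value), "big")


def encode_padded(value: int, modulus: int) -> bytes:
    """Big-endian encoding, left-padded with zeros to the modulus length."""
    return value.to_bytes(byte_length(modulus), "big")


# Assumption: a 2048-bit stand-in for the group modulus (not an actual prime).
MODULUS = (1 << 2048) - 1
# Assumption: a public value whose top two bytes happen to be zero.
PUBLIC = 1 << 2030

minimal = encode_minimal(PUBLIC)
padded = encode_padded(PUBLIC, MODULUS)

print(len(minimal), len(padded))  # prints "254 256"
```

Both byte strings decode to the same integer; only the length differs. That is the "one a little smaller than a power of 2" pattern to look for when comparing a working trace against a broken one.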

I found (but didn't patch) a bug in Firefox where POSTs to an HTTP/2 server with TLS 1.3 early data enabled would stall for about a minute when there was no connection to reuse. Fixing it was out of my league, but I was able to get it fixed by giving a clear bug report. This one was fairly new when I saw it, but there was a much less clear bug open against Thunderbird caused by the underlying issue. Not sure what I learned here really, other than if you're expecting data to be sent to the network and it doesn't happen, it's usually an application problem... and clear bugs with clear logs help get things fixed.

I fixed an issue with FreeBSD where it would resend the whole sendq when it received an ICMP 'needs frag' message, even when the maximum MTU sent in the ICMP was the same as or greater than the current path MTU. This was happening when a Linux router was using large receive offload to aggregate inbound packets on a flow and then they were too large to forward; that bug was fixed long before I experienced it, but the router in question never got updated. I could not get ahold of the operator for them to fix the broken machine, but I was able to get a patch into FreeBSD so that the broken router(s) only impacted our customers that were behind it. ... this is another indication that path MTU is hard, but also it helped me tune methods of sampling packets from production. PS, path MTU issues are their own repetitive problem space.
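The check described above can be sketched in a few lines of Python. This is not the FreeBSD patch, just one plausible shape of the logic, with hypothetical names (`PathState`, `should_resend_on_needs_frag`): only shrink the path MTU and resend queued data when the ICMP message advertises something smaller than what we already believe.

```python
from dataclasses import dataclass


@dataclass
class PathState:
    """Hypothetical per-connection state: our current path MTU estimate."""
    path_mtu: int


def should_resend_on_needs_frag(state: PathState, advertised_mtu: int) -> bool:
    """Decide how to react to an ICMP 'needs frag' message.

    Returns True (and lowers state.path_mtu) only when the advertised MTU
    is smaller than the current estimate. An advertised MTU that is equal
    to or larger than the estimate gives us nothing to shrink, so resending
    the queue would just repeat the same oversized traffic.
    """
    if advertised_mtu >= state.path_mtu:
        return False
    state.path_mtu = advertised_mtu
    return True


state = PathState(path_mtu=1500)
print(should_resend_on_needs_frag(state, 1500))  # prints "False"
print(should_resend_on_needs_frag(state, 1400))  # prints "True"
print(state.path_mtu)                            # prints "1400"
```

A real stack also has to validate that the ICMP message matches an in-flight segment and enforce a minimum MTU; those steps are left out here.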

That one time FreeBSD broke syncookies, so connections got reformed after close, and the TCP state was unsynchronizable between peers so they kept sending challenge ACKs... and IIRC, they broke it a second time, too. But maybe we just ran into it in a different context.

I've recently found some issues leading to out-of-order packet delivery with FreeBSD's dummynet traffic shaping; again, other people already experienced it, but nobody wrote a good bug report or submitted a patch, so I guess I'll have to do it, if it's still broken when I have time for it. This one is probably not going to be a repeating bug... there aren't a lot of traffic shapers, but maybe there will be something learned about scheduling I/O.

What processes could I use to avoid bugs like these? Hoping things magically get fixed in an update does sometimes work, and sometimes the bug becomes less relevant as the industry moves on (ECDHE has almost completely replaced RSA DHE, but it hadn't at the time that my customers ran into the bug).

1 comment

win311fwg 10 days ago

> So I have to debug other people's libraries

Been there, but then I soon learned to reduce dependencies to essentially none. Those that do make the cut need to be of high quality such that the authors of those libraries are also as perfect as you are. There is absolutely no need to depend on code written by those with <=10 years under their belt. The world is full of developers with 20+ years of experience.

> After 10+ years, you've got to have run into something.

Sure. And after 10+ years of flipping burgers, there will have been some pretty sweet lands. Who is going to remember, though? It is fine if you do. Everyone has their thing. But I'd say it is not exactly among the most memorable of events. It is not like time spent with your child, or something that actually has some kind of meaning. You even say you are semi-retired, so you must agree that things on the job don't really matter. If it did, why not dedicate every possible moment to it?
