| Most of the bugs categorized as hard here seem to be hardware-related, i.e. not a programmer's software mistake. Here are a couple that were not. a) I used to work on deep packet inspection (DPI) software for a multicore network processor. The language was a kind of C, but with restricted APIs and some unique multicore concepts. One of those concepts: the same binary ran on multiple cores to process packets, yet with no hardware locks, because there was an implicit tag, a kind of hash computed on the 5-tuple (src/dst IP, ports, protocol), which ensured that only one core got packets from a given session / 5-tuple. The scenario was a protocol parser whose job was to parse some other info along with the IP and call an external API to add a subscriber. When this parser ran for 10-15 minutes on a live setup, it would segfault after processing some 60-70 million packets. The behavior was reproducible, but it did not occur at the same time or in the same piece of code. Narrowing it down didn't really work: the crash stopped when either the subscriber-addition API call or the parser was commented out, and each worked perfectly on its own. Finally, after a couple of weeks of long debug cycles and notes, it turned out to be AN IMPLICIT tag switch inside the subscriber-addition API. Since we were not locking around API calls, the tag switch could lead to the same packet being sent to multiple cores, and anywhere in the follow-up code an allocation (now redundant), a shared memory access, or a deletion (free) could turn into a segfault. The implicit switch of locks in the subscriber API was a documented and necessary feature of the hardware. It just should have been DOCUMENTED in BOLD on the API itself, which was not the case. b) In the same DPI product, we once added two fields to look for in the incoming traffic; they should not have matched, but they still showed up as matches in the results. 
The odd thing was that they only failed when used together and worked fine independently. Digging deeper into their code turned up a strncpy, intended as a safer replacement for strcpy, but called with MAX_STRING_SIZE as the length. So when the actual string was much shorter, it zero-padded the buffer out to the full length, thereby overwriting the previously appended fields to look for. The author seemed to have missed the following note in strncpy's definition: "If the end of the source C string (which is signaled by a null-character) is found before num characters have been copied, destination is padded with zeros until a total of num characters have been written to it." Since then, I have been really careful about choosing strncpy over strcpy, as is often mistakenly advised in general. |
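
The padding behavior in (b) is easy to reproduce in isolation. Below is a minimal sketch, not the product's actual code: the buffer layout, the value of MAX_STRING_SIZE, the FIELD_B_OFFSET constant, and both function names are assumptions made up for illustration. It stores a second field partway into a buffer, then writes a short first field with strncpy using the full buffer size, and compares that with a copy that writes only the bytes it needs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_STRING_SIZE 32 /* assumed value, for illustration only */
#define FIELD_B_OFFSET  16 /* hypothetical position of the second field */

/* Returns 1 if the second field is still intact after writing the first
 * field with strncpy(..., MAX_STRING_SIZE), 0 if it was wiped. */
int field_b_survives_strncpy(void)
{
    char buf[MAX_STRING_SIZE] = {0};

    /* The second field was appended earlier, at a fixed offset. */
    memcpy(buf + FIELD_B_OFFSET, "field_b", sizeof "field_b");

    /* Copies 7 characters, then zero-pads until MAX_STRING_SIZE bytes
     * in total have been written, which covers the second field too. */
    strncpy(buf, "field_a", MAX_STRING_SIZE);

    return strcmp(buf + FIELD_B_OFFSET, "field_b") == 0;
}

/* Same scenario, but the copy is bounded by the space reserved for the
 * first field and writes nothing beyond its own terminator. */
int field_b_survives_bounded_copy(void)
{
    char buf[MAX_STRING_SIZE] = {0};
    const char *src = "field_a";
    size_t n = strlen(src);

    memcpy(buf + FIELD_B_OFFSET, "field_b", sizeof "field_b");

    if (n >= FIELD_B_OFFSET)
        n = FIELD_B_OFFSET - 1; /* truncate, leaving room for '\0' */
    memcpy(buf, src, n);
    buf[n] = '\0';

    return strcmp(buf + FIELD_B_OFFSET, "field_b") == 0;
}
```

Called from a test harness, the first function reports that the second field was wiped and the second reports that it survived. The takeaway is that the count passed to strncpy is the number of bytes it will always write, not just an upper bound on what it copies.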