Hacker News new | ask | show | jobs
by CodeBrad 1896 days ago
In my opinion, the main limitation of TSan is that it cannot find all possible races. (An impossible task so I think TSan is a pretty great tool to get to the point that this is the main limitation)

As a dynamic tool, if a race is observed during execution by TSan, the race is very likely to be real. But it is impossible to "observe" execution for every part of a large code base, and so real races are likely still hiding in some edge case that TSan did not execute.

Static Analysis tools are interesting in that they have the opposite problem. Static tools can detect races across the entire code base, but without runtime information it is difficult to know if the races are real. This leads to more real races being detected, but also comes with more false positives.

There is a lot of interesting working being done on data race detection and how we can bring these two sides closer together to find more real races, without overwhelming developers with false positives.

2 comments

This is absolutely true and hence we combine not only our tests with TSan, but also fuzzing, to explore even more corner cases.

On the static vs. dynamic side, I would always opt for the dynamic when it can guarantee me no false positives, even if the results are incomplete. It is pretty much impossible to deploy a tool that produces lots of false positives because developers usually will reject it at some point and question every result.

> In my opinion, the main limitation of TSan is that it cannot find all possible races.

In my experience TSan finds all possible races if it's built with unit tests and unit tests provide enough coverage. If unit tests don't provide enough coverage then there are likely other bugs lurking beyond just data races.

I've never actually used TSan, so please correct me if I've missed something significant about how it works, but doesn't TSan need to observe the race happening in order to detect it?

The code coverage analysis I've seen in the past has only evaluates what code is run at least once, but how do you get coverage analysis to evaluate the coverage of what code runs concurrently with what other code effectively enough to get confidence that you'll detect data races?

For example, if you've got a cache that safely handles concurrent use for most operations, but has a data race when two threads both cache miss on the same key within a short window of each other, are there kinds of code-coverage analysis that could theoretically point out that none of your tests involve a cache miss of the same key in two different threads?

I've seen concurrency permutation testing tools, that can reorder the sequence of operations between different threads searching for a permutation that causes a data race, but that relies on actually executing the potentially-racy behaviour at all during testing, right?

I guess this is also what fuzzing is about, but are there tools or techniques you can use to increase your confidence in how well your tests cover the space of concurrent use of your data structures?

TSan uses run-time instrumentation, yes.
Unit tests rarely cover high-load concurrency scenarios. I used to think unit test coverage was important, but I've found using the type system to make errors impossible is a much more effective way of lowering the defect rate.
They're not mutually exclusive concepts. :)
Given the project management triangle they kind of are - the goal should be to achieve the required defect rate at minimum maintenance cost. A lot of people talk as though more tests or more verification is always a good thing, but zero defects is usually not a business priority if it means slowing down feature work. So adding more different kinds of verification should be a last resort, only when you actually require a lower defect rate than you can achieve with one method alone - which is pretty rare IME.
> the goal should be to achieve the required defect rate at minimum maintenance cost

I disagree. Maintenance cost doesn't need to be minimized. It needs to be reasonable.

The goal should be to produce high quality software. High quality software is easy to understand and easy to instrument with tools.

Much as I enjoy writing high quality software, it's irresponsible to spend time or money producing something higher quality than is justified by the needs of the business. Generally I find that unit tests for anything other than actual logic (which is a tiny fraction of most codebases) don't offer enough benefit to be worth the costs.

(I'd also argue that unit testing can reduce the quality of the software as it makes the internals of the architecture more costly to change).

Unit tests tend to be written to run without parallelism, or at least with less parallelism than the normal app has.
Not my unit tests. :-)
I agree, good code coverage can probably get most of the way there. But there are some problems with relying on coverage.

Coverage can be misleading. Maybe there is some function being executed in parallel:

  int global;
  void foo() {
    int local;
    
    if (complex_condition)
      bar(&local);
    else
      bar(&global);
  }
And assume maybe somewhere way down the call chain the data passed to bar is modified. The bar function and all of the functions called in bar can have 100% test coverage, but if the else branch is not covered, the race will be missed.

So without true 100% test coverage, it is possible races are being missed by dynamic tools, and true 100% test coverage is probably not possible or feasible in large complex projects.

> but if the else branch is not covered, the race will be missed.

Yes.

> true 100% test coverage is probably not possible or feasible in large complex projects.

I have yet to encounter a single large complex project where 100% test coverage is not possible. I have encountered such projects where 100% test coverage required massive refactoring of the code to facilitate testing and doing so resulting in fixing most customer complaints received.

So you truly believe that Firefox or Windows or Ubuntu could not only be 100% covered by unit tests, but that it would actually be worth taking the time to do so? (edit: typos)
As someone who has worked on the code of Firefox, I strongly doubt that.

For instance, pretty much every piece of code (whether native or JS) I've seen is programmed defensively, assuming that for some reason invariants can be broken - something that typically cannot happen unless there is a memory corruption involved. Testing all these defensive programming branches is not possible without introducing memory corruptions.

Yes, I firmly believe that Firefox, Windows, and Ubuntu can achieve 100% unit test coverage of software. I truly believe it would be worth its while.
As Yoric mentioned, that would require some custom Qemu patch to corrupt memory, some test boxes with dodgy RAM, and/or other things along those lines.

Removing those "impossible" code branches is a bad idea. "Impossible" things happen millions of times a day to Firefox.

I used to work for LimeWire, which had tens of millions of users. We were sometimes seeing bug reports of "impossible" stack traces. They went down dramatically when we added a splash screen that calculated SHA-1 hashes of all of the DLLs and JARs before using the reflection API to load and start the actual application. Given that the "corrupted" JARs were still presumably passing the JARs' CRC-32 checksums and bytecode verification at class load time, I think we were most often just detecting dodgy RAM that would later have caused "impossible" errors by corrupting bytecode or native code after bytecode verification.

Even disregarding tne very good points about code defending from memory errors mentioned by others, I imagine a realistic time frame for such coverage for a project the size of Firefox or Windows to be no less than a decade, with 0 other development going on. I find it absurd to imagine it would be worth it, even if it reduced bug count to 0 (which it wouldn't).