Hacker News new | ask | show | jobs
by ernado 2245 days ago
> we caught a significant race condition

It is a data race, not a race condition.

> and which passed the race checker for Go

No, it is not. https://github.com/helm/helm/pull/7820#issuecomment-60436062...

There is a comment by issue author which is literally a go data race detector warning. Like "WARNING: DATA RACE".

4 comments

Data races are a kind of race condition, no?
I can convince myself that data races need not be a race condition. Consider this simple program:

    var i int
    doneCh := make(chan struct{})
    go func() { i = 1; doneCh <- struct{}{} }() // a
    go func() { i = 1; doneCh <- struct{}{} }() // b
    <-doneCh
    <-doneCh
At the end of the program, i is always equal to 1 no matter which order a or b wrote to i. But it's a race because you are assigning to a shared variable without synchronization. A small modification to the program creates a race condition:

    var i int
    doneCh := make(chan struct{})
    go func() { i = 1; doneCh <- struct{}{} }() // a
    go func() { i = 2; doneCh <- struct{}{} }() // b
    <-doneCh
    <-doneCh
Is i 1 or 2? It depends.

It is correct for the race checker to complain about the first program, because after a bit of hacking the first program can very easily change into the second program.

(And I tried it, and it does complain.)

I wasn't sure. After a bit of research, this seems to be a debate [1]. Using the common definitions, it's possible to have a data race that doesn't cause a race condition. [2] It's also possible to have a race condition without a data race.

[1] https://en.wikipedia.org/wiki/Race_condition#Data_race

[2] https://blog.regehr.org/archives/490

I would argue that "a data race that doesn't cause a race condition" is still, itself, a tiny race condition- just a contained one.

But you're right, this is just choice of terminology. :)

Also, to be clear, by “we” they really mean “a contributor”
IIUC, the point is that the code has been in prod for a year, but the race detector only just now found the bug? But I could be wrong.
It is right, race detector is not enabled by default and you should explicitly run tests with it or tell compiler to enable it - it is not compile-time, but run-time.

But still, it detects this error.

It is a data race. I'm guessing the race detector (go test -race) didn't detect it because they are layering multiple synchronization primitives (mutexes, channel i/o, and a WaitGroup) and their tests hit the "good" code path but production workloads didn't.

Here's what happens. Delete takes a ResourceList. It delegates to "perform" and then "batchPerform". perform calls batchPerform in a separate goroutine, which calls a helper function in another goroutine for every resource in the ResourceList. The helper function is defined in Delete and updates a data structure defined in Delete. This is a classic case where some synchronization is necessary. The function runs multiple times in multiple goroutines, and updates a single shared structure. (Perhaps not obvious because it delegates to two helper functions, and the list that the function is executed on is a "ResourceList" not a []Resource, so it isn't clear that there is a "for { go func() }" loop anywhere; the programmers did their best to make it non-obvious that a loop is occurring.)

The confounding factor here is that batchPerform tries to synchronize with a WaitGroup, but it's faulty and not enough to protect the data integrity. batchPerform creates a WaitGroup, but only calls Wait() on the WaitGroup when the "kind" of an individual resource is not equal to the "kind" passed to batchPerform. I am guessing that it's very natural to craft some test data where this condition is met, and the for loop in batchPerform only runs the function once at a time (perhaps a ResourceList of length 1). In that case, there is no race condition for the race detector to detect.

All in all, if I were reviewing this code, it would not be checked in its current form. Splitting perform and batchPerform doesn't make sense to me, and they both implement faulty synchronization logic in a slightly different way. (batchPerform uses "for { wg.Add(); go f() }; wg.Wait", perform does "for range x { go func() { ch <- f() }() }; for range x { <- ch }". I consider these pretty much exactly equivalent, but neither prevents f() from running concurrently with itself. The only reason this passed the race checker is because batchPerform doesn't actually use the WaitGroup in the normal way, instead degrading to "for range x { wg.Add(); go f(); wg.Wait() }", which DOES prevent f() from running concurrently with itself, with certain inputs.

The root cause is that the caller of Delete isn't really sure about the semantics of "perform". Does it protect the body of the callback function? There is no documentation, and the author thought "yes". But the answer was "no". In general, the convention in go is to consider something thread-unsafe unless it's marked as thread safe. When you see something like "var foo Foo; f(list, func(bar){ foo = bar })" your spidey sense should be concerned about synchronization. But in this case, the code went out of its way to hide the existence of a loop and the existence of parallel processing, and so the programmer made a mistake. A bug or at least VERY confusing use of WaitGroup in batchPerform allowed the tests to pass. Should the compiler detect this? It would be nice. But a code reviewer should have been super concerned about this implementation.