Hacker News new | ask | show | jobs
by jrockway 2244 days ago
It is a data race. I'm guessing the race detector (go test -race) didn't detect it because they are layering multiple synchronization primitives (mutexes, channel i/o, and a WaitGroup) and their tests hit the "good" code path but production workloads didn't.

Here's what happens. Delete takes a ResourceList. It delegates to "perform" and then "batchPerform". perform calls batchPerform in a separate goroutine, which calls a helper function in another goroutine for every resource in the ResourceList. The helper function is defined in Delete and updates a data structure defined in Delete. This is a classic case where some synchronization is necessary. The function runs multiple times in multiple goroutines, and updates a single shared structure. (Perhaps not obvious because it delegates to two helper functions, and the list that the function is executed on is a "ResourceList" not a []Resource, so it isn't clear that there is a "for { go func() }" loop anywhere; the programmers did their best to make it non-obvious that a loop is occurring.)

The confounding factor here is that batchPerform tries to synchronize with a WaitGroup, but it's faulty and not enough to protect the data integrity. batchPerform creates a WaitGroup, but only calls Wait() on the WaitGroup when the "kind" of an individual resource is not equal to the "kind" passed to batchPerform. I am guessing that it's very natural to craft some test data where this condition is met, and the for loop in batchPerform only runs the function once at a time (perhaps a ResourceList of length 1). In that case, there is no race condition for the race detector to detect.

All in all, if I were reviewing this code, it would not be checked in its current form. Splitting perform and batchPerform doesn't make sense to me, and they both implement faulty synchronization logic in a slightly different way. (batchPerform uses "for { wg.Add(); go f() }; wg.Wait", perform does "for range x { go func() { ch <- f() }() }; for range x { <- ch }". I consider these pretty much exactly equivalent, but neither prevents f() from running concurrently with itself. The only reason this passed the race checker is because batchPerform doesn't actually use the WaitGroup in the normal way, instead degrading to "for range x { wg.Add(); go f(); wg.Wait() }", which DOES prevent f() from running concurrently with itself, with certain inputs.

The root cause is that the caller of Delete isn't really sure about the semantics of "perform". Does it protect the body of the callback function? There is no documentation, and the author thought "yes". But the answer was "no". In general, the convention in go is to consider something thread-unsafe unless it's marked as thread safe. When you see something like "var foo Foo; f(list, func(bar){ foo = bar })" your spidey sense should be concerned about synchronization. But in this case, the code went out of its way to hide the existence of a loop and the existence of parallel processing, and so the programmer made a mistake. A bug or at least VERY confusing use of WaitGroup in batchPerform allowed the tests to pass. Should the compiler detect this? It would be nice. But a code reviewer should have been super concerned about this implementation.