| Sure. Here's the implementation of the three versions: https://gist.github.com/2821937
Here's how I was originally benchmarking things: https://gist.github.com/2821943
Here's benchmarking using the Go testing package: https://gist.github.com/2821947
Some of the performance I was seeing on my crappy benchmarker evaporates using Go's benchmarker. But there is something else afoot. Try changing 'runs', which controls the size of the inner loop (needed to get enough digits of precision):
From "runs = 128":
BenchmarkSplit1 1 2478184000 ns/op
BenchmarkSplit2 1 2787795000 ns/op
BenchmarkSplit3 1 2747341000 ns/op
From "runs = 32":
BenchmarkSplit1 1000000000 0.62 ns/op
BenchmarkSplit2 1000000000 0.68 ns/op
BenchmarkSplit3 1000000000 0.56 ns/op
Why did it suddenly jump from 1 billion outer loops to just 1? I think there is a bug in the Go benchmarker here, because if you take into account the factor-of-4 difference in work and then divide by 1 billion, it looks like the first set of numbers is actually consistent with the second and just isn't being scaled correctly.
Either way, the increase in performance is now only about 10%, which I agree isn't anything to write home about. More bizarre is that the bytes package one is faster for runs = 32 but not for runs = 128. I can't make head or tail of that, or why it should matter at all -- unless there is custom assembly in pkg/bytes that has odd properties inside that inner loop.
But this is only one half of my complaint: it's the interface that matters, and I see no good reason for having IndexByte but no CountByte and SplitByte, contrary to what you say about which is more fundamental. Having to construct a slice containing a single byte just to call Split and Count left me with a bad taste in my mouth. |
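To make the interface point concrete, here is a rough sketch of what single-byte variants could look like when built on bytes.IndexByte. CountByte and SplitByte are hypothetical names used for illustration only. They are not part of package bytes, and this is not necessarily how the versions in the gists above are written.

```go
package main

import (
	"bytes"
	"fmt"
)

// CountByte is a hypothetical helper (not in package bytes): it counts the
// occurrences of c in s without building a one-byte separator slice.
func CountByte(s []byte, c byte) int {
	n := 0
	for {
		i := bytes.IndexByte(s, c)
		if i < 0 {
			return n
		}
		n++
		s = s[i+1:]
	}
}

// SplitByte is a hypothetical helper (not in package bytes): it splits s
// around each occurrence of c. The returned subslices alias s.
func SplitByte(s []byte, c byte) [][]byte {
	out := make([][]byte, 0, CountByte(s, c)+1)
	for {
		i := bytes.IndexByte(s, c)
		if i < 0 {
			// No separator left: the remainder is the last field.
			return append(out, s)
		}
		out = append(out, s[:i])
		s = s[i+1:]
	}
}

func main() {
	data := []byte("a,b,,c")

	// The existing API needs a separator slice even for a single byte.
	fmt.Println(bytes.Count(data, []byte{','}), CountByte(data, ','))

	for _, part := range SplitByte(data, ',') {
		fmt.Printf("%q\n", part)
	}
}
```

The call sites are the whole point: `CountByte(data, ',')` versus `bytes.Count(data, []byte{','})`. Whether such helpers would also be measurably faster is a separate question that the numbers above don't settle.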