by jzelinskie 1846 days ago
I hadn't realized that Gogo was in such a bad spot with the upstream Go protobuf changes. There was lots of drama when the changes were made and I guess that overshadowed any visibility I had into Gogo.

Making vtprotobuf an additional protoc plugin seems like the Right Thing™, although it's a shame how complicated protoc commands end up becoming for mature projects. I'm pretty tempted to port Authzed over to this and run some benchmarks -- our entire service requires e2e latency under 20ms, so every little bit counts. The biggest performance win is likely just having an unintrusive interface for pooling allocated protos.
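
For readers wondering what pooling allocated protos might look like, here is a minimal sketch using the standard library's sync.Pool. The Event type, its Reset method, and the handle function are hypothetical stand-ins for a generated message and a request handler; this is not vtprotobuf's generated API, just the general idea.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical stand-in for a generated protobuf message.
type Event struct {
	ID      int64
	Payload []byte
}

// Reset clears the fields but keeps the Payload backing array for reuse.
func (e *Event) Reset() {
	e.ID = 0
	e.Payload = e.Payload[:0]
}

// eventPool hands out reusable *Event values instead of allocating new ones.
var eventPool = sync.Pool{
	New: func() any { return new(Event) },
}

// handle simulates decoding one request into a pooled message and
// returns the payload length it saw.
func handle(id int64, wire []byte) int {
	ev := eventPool.Get().(*Event)
	defer func() {
		ev.Reset() // never return a message that still holds request data
		eventPool.Put(ev)
	}()

	ev.ID = id
	ev.Payload = append(ev.Payload, wire...) // reuses capacity when available
	return len(ev.Payload)
}

func main() {
	for i := int64(0); i < 3; i++ {
		fmt.Println(handle(i, []byte("hello")))
	}
}
```

The catch with any pooling scheme is ownership: a message must not be referenced after it goes back to the pool, which is why an unintrusive interface matters.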


Proto message unmarshal in Go for a small message should be 5 orders of magnitude below 20ms, shouldn't even begin to matter until you are sweating individual microseconds.
That's true if your program only does a single unmarshal at a time at a leisurely pace. And in a steady-state situation, the garbage left behind by each individual unmarshal call has to be paid for by some poor future request.

I agree it's unlikely the difference here will be solely responsible for tipping the GP's request above 20ms, but the memory problems could reasonably ruin tail latencies.

The significance of 20ms isn't clear so this is hard to judge.

Perhaps they have significant external (network) latency leaving only a few ms budget for the application stack - so they could easily be up against a wall.

Until the GC kicks in and steals a full 200usec + a bunch of your throughput...

(Holy shit, who is downvoting this? It's literally the whole article!)

If your path is sensitive to 200us of latency you should probably optimize your application and tune your GC. Typically 200us for freeing all unreachable memory is not a big deal.
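
As a rough sketch of what measuring before tuning could look like, the snippet below uses runtime/debug to change the GC target and read back recorded pause times. The churn and maxPause helpers and the value 200 are assumptions chosen for illustration, not recommendations.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

var sink [][]byte // package-level so the buffers escape to the heap

// churn allocates n short-lived 1 KiB buffers to generate garbage.
func churn(n int) {
	for i := 0; i < n; i++ {
		sink = append(sink[:0], make([]byte, 1024))
	}
}

// maxPause forces a collection, then returns the longest pause in the
// runtime's recent pause history and the total number of collections.
func maxPause() (time.Duration, int64) {
	runtime.GC()
	var stats debug.GCStats
	debug.ReadGCStats(&stats)
	var longest time.Duration
	for _, p := range stats.Pause {
		if p > longest {
			longest = p
		}
	}
	return longest, stats.NumGC
}

func main() {
	// Illustrative value only: a higher percentage trades memory for
	// fewer collections. SetGCPercent returns the previous setting.
	prev := debug.SetGCPercent(200)
	fmt.Println("previous GC percent:", prev)

	churn(100000)
	longest, n := maxPause()
	fmt.Printf("collections: %d, longest pause: %v\n", n, longest)
}
```

Pause time is only part of the cost; the concurrent phases of the collector also consume CPU that these numbers do not show.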
> If your path is sensitive to 200us of latency you should probably optimize your application and tune your GC.

okay, you've done this, three years later and it's the same thing again since you need to accommodate the new features. your users haven't upgraded their computers. what do you do?

Run a profiler and optimize again.
Your original code is already optimized as much as is possible outside of the things mentioned by OP
Properly written Go code (or even Java for that matter) will try to minimize allocations. For Java, unless I am mistaken pause-less GC is only offered by Azul - $$
>or even Java

Just in case you may be unaware, the latest GCs for Java (Shenandoah, ZGC) are miles ahead of anything available for Go due to sheer age and manpower. Parallel and Pauseless are easily achievable in most cases.

> Latest GCs for Java (Shenandoah, ZGC) are miles ahead of anything available.

Beyond hyperbole, do you have any actual comparison of Go vs Java GC performance?

Java's GC is better but Go's GC is also parallel and "pauseless" - iirc ZGC is 50-500usec which is comparable to Go's target 200usec.

The point is, neither is "five orders of magnitude" below 20ms. And neither needs zero CPU even if it doesn't block other threads.

Yeah, the whole point of the article is that gRPC v2 (and frankly v1 for that matter) are not “properly written” to do this.
3% regression in QPS, 20% regression in CPU, and 5% regression in memory usage according to the article. Those are considerably worse than "5 orders of magnitude below".
GP meant 5 orders of magnitude below "20 ms". 20 ms is a lot of time.

There is nothing one can do to, say, a 1-kilobyte buffer that will take more than 1 ms in any language. My own Go code doesn't take more than a few microseconds per message.

GP's root claim is that protobuf serialization/deserialization performance shouldn't matter, on an article where a user is specifically demonstrating that it does matter.
The use case described in the article and the use case described in the top post in this thread aren't the same. If you aren't throughput bound, a 5% regression in parse speed doesn't matter if your goal is to stay under 20ms and parsing takes 17 us. Sure, it now takes 19 us, which is a regression of 2 us out of 20ms, or 1/10000th of your time.
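
The arithmetic here can be checked in a few lines. This sketch uses the 17 us, 19 us and 20ms figures from the comment; budgetShare is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"time"
)

// budgetShare returns the fraction of the latency budget consumed by
// the extra time between the old and new cost of an operation.
func budgetShare(before, after, budget time.Duration) float64 {
	return float64(after-before) / float64(budget)
}

func main() {
	before := 17 * time.Microsecond
	after := 19 * time.Microsecond
	budget := 20 * time.Millisecond

	share := budgetShare(before, after, budget)
	fmt.Printf("extra parse time: %v\n", after-before)
	fmt.Printf("share of budget:  %.4f%% (about 1/%.0f)\n", share*100, 1/share)
}
```

Going from 17 us to 19 us is closer to a 12% slowdown than a 5% one, and it still amounts to one ten-thousandth of the budget.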
> our entire service requires e2e latency under 20ms

Why are you using Go then?

20ms is a pretty considerable amount of time WRT E2E transaction time in today's world. Can you expand on your concerns with Go?
It's not really suitable for latency-critical applications.

EDIT: Fixed unfortunate typo

You can 100% write services with P999 < 20ms in Go. Not even trying that hard. Go is entirely suitable for this kind of constraint; I dare say that's Go's main target.

P99 < 1ms, that's when you're going to want to switch it up.

Depending on workload, Go also does sub-1ms p99 pretty easily. I'm getting sub-1ms p99.9.
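
Since p99 and p99.9 figures are being traded back and forth, here is a small sketch of how such numbers might be computed from raw latency samples using the nearest-rank method. The sample data in demoSamples is invented purely for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100).
// It sorts a copy so the caller's slice is left untouched.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// demoSamples builds made-up data: 980 fast requests and 20 slow ones.
func demoSamples() []time.Duration {
	samples := make([]time.Duration, 0, 1000)
	for i := 0; i < 980; i++ {
		samples = append(samples, 300*time.Microsecond)
	}
	for i := 0; i < 20; i++ {
		samples = append(samples, 5*time.Millisecond)
	}
	return samples
}

func main() {
	s := demoSamples()
	fmt.Println("p50:  ", percentile(s, 50))
	fmt.Println("p99:  ", percentile(s, 99))
	fmt.Println("p99.9:", percentile(s, 99.9))
}
```

With data shaped like this the median looks excellent while the tail is more than ten times slower, which is why the thread keeps distinguishing average latency from p99.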
What are the proposed solutions to get better than that? C/Rust code? Assembly?
was the double-negative intentional? I've used Go for sub-millisecond needs. So 20ms seems like it would be a reasonable choice from where I'm sitting.
It was not intentional, thanks for asking...very unfortunate typo ;)

Go doesn't give you control over inline vs indirect allocation, instead relying on escape analysis, which is notoriously finicky. Seemingly unrelated changes, along with compiler upgrades, can ruin your carefully optimized code.
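
To make the escape analysis point concrete, here is a minimal sketch of two nearly identical functions where only one allocates. The point type and function names are hypothetical; the expectation that the first stays on the stack assumes the standard gc compiler, and other versions or compilers may decide differently.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// sink is package-level: storing a pointer here forces the pointee
// onto the heap because it outlives the function call.
var sink *point

// sumLocal never lets p leave the function, so the compiler is free
// to keep it on the stack.
func sumLocal(a, b int) int {
	p := point{a, b}
	return p.x + p.y
}

// sumEscaping differs by one line: publishing &p makes p escape.
func sumEscaping(a, b int) int {
	p := point{a, b}
	sink = &p
	return p.x + p.y
}

// allocsPerCall reports the average number of heap allocations per call.
func allocsPerCall(f func(int, int) int) float64 {
	return testing.AllocsPerRun(1000, func() { f(1, 2) })
}

func main() {
	fmt.Println("local:   ", allocsPerCall(sumLocal))
	fmt.Println("escaping:", allocsPerCall(sumEscaping))
}
```

Building with `go build -gcflags=-m` prints the compiler's escape decisions for each function, which is one way to see which variables were moved to the heap.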

This is especially heinous because it uses a GC; unnecessary allocations have a disproportionately large impact on your application performance. One or the other wouldn't be nearly as bad.

Time and time again we see reports from organizations/projects with perfectly fine average latency, but horrendous p95+ times, when written in Go - some going as far as to do straight-up insane optimizations (see Dgraph) or rewrite in other languages.

But you think this impacts a 20ms budget? It’s mostly trivial to get sub 20ms p99 in Go.
While escape analysis in Go is finicky, you can make it part of the CI/CD to keep it under control.

https://medium.com/a-journey-with-go/go-introduction-to-the-...

No different than running other kinds of static analysis for well known languages, unsafe by default.

I don't know, I'm able to get 150k grpc q/sec with p99 sub 1ms. It's def better than G1 and CMS.