
by cmeiklejohn 979 days ago

OpenTelemetry's Java implementation does this, and it does it in a way that lets non-GRPC things access the context as well: it propagates the context through both the CoroutineContext and the thread-local state that OpenTelemetry itself uses, so tracing information reaches Java code that is called from a Kotlin coroutine.

e.g., I handle a request, get the incoming context, and have to stash it because I might execute a coroutine that is suspended/resumed across different threads, and then execute another GRPC call in a Java library that happens to start, get rescheduled, and resume on receiving the response on a different thread, possibly in a different thread pool.

The OpenTelemetry handling for this is quite complex: it must be used as a javaagent so it can actually instrument underlying libraries with the code needed to handle thread scheduling/context switches in thread pools (e.g., ForkJoinPool), in threads themselves where application code schedules cooperatively (e.g., Thread), and in Kotlin's coroutine handling, which is mostly codegen (e.g., async, suspend fun).
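To make the capture-and-restore idea concrete, here is a minimal JDK-only sketch. It is not OpenTelemetry's code: the `TRACE_ID` thread-local, the `wrap` helper, and the class name are all hypothetical stand-ins, and a real agent has to inject something like this into every scheduling point rather than rely on the caller to wrap tasks by hand.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

class ContextWrapDemo {
    // Hypothetical stand-in for the tracing context kept in thread-local state.
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    static String currentTraceId() {
        return TRACE_ID.get();
    }

    // Capture the caller's context now; install it on whichever thread runs the task,
    // then put back whatever that thread had before.
    static <T> Supplier<T> wrap(Supplier<T> task) {
        final String captured = TRACE_ID.get();
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                return task.get();
            } finally {
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }

    // Sets a trace id on the calling thread and reads it back from a pool thread.
    static String runWrapped(String traceId) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            TRACE_ID.set(traceId);
            Supplier<String> task = wrap(ContextWrapDemo::currentTraceId);
            return CompletableFuture.supplyAsync(task, pool).join();
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runWrapped("trace-123")); // prints "trace-123"
    }
}
```

The hard part is not this wrapper, it is making sure every executor, callback, and coroutine continuation goes through one without the application author having to remember.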

Finally, in my own Ph.D. work, we did a similar thing to propagate trace identifiers for a dynamic analysis for fault injection, and we quickly ran into a problem: not only is the propagation difficult in itself, but you also run the risk of running out of header space if you store any (longish?) information when GRPC is run over HTTP/2, because of the maximum allowed header size.


cmeiklejohn 979 days ago

I'm also not sure what you mean when you say the context doesn't propagate between containers and/or pods -- GRPC isn't aware of these Docker/Kubernetes aspects at all.

Do you actually mean that unless explicitly propagated to a subsequent downstream RPC the data is dropped? If so, that's by design.

However, most large-scale organizations that are doing distributed tracing (e.g., Twitter, Uber) have either invented, reproduced, or leveraged OpenTelemetry's design for this precise thing.

Naive context propagation isn't (really) the difficult part with most of these designs -- it's what you've done: using an interceptor, reading the data, and assigning it automatically on subsequent requests. The challenge is dealing with this under many different real-world conditions: a.) concurrency and thread scheduling; b.) not all services use the same version of downstream RPC libraries; c.) not all calls are GRPC, and some use HTTP (and different HTTP libraries, at that); and d.) you cross message-passing boundaries, i.e., I receive a request, write to a Kafka queue or reliable workflow backend (e.g., Cadence, Temporal), re-read the request, and then execute a subsequent RPC as a result of that message.

If you're using Kotlin, I suspect you will run into these challenges. Tune your thread pools up/down, restrict your JVM's resources, and you'll suddenly see that if the thing that handles the request uses different threads/coroutines/etc. than the code block that issues the downstream RPC, you'll start dropping the context unless you handle that case explicitly.

In fact, a very simple test case in Java, where you use several concurrently executed CompletableFutures that each issue RPCs in a very small thread pool, should be enough to see the issue.
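A rough sketch of what such a test could look like, using only the JDK. The assumptions are loud ones: a plain `ThreadLocal` named `TRACE_ID` stands in for the propagated context, and the hypothetical `fakeRpc` method stands in for a real downstream call by simply reporting what context it can see.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ContextLossDemo {
    // Hypothetical stand-in for a context stored in thread-local state.
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Hypothetical stand-in for a downstream RPC: returns the trace id
    // visible on the thread that issues the call.
    static String fakeRpc() {
        return TRACE_ID.get();
    }

    // Sets the context on the request thread, then issues `tasks` fake RPCs
    // from a two-thread pool and collects what each one saw.
    static List<String> run(int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            TRACE_ID.set("trace-123");
            List<CompletableFuture<String>> futures = IntStream.range(0, tasks)
                .mapToObj(i -> CompletableFuture.supplyAsync(ContextLossDemo::fakeRpc, pool))
                .collect(Collectors.toList());
            return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // The pool threads never had TRACE_ID set, so every call sees null.
        System.out.println(run(4)); // prints [null, null, null, null]
    }
}
```

With a real interceptor the symptom is the same: the handler thread has the context, the thread that issues the downstream RPC does not, and the metadata is silently dropped.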

reidbuzby 979 days ago

Thanks for the comment, it seems like you know a lot more about this than I do.

This was a solution that has worked well for my company that averages <1 req/s, so yes I have not tested it under more extreme conditions. This is version 1.0.0 so it is quite new and naive by design. I was posting here to get some feedback on the initial version and see how I can improve it, which you have given me!

Feel free to contribute to the project! It seems like your expertise applies nicely!

anuraaga 979 days ago

It's too bad that there is still no JEP for an official context in Java.

Disclaimer: I wrote the OTel one, and am sad to see yet more context implementations being made. All of these, including the OTel one, really need to be on the way out.