OpenTelemetry's Java implementation does this, but it does it in a way that lets non-gRPC code access the context as well: it propagates the context through both the CoroutineContext and the thread-local state that OpenTelemetry itself uses, so tracing information reaches Java code that is called from a Kotlin coroutine. For example: I handle a request and get the incoming context. I have to stash it, because I might run a coroutine that is suspended and resumed across different threads. That coroutine then issues another gRPC call through a Java library, and the call starts, gets rescheduled, and resumes on receiving the response on a different thread, possibly in a different thread pool.
The OpenTelemetry handling for this is quite complex: it needs to run as a javaagent so that it can instrument the underlying libraries with the code needed to handle thread scheduling and context switches in thread pools (e.g., ForkJoinPool), in threads themselves, including cooperative scheduling in application code (e.g., Thread), and in Kotlin's coroutine handling, which is mostly codegen (e.g., async, suspend fun).
Finally, in my own Ph.D. work, we did a similar thing to propagate trace identifiers for a dynamic analysis for fault injection. We quickly ran into a second problem: not only is the propagation difficult in itself, but you also risk running out of header space if you store any longish information when gRPC runs over HTTP/2, because of the maximum allowed header size.
Do you actually mean that the data is dropped unless it is explicitly propagated to a subsequent downstream RPC? If so, that's by design.
However, most large-scale organizations that do distributed tracing (e.g., Twitter, Uber) have invented their own design, reproduced OpenTelemetry's, or leveraged it directly for this precise thing.
Naive context propagation isn't (really) the difficult part with most of these designs. It's what you've done: use an interceptor, read the data, and assign it automatically on subsequent requests. The challenge is dealing with this under many different real-world conditions: a.) concurrency and thread scheduling; b.) not all services use the same version of the downstream RPC libraries; c.) not all calls are gRPC, and some use HTTP (with different HTTP libraries, at that); and d.) you cross message-passing boundaries, i.e., I receive a request, write it to a Kafka queue or a reliable workflow backend (e.g., Cadence, Temporal), re-read the request later, and then execute a subsequent RPC as a result of that message.
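To make case d.) concrete, here is a minimal sketch in plain Java. Everything in it is a stand-in I made up for illustration: the `x-trace-id` header name, the `Message` envelope, and an in-memory `BlockingQueue` playing the role of Kafka. It is not how any particular tracing library does it. The point is only that the context has to be written into the message on the producer side and restored on the consumer side, because the consumer runs on a different thread (or process) that never saw the original request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueueBoundaryDemo {
    // Stand-in for the per-request context an interceptor would populate.
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();
    // Hypothetical header name, chosen only for this example.
    static final String HEADER = "x-trace-id";

    // Message envelope: a payload plus string headers.
    static final class Message {
        final String payload;
        final Map<String, String> headers = new HashMap<>();
        Message(String payload) { this.payload = payload; }
    }

    // Producer side: copy the current thread's context into the message headers.
    static Message inject(String payload) {
        Message m = new Message(payload);
        String id = TRACE_ID.get();
        if (id != null) m.headers.put(HEADER, id);
        return m;
    }

    // Consumer side: restore the context from the headers, do the work, clean up.
    static String consume(Message m) {
        TRACE_ID.set(m.headers.get(HEADER));
        try {
            // A real handler would issue the downstream RPC here.
            return "rpc for " + m.payload + " with trace " + TRACE_ID.get();
        } finally {
            TRACE_ID.remove();
        }
    }

    // Request thread enqueues; a separate consumer thread dequeues and handles.
    static String roundTrip() {
        BlockingQueue<Message> queue = new LinkedBlockingQueue<>();
        String[] result = new String[1];
        try {
            TRACE_ID.set("trace-123");
            queue.put(inject("order-42"));
            TRACE_ID.remove();
            Thread consumer = new Thread(() -> {
                try {
                    result[0] = consume(queue.take());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumer.start();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());
    }
}
```

If `inject` is skipped, the consumer thread has nothing to restore and the downstream RPC goes out without a trace identifier, which is exactly the silent drop described above.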
If you're using Kotlin, I suspect you will run into these challenges. Tune your thread pools up or down, restrict your JVM's resources, and you'll suddenly see that if the code that handles the request runs on different threads/coroutines/etc. than the code block that issues the downstream RPC, you'll start dropping the context unless you handle that case explicitly.
In fact, a very simple test case in Java, where several concurrently executed CompletableFuture instances each issue RPCs on a very small thread pool, should be enough to see the issue.
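A rough sketch of such a test, under some loud assumptions: the "context" is a plain `ThreadLocal<String>`, `fakeRpc` is a hypothetical stand-in for an interceptor-backed gRPC call that reads that thread-local, and `withContext` is a hand-rolled wrapper, not an OpenTelemetry or gRPC API. Real setups are messier, but this is enough to show the drop and one way to handle it explicitly.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ContextLossDemo {
    // Stand-in for the per-request context an interceptor would populate.
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Hypothetical downstream RPC: it sees only the current thread's context.
    static String fakeRpc() {
        String id = TRACE_ID.get();
        return id == null ? "missing" : id;
    }

    // Captures the caller's context now and reinstalls it on whichever
    // thread eventually runs the task, restoring the old value afterwards.
    static <T> Supplier<T> withContext(Supplier<T> task) {
        String captured = TRACE_ID.get();
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                return task.get();
            } finally {
                if (previous == null) TRACE_ID.remove(); else TRACE_ID.set(previous);
            }
        };
    }

    // Issues eight concurrent "RPCs" on a two-thread pool and collects what each saw.
    static List<String> run(boolean propagate) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            TRACE_ID.set("trace-123"); // the "incoming request" context
            Supplier<String> call = ContextLossDemo::fakeRpc;
            Supplier<String> task = propagate ? withContext(call) : call;
            List<CompletableFuture<String>> futures = IntStream.range(0, 8)
                .mapToObj(i -> CompletableFuture.supplyAsync(task, pool))
                .collect(Collectors.toList());
            return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("without propagation: " + run(false));
        System.out.println("with propagation:    " + run(true));
    }
}
```

Without the wrapper, every call runs on a pool thread that never had the thread-local set, so all eight report the context as missing. With the wrapper, all eight see `trace-123`. The javaagent approach effectively does this wrapping for you inside the executor and library code, which is why it has to instrument so much.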