Hey everyone, I've recently created an open source library on top of grpc-kotlin and grpc-java that allows you to propagate a context across microservice boundaries throughout an entire request lifetime. The existing io.grpc.Context (https://grpc.github.io/grpc-java/javadoc/io/grpc/Context.htm...) only propagates a context across API boundaries within the same container and does not cross microservice boundaries.
It avoids polluting your request classes with side channel details that your service may not even be concerned with at all; it makes for nice abstractions.
It allows for the opaque propagation of side channel information, where intermediates don't need to know what's there.
A side channel with well-defined propagation rules is pretty useful.
Adding hidden data to the request seems counter to the desire for a schema-driven RPC. I guess it's a matter of taste whether some data is pollution or simply explicit. You're sending data that the service may or may not be concerned with, whether it's in the header, the body, or some other side channel.
Maybe my reaction is a question of purity. I don't really see why one would think a request body should be in the schema but we should leave other data out of the schema. Wouldn't every single argument point to consistency?
I like gRPC and what it gives you. I personally would like that same explicit schema and type safety to apply to my tracing as well. It's interesting to me that others would draw a line.
Debug/tracing data is a great use case for context, for example. If propagating trace information has to be done manually, then you significantly weaken the utility of cross-service tracing. You'll always be questioning if lack of a trace means it didn't happen, or if it means someone forgot to pass along the trace context.
This applies especially to gRPC, where everything is optional.
To be clear, I prefer explicit over implicit. But that doesn't always scale well to large orgs.
I think what's throwing me off is that the examples aren't just transparent things like debug/tracing. They mention credentials and user ids, which the app code can observe and rely on.
The combination of out-of-band/out-of-schema data with APIs to set and get from app code seems like a bad match.
I'd prefer if it was completely invisible to app code and existed only in interceptors.
While I agree, gRPC is based on protobufs, which are incredibly weakly typed. They specifically include mechanisms for passing through mysterious (not in the schema and of unknown type) data intact.
So, I’d argue the OP is using gRPC correctly, and I’d be using it wrong if I hadn’t given up on it long ago. I appreciate things like static type checking, and services that err on the side of rejecting unparsable requests in order to avoid data corruption. Protobufs/gRPC require heroic effort on the part of the RPC handler implementation if you want those things. In particular, it is easier to implement your own serialization format than typecheck the results returned by the APIs protoc emits.
You do get the same explicit schema and type safety with this library. The context values can be typed using a protobuf message: https://github.com/konigsoftware/konig-kontext#protobuf-mess..., or any other type you'd like. Although you are right, it is not explicitly in the request schema but that's kinda the point.
Because every single handler in a long chain of requests will have to explicitly support and propagate this data. The whole idea of a request context is to let you plumb low-cost metadata transparently through a call tree, and decouple your code from that metadata.
Ditto what's been said before. It can be very cumbersome to update multiple request/response types with new data, especially if an intermediary service has no use for the data and is simply passing it through.
The header is fixed as konig-kontext-grpc-context [1]. Does that mean that you can only propagate one gRPC context object? Would it be possible to specify your own header string for the key, and have the interceptors support multiple keys? That sounds safer, since the header name could imply the serialization, the content type, the Protobuf type_uri, etc. It also sounds more useful, since developers wouldn't need a catch-all type.
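To make the multi-key suggestion concrete, here is a minimal sketch in plain Java. It is not the library's API: `Key`, the `x-example-*` header names, and the `put`/`get` helpers are all invented for illustration, and a `Map` stands in for gRPC metadata.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class MultiKeySketch {
    // Hypothetical typed key: its own header name plus its own (de)serialization.
    static final class Key<T> {
        final String header;
        final Function<T, String> encode;
        final Function<String, T> decode;

        Key(String header, Function<T, String> encode, Function<String, T> decode) {
            this.header = header;
            this.encode = encode;
            this.decode = decode;
        }
    }

    // Header names are invented for this example.
    static final Key<String> TRACE_ID =
            new Key<String>("x-example-trace-id", s -> s, s -> s);
    static final Key<Integer> TENANT_ID =
            new Key<Integer>("x-example-tenant-id", i -> Integer.toString(i), s -> Integer.parseInt(s));

    // Writes one typed value under that key's own header name.
    static <T> void put(Map<String, String> headers, Key<T> key, T value) {
        headers.put(key.header, key.encode.apply(value));
    }

    // Reads one typed value back, or null if the header is absent.
    static <T> T get(Map<String, String> headers, Key<T> key) {
        String raw = headers.get(key.header);
        return raw == null ? null : key.decode.apply(raw);
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        put(headers, TRACE_ID, "abc123");
        put(headers, TENANT_ID, 42);
        System.out.println(headers.size() + " headers, tenant=" + get(headers, TENANT_ID));
    }
}
```

Each key owns its header name and codec, so no catch-all type is needed and a service only decodes the keys it cares about.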
"You can only call this RPC while standing on one foot, with your hat cocked to the side. There's no way for you to find this out except calling it in a test and getting a weird error"
I am blown away that this is not already part of gRPC. A lifetime ago, I worked on Twitter's version of this sort of thing (thrift/finagle) and I assumed it was standard.
Both the context and message metadata are already part of gRPC. The gRPC system also allows for server [2] and client message interceptors. Essentially, this konig-kontext library provides interceptor implementations, e.g., [3], that use a hard-coded key for your serializable context, which gets read from and written to a gRPC header. The context provided by konig-kontext within your code is a wrapper around the existing gRPC Context [1].
The library is convenient for sure, but I feel that if you had a need to propagate context within gRPC, you'd probably have already discovered the API and implemented propagation with your own header keys.
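The interceptor pattern described above can be sketched without any gRPC dependency. This is a simulation under stated assumptions, not the library's code: only the header name comes from the thread, while `serve`, `call`, `serviceB`, and `serviceC` are hypothetical, a `Map` stands in for gRPC metadata, and a `ThreadLocal` stands in for the gRPC Context.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class InterceptorSketch {
    // Header name quoted in the thread; everything else here is hypothetical.
    static final String KEY = "konig-kontext-grpc-context";
    static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // "Client interceptor": attach the current context value, if any, to outgoing headers.
    static Map<String, String> outboundHeaders() {
        Map<String, String> headers = new HashMap<>();
        String value = CURRENT.get();
        if (value != null) headers.put(KEY, value);
        return headers;
    }

    // "Server interceptor": install the header value for the duration of the handler,
    // so the handler sees only what actually arrived in the headers.
    static String serve(Map<String, String> headers, String request, Function<String, String> handler) {
        String previous = CURRENT.get();
        String incoming = headers.get(KEY);
        if (incoming == null) CURRENT.remove(); else CURRENT.set(incoming);
        try {
            return handler.apply(request);
        } finally {
            if (previous == null) CURRENT.remove(); else CURRENT.set(previous);
        }
    }

    // Simulated RPC: the headers are the only thing that crosses the "network".
    static String call(Function<String, String> remoteHandler, String request) {
        return serve(outboundHeaders(), request, remoteHandler);
    }

    // Service C reads the context; service B never mentions it and just calls C.
    static String serviceC(String request) { return request + " user=" + CURRENT.get(); }
    static String serviceB(String request) { return call(InterceptorSketch::serviceC, request); }

    // Service A sets the context once and calls B.
    static String run() {
        CURRENT.set("alice");
        try {
            return call(InterceptorSketch::serviceB, "getOrders");
        } finally {
            CURRENT.remove();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The intermediate service B carries the value to C without referring to it anywhere in its own code, which is the decoupling the thread is debating.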
OpenTelemetry's Java implementation does this, but in a way that lets non-GRPC code access the context as well: it propagates the context through both the CoroutineContext and the thread-local state that OpenTelemetry itself uses, so that tracing information reaches Java code called from a Kotlin coroutine.
For example, I handle a request, get the incoming context, and have to stash it because I might execute a coroutine that is suspended and resumed across different threads. I then execute another GRPC call in a Java library, and that call starts, gets rescheduled, and resumes on a different thread, possibly in a different thread pool, when the response arrives.
The OpenTelemetry handling for this is quite complex: it must be used as a javaagent so it can actually instrument underlying libraries with the necessary code for handling thread scheduling/context switches in thread pools (e.g., ForkJoinPool), in threads themselves with cooperative scheduling in application code (e.g., Thread), and in Kotlin's coroutine handling, which is mostly codegen (e.g., async, suspend fun).
Finally, in my own Ph.D. work, we did a similar thing to propagate trace identifiers for a dynamic analysis for fault injection, and we quickly ran into a problem: not only is the propagation difficult in itself, but you also run the risk of running out of header space if you store any (longish?) information when GRPC is run over HTTP/2, because of the maximum allowed header size.
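One way to guard against the header-space problem is to measure the header list before attaching a context value. A rough sketch follows: the size formula is how HTTP/2 accounts for a header list (name bytes plus value bytes plus 32 bytes of overhead per field), but `HeaderBudget`, `fits`, and the 8192-byte limit used in `main` are placeholders. The real limit depends on the servers and proxies in your path.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class HeaderBudget {
    // HTTP/2 counts each header field as name bytes + value bytes + 32 bytes of overhead.
    static int headerListSize(Map<String, String> headers) {
        int total = 0;
        for (Map.Entry<String, String> e : headers.entrySet()) {
            total += e.getKey().getBytes(StandardCharsets.UTF_8).length
                    + e.getValue().getBytes(StandardCharsets.UTF_8).length
                    + 32;
        }
        return total;
    }

    // True if adding the context header keeps the header list within maxBytes.
    static boolean fits(Map<String, String> existing, String name, String value, int maxBytes) {
        Map<String, String> withContext = new LinkedHashMap<>(existing);
        withContext.put(name, value);
        return headerListSize(withContext) <= maxBytes;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("content-type", "application/grpc");
        // 8192 is an illustrative limit only, not a value taken from the thread.
        boolean ok = fits(headers, "konig-kontext-grpc-context", "trace-123", 8192);
        System.out.println("context fits: " + ok);
    }
}
```

A check like this only tells you when you are about to exceed the budget. Deciding what to drop or truncate is a separate design question.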
I'm also not sure what you mean by the context doesn't propagate between containers and/or pods -- GRPC isn't aware of these Docker/Kubernetes aspects at all.
Do you actually mean that unless explicitly propagated to a subsequent downstream RPC the data is dropped? If so, that's by design.
However, most large-scale organizations that are doing distributed tracing (e.g., Twitter, Uber) have either invented, reproduced, or leveraged OpenTelemetry's design for this precise thing.
Naive context propagation isn't (really) the difficult part with most of these designs -- it's what you've done, using an interceptor, reading the data and assigning it automatically on subsequent requests -- the challenge is dealing with this under many different, real world conditions: a.) concurrency and thread scheduling; b.) not all services use the same version of downstream RPC libraries; c.) not all calls are GRPC, and some use HTTP (and, different HTTP libraries, at that.); and d.) you cross message passing boundaries: i.e., I receive request, write to Kafka queue/reliable workflow backend (e.g., Cadence, Temporal) and re-read the request and then execute a subsequent RPC as a result of that message.
If you're using Kotlin, I suspect you will run into these challenges. Tune your thread pools up/down, restrict your JVM's resources, and you'll suddenly see that if the thing that handles the request uses different threads/coroutines/etc. than the code block that issues the downstream RPC, you'll start dropping the context without explicit handling of that case.
In fact, a very simple test case in Java, where you use several concurrently executed CompletableFutures that each issue RPCs in a very small thread pool, should be enough to see the issue.
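A stripped-down version of that test case can be written with only the JDK. This is a sketch under assumptions: a plain `ThreadLocal` stands in for thread-bound context storage, `callDownstream` is a pretend RPC, and `withContext` is a hypothetical helper showing one explicit way to carry the value across threads.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

class ContextLossDemo {
    // Stand-in for thread-bound request context (hypothetical, not a gRPC API).
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Pretend downstream RPC: reports the trace id visible on the calling thread.
    static String callDownstream() {
        String id = TRACE_ID.get();
        return id == null ? "missing" : id;
    }

    // Captures the caller's trace id and restores it on whichever thread runs the task.
    static <T> Supplier<T> withContext(Supplier<T> task) {
        String captured = TRACE_ID.get();
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                return task.get();
            } finally {
                if (previous == null) TRACE_ID.remove(); else TRACE_ID.set(previous);
            }
        };
    }

    // Returns {result of the naive call, result of the wrapped call}.
    static String[] run() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            TRACE_ID.set("trace-123"); // set on the request-handling thread
            Supplier<String> plainTask = ContextLossDemo::callDownstream;
            Supplier<String> wrappedTask = withContext(plainTask);
            String naive = CompletableFuture.supplyAsync(plainTask, pool).join();
            String wrapped = CompletableFuture.supplyAsync(wrappedTask, pool).join();
            return new String[] {naive, wrapped};
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] r = run();
        System.out.println("naive=" + r[0] + " wrapped=" + r[1]);
    }
}
```

The naive call runs on a pool thread that never saw the value, so it reports `missing`. The wrapped call sees `trace-123` because the value was captured on the submitting thread and reinstalled on the worker.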
Thanks for the comment, it seems like you know a lot more about this than I do.
This was a solution that has worked well for my company that averages <1 req/s, so yes I have not tested it under more extreme conditions. This is version 1.0.0 so it is quite new and naive by design. I was posting here to get some feedback on the initial version and see how I can improve it, which you have given me!
Feel free to contribute to the project! It seems like your expertise applies nicely!
It's too bad that there is still no JEP for an official context in Java.
Disclaimer: I wrote the OTel one, and am sad to see yet more context implementations being made, including the OTel one; these really need to all be on the way out.
Microservices require distributed debugging; distributed debugging requires distributed tracing. Just imagine, as I've been trying to push forwards in my own Ph.D. work, that you could debug a process across microservices. This is why we want this, possibly done a bit more resiliently and thoroughly than in the OP's library.
…or, I guess Barbara Liskov in 1987, with Argus. And yet, we still seem to debug programs interactively in isolation. Perhaps it’s because all of those systems assumed a system developed in isolation, one that didn’t evolve independently and wasn’t implemented in different programming languages communicating through different network protocols instead of function/method invocations.
Exactly that. The beauty of gRPC (and similar) is that you can look at the schema and know exactly what the necessary inputs and outputs are. With this addition, you no longer have any idea.
What if you have a nested stack of calls where microservice A calls microservice B which calls ... etc.? Then you're looking at the schema of microservice F, and even the source code where it's called in microservice E, and can't figure out how you can call it to get it to do the same thing. Little do you know, you need to set some things up that currently only A knows how to do.
Imo microservices should be a logical boundary only. Whether they're in the same process/server/cluster etc ideally should be purely based on throughput/latency/scale concerns. Even more ideally the services should be dynamically shuffled around hosts based on load at runtime.
Microservices are only used to scale engineering teams, not software. You can have a monolith that is still deployed in many different capacities or roles to get any optimization benefits. Microservices are needed because large groups aren't good at working together on software without very rigid and well-defined boundaries.
Academics have tried to make this a reality for years. I suggest revisiting Waldo's "A Note on Distributed Computing" and working forwards from there. If you want to go back further, look at Argus, Emerald, and the original Hermes (from DEC.)
> Whether they're in the same process/server/cluster etc ideally should be purely based on throughput/latency/scale concerns. Even more ideally the services should be dynamically shuffled around hosts based on load at runtime.
Think that's called Erlang.
> Imo microservices should be a logical boundary only.
Having the option to split it freely is a costly abstraction to deal with, especially if it is cross-language.
I prefer to just leave "cutting lines" in monolith. Well defined modules and relations between them so if some feature needs to be spun off it's not too hard.
As someone who programmed Erlang both professionally and published academically at Erlang venues for a long time, no.
These optimizations "for runtime" are not well supported by Erlang. Cluster performance changes dramatically and very quickly when the behavioral characteristics of message passing switch from local to remote to remote cluster, which was discussed at length in Waldo's paper back in the 90s. Dynamic relocation is not supported well (i.e., unless you use global, which falls apart quickly under network anomalies, about which I and several others wrote papers). And the runtime provides hardly any introspection into cluster performance.
Sadly, distributed Erlang had the edge on programming distributed systems almost 20 years before they became pervasive, but has since been left to atrophy and hasn't seen any real innovation in quite a long time.
It's called KonigKontext. Check it out here: https://github.com/konigsoftware/konig-kontext
Example use cases include:
Propagating security principals, or user credentials and identifiers, throughout an entire request lifetime across all of your microservices.
Propagating distributed tracing information. Set a request trace id upon receiving a request and later access that id in any downstream microservice.
However, Konig Kontext is built to support any type of context value, so it can be extended to fit any specific use case as well.
Let me know what you think!