| HN Mirror


by Diggsey 2666 days ago

That sounds horrific for whoever is going to be supporting that system...

The last thing I would want in a production environment is to have some 3rd party software monkey-patching the code at runtime.

What happens when:

- a bug only occurs (due to timing or some other extremely subtle issue) when this monkey-patching is applied.
- there's a bug in the monkey-patching itself (sounds like a fun debugging session!)
- a library is accidentally monkey-patched with a slightly different version, or falsely detected as a known library (maybe it is a fork)

Give me statically compiled, reproducible, dependency-free musl binaries that are bit-for-bit identical to what has been thoroughly tested in CI, any day. That's how you avoid getting woken up at 4am.

This kind of magic should happen at compile time, if at all.

4 comments

kcmastrpc 2666 days ago

We have hundreds of customers (and thousands of engineers) who are willing to make the trade-off.

It's OK that you're not, but I hope you can agree that engineering observability is neither cheap nor easy. If you're using standard libraries, frameworks, and tooling (and not going way off the rails), we have observed that, for the most part, our agent works as intended.

We always recommend our customers run the agent in their test and integration environments, but you are correct: there are always risks involved. Other than the automation, how is this any different from putting a New Relic jar into your Java app, or including a Datadog library? We simply figured out how to do it automatically at runtime.


Diggsey 2666 days ago

I definitely value ease of use and zero-setup solutions - the issue for me here is that the situations where I would consider running a service mesh are the same situations where predictability and reproducibility would trump ease of use - namely any kind of production setting where down-time is sufficiently undesirable.

Testing with the agent would certainly help, but then you lose some of the "ease of use" benefits as I expect you would have to run a mini cluster in CI in order to run your agent?

There are a few important differences between this and a "normal" dependency:

- Even if the application is fully tested with your agent, it could be something as simple as turning your agent off that could break things.

Hypothetical scenario: multiple instances of the application are running with your agent enabled. Someone decides to turn off monitoring for some reason - nothing bad happens and they go home at the end of the day. Later on, some instances are restarted, or the cluster is re-scaled. Now you have half your cluster on a different code-base and your serialisation breaks because you were doing something silly like using pickle or a Java object stream.

- The examples I mentioned in my previous comment would not happen with a normal dependency, because the version of that dependency would already be managed through standard means. If I were to go and look at the code, I would be able to see the actual code that is running, and the exact versions of all dependencies used.
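To make the hypothetical pickle scenario concrete, here is a minimal Python sketch. It assumes an agent that swaps an application class for an instrumented subclass at runtime; nothing in this thread says any particular vendor's agent does that. The names `shop_models`, `Order`, `TracedOrder`, `enable_agent` and `disable_agent` are all made up for illustration.

```python
import pickle
import sys
import types

# Hypothetical application module, registered so pickle can resolve classes by name.
app = types.ModuleType("shop_models")
sys.modules["shop_models"] = app


class Order:
    def __init__(self, order_id):
        self.order_id = order_id


Order.__module__ = "shop_models"
app.Order = Order


def enable_agent():
    """Simulate an agent replacing Order with an instrumented subclass (assumption)."""
    traced = type("TracedOrder", (Order,), {"__module__": "shop_models"})
    app.TracedOrder = traced
    app.Order = traced  # application code now constructs the traced class


def disable_agent():
    """Simulate an instance that was restarted without the agent."""
    app.Order = Order
    del app.TracedOrder


def round_trip_across_agent_toggle():
    enable_agent()
    payload = pickle.dumps(app.Order(42))  # written by an instrumented instance
    disable_agent()
    try:
        pickle.loads(payload)  # read by an uninstrumented instance
        return "loaded"
    except (AttributeError, pickle.UnpicklingError) as exc:
        return f"failed: {exc}"


outcome = round_trip_across_agent_toggle()
print(outcome)
```

pickle stores a reference to the class by module and name rather than the class itself, so the payload written while the patch was active points at a class that the unpatched instance cannot find.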


kcmastrpc 2666 days ago

Well, I’m guessing a competitor flagged my OP, but I digress. I was just trying to raise awareness of what is actually possible (even if it’s not free, but I’d argue nothing of value is ever free; even with open source, you still have to implement and operate it).

Anyway, I feel like we’ve come to an impasse. There is no monitoring solution out there which is bug-free (even OpenTracing and its various implementations have caused performance/stability issues, re: https://github.com/opentracing-contrib/java-spring-web/pull/...).

Regarding CI, our agent has no requirements other than a supported OS. You could be running your integration tests as a bare JVM and our agent would detect, instrument, and monitor it the same way as if it were running inside a CRI-O container on K8S (though I’d question why you would run your integration tests in that manner).

IRT your examples, I’ll be brief in my responses because you’re not wrong, but the engineers on our team have taken great care to ensure we don’t break our customers’ environments (we run on systems which process tens of millions of requests an hour and where minutes of downtime cause losses in the hundreds of thousands).

We dynamically unload our sensors/instrumentation when the agent is unloaded, so the issue mentioned earlier is unlikely to happen (though nothing is impossible).

We also don’t instrument serialization methods (unless you were to decide to use our SDK to do so) so that’d literally never happen. We hook onto methods which handle communication between systems — HTTP request handlers, DB handlers, Messaging System handlers, Schedulers, etc.
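As a rough illustration of hooking a communication method and unloading the hook again, here is a minimal Python sketch. It is not Instana's implementation; `Sensor` and `HttpHandler` are hypothetical names, and the only thing recorded is call duration.

```python
import functools
import time


class Sensor:
    """Hypothetical sensor: wraps one attribute of a target and can undo the wrap."""

    def __init__(self, target, attr_name):
        self.target = target
        self.attr_name = attr_name
        self.original = None
        self.timings = []

    def load(self):
        original = getattr(self.target, self.attr_name)

        @functools.wraps(original)
        def traced(*args, **kwargs):
            start = time.perf_counter()
            try:
                return original(*args, **kwargs)
            finally:
                # Record the duration whether the call returned or raised.
                self.timings.append(time.perf_counter() - start)

        self.original = original
        setattr(self.target, self.attr_name, traced)

    def unload(self):
        # Put the untouched original back so no wrapper stays in the call path.
        if self.original is not None:
            setattr(self.target, self.attr_name, self.original)
            self.original = None


class HttpHandler:
    """Stand-in for a framework request handler (assumption)."""

    def handle(self, path):
        return f"200 OK {path}"


original_handle = HttpHandler.handle
sensor = Sensor(HttpHandler, "handle")

sensor.load()
response = HttpHandler().handle("/orders")  # goes through the wrapper
sensor.unload()
HttpHandler().handle("/orders")  # goes straight to the original
```

After `unload()`, `HttpHandler.handle` is the same function object it was before `load()`, which is the property that matters for the "turn the agent off" scenario discussed earlier in the thread.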

Our sensors are open source, so you can check out the code if you’d like (https://github.com/instana). As I said earlier, we live in a world of trade-offs. I’d argue that systems which require the use of a service mesh are complex enough to warrant this level of automation, which provides visibility that, quite frankly, 99.9% of organizations don’t have the time to build themselves.


geofft 2666 days ago

I'm on your team, but a lot of people seem to be happy running, say, the HotSpot JIT compiler in production, so I think it's probably fine for most people.

At a certain level of scale you're running code you didn't write, anyway (some mix of open-source code and code from previous team members) and having exact source with exciting surprises you've never seen before isn't going to save you from getting woken up at 4 AM. Though it might make it easier to fix the problem.


monkeyoct 2666 days ago

Yeah, it sounds like really cool technology. It doesn't seem so bad if it's on your own software since at least you can test with Instana, although the potential for hard-to-debug mayhem is there.

Monkey-patching third party software will totally void the warranty on it. I've been involved in cases like this before and if there's any kind of weird bug that's conceivably related to the monkey-patching, it's hard to get help until you disable it.


lcalcote 2666 days ago

Not all, but some other vendor solutions autoinstrument popular languages, frameworks, and runtimes as well. Many, many customers leverage these capabilities happily, relying on those engineered solutions (akin to the reliance others place on the engineering put into OC).
