Hacker News
by hn_throwaway_99 601 days ago
While I agree with your primary pain point, I would argue that it really isn't specific to tests at all. It sounds like what you're really saying is that when something goes wrong, it's really difficult to determine which component in a complex system is responsible. I mean, from what you've described (and from what I've experienced as well), you would have the same problem, if not a harder one, if a user experienced a bug on the front end and then you had to find the root cause.

That is, I don't think a framework focused on front end testing should really be where the solution for your problem is implemented. You say "This is a very, very difficult thing to automate and requires AGI-level intelligence to really build a system that can go read the logs of some random service deep in our service mesh to understand why an e2e test fails." - I would argue what you really need is better log aggregation and system tracing. And I'm not saying this to be snarky (at scale with a bunch of different teams managing different components I've seen that it can be difficult to get everyone on the same aggregation/tracing framework and practices), but that's where I'd focus, as you'll get the dividends not only in testing but in runtime observability as well.
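
To make "log aggregation and system tracing" concrete, here is a minimal sketch of the idea, not any particular vendor's API. It assumes every service emits JSON log lines carrying a shared trace_id, and that those lines have already been collected in one place. The field names (ts, service, trace_id, level, msg), the sample log lines, and the helpers group_by_trace and first_error are all hypothetical, made up for illustration.

```python
import json
from collections import defaultdict


def group_by_trace(lines):
    """Group JSON log records from any service by their shared trace_id."""
    traces = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        traces[record["trace_id"]].append(record)
    # Sort each trace by timestamp so the order of events is visible.
    for records in traces.values():
        records.sort(key=lambda r: r["ts"])
    return dict(traces)


def first_error(records):
    """Return the earliest error-level record in one trace, or None."""
    for record in records:
        if record["level"] == "error":
            return record
    return None


# Hypothetical log lines from three services (assumed format, not real output).
LOGS = [
    '{"ts": 3, "service": "frontend", "trace_id": "t1", "level": "error", "msg": "checkout failed"}',
    '{"ts": 1, "service": "gateway", "trace_id": "t1", "level": "info", "msg": "POST /checkout"}',
    '{"ts": 2, "service": "inventory", "trace_id": "t1", "level": "error", "msg": "db timeout"}',
    '{"ts": 1, "service": "gateway", "trace_id": "t2", "level": "info", "msg": "GET /health"}',
]

traces = group_by_trace(LOGS)
root = first_error(traces["t1"])
print(root["service"], root["msg"])  # inventory db timeout
```

The point is that once every team tags its logs with the same trace ID, going from "the e2e test failed on the front end" to "the inventory service timed out first" is a lookup rather than an investigation, and the same lookup works for production incidents.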

2 comments

Agreed. Is there a good tool you'd recommend for this?

It's been quite some time but New Relic is a popular observability tool whose primary goal (at least the original primary goal I'd say) is being able to tie together lots of distributed systems to make it easier to do request tracing and root cause analysis. I was a big fan of New Relic when I last used it, but if memory serves me correctly it was quite expensive.

"OpenTelemetry and other tools are promising, but again, I’ve never seen good enough infra that puts that all together."

It's a two paragraph comment and you somehow missed it.

I did read it, and I don't understand why you feel the need to be an asshole.

Like I said in my comment, I do think getting everyone on the same page in a large, diverse organization is difficult. That said, it's not rocket science, and it's usually difficult because there aren't organizational incentives in place to actually ensure teams prioritize making system-wide observability work.

FWIW, the process I've seen at more than one company is that people bitch about debugging being a pain, they put in a couple of half measures to improve things, and then finally it becomes so much of a pain that they say "fine, we need to get all of our ducks in a row", execs make it a priority, and then they finally implement a system-wide observability process that works.

Exactly! I've never seen a 5000+ eng org that has all its ducks in a row when it comes to telemetry. It's one of those things where you can't just put a team in charge of it and get results. Everyone has to be on the same page, which in a big org is hardly ever the case.