by mlhpdx 842 days ago
I don’t know that that’s true. My last two very-not-meta-sized companies have both had systems that were very cost-effective and essentially what the article describes. It’s not the simplest thing to put in place, but it's far from unapproachable.

I think one of the big hills is moving to a culture that values observability (or whatever you choose to call it; I prefer forensic debugging). It’s another thing to understand and worry about, and it helps tremendously if there are good, highly visible examples of it.

Edit: Typo.


Could you share some specifics of how it could be approached?
I don't know what that commenter has in mind. In my own experience building this up, you start with usable information rather than trying to instrument everything at once. That usually means:

- some way to get to errors when they happen

- zeroing in on the key performance indicators for your application and relating them to infra metrics, particularly resources (because CPU, memory, storage, and bandwidth cost money).

Unless you have both domain and infra knowledge, it will be hard to know ahead of time which indicators matter.

For a stateless web app backed by a db, you're typically starting with:

- request metrics (req/s, latency)

- authenticated user activity

- db metrics (such as what you'd get with pganalyze)
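
To make the request metrics item concrete, here is a minimal sketch of tracking req/s and latency over a sliding window using only the Python standard library. The `RequestMetrics` class and its method names are made up for illustration; in a real app you would more likely get these numbers from your metrics library or APM agent than hand-roll them.

```python
import math
import time
from collections import deque


class RequestMetrics:
    """Sliding-window request rate and latency tracker (illustrative only)."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, latency_seconds), oldest first

    def record(self, latency_seconds, now=None):
        """Store one finished request; `now` is injectable for testing."""
        now = time.monotonic() if now is None else now
        self.samples.append((now, latency_seconds))
        self._evict(now)

    def _evict(self, now):
        # Drop samples that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def requests_per_second(self, now=None):
        now = time.monotonic() if now is None else now
        self._evict(now)
        return len(self.samples) / self.window_seconds

    def latency_percentile(self, pct, now=None):
        """Nearest-rank percentile of latencies in the window, or None if empty."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self.samples:
            return None
        latencies = sorted(latency for _, latency in self.samples)
        rank = max(1, math.ceil(pct / 100 * len(latencies)))
        return latencies[rank - 1]


# Hypothetical usage: four requests, one per second, in a 10-second window.
metrics = RequestMetrics(window_seconds=10.0)
for t, latency in enumerate([0.1, 0.2, 0.3, 0.4]):
    metrics.record(latency, now=float(t))
print(metrics.requests_per_second(now=3.0))     # 0.4
print(metrics.latency_percentile(95, now=3.0))  # 0.4
```

Percentiles tend to be more useful than averages here, since a handful of slow requests is usually the first thing users notice.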

It's when there is resource pressure that things get interesting. Here, you have product-fit, you have user traction and growth, but now your app is falling down because it is popular.

It is tempting to just crank things up horizontally and say you're trying to land-grab users, but then your team will never develop the discipline to build scalable and reliable software. It's here that you start adding instrumentation to find bottlenecks -- whether that is instrumenting spans, adding metrics, optimizing queries, etc. You also need to craft the dashboard to give actionable intelligence. Here's where Datadog's notebook feature is great -- you explore (and collaborate) with the notebook until you can find the bottleneck, and then export the useful metrics into a dashboard. Then you set up the monitoring, because you have found the key performance indicators.
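
As a rough illustration of what "instrumenting spans" to find a bottleneck can look like, here is a hand-rolled sketch in plain Python. `SpanRecorder` and the span names are hypothetical, and this is not the OpenTelemetry or Datadog API; a real tracing library would also give you nesting, sampling, and export.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SpanRecorder:
    """Accumulates wall-clock time per named code section (illustrative only)."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock  # injectable so tests can use a fake clock
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def span(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Record the duration even if the wrapped code raised.
            self.totals[name] += self.clock() - start
            self.counts[name] += 1

    def slowest(self, limit=3):
        """Span names ranked by total time spent, largest first."""
        ranked = sorted(self.totals.items(), key=lambda item: item[1], reverse=True)
        return ranked[:limit]


# Hypothetical usage inside a request handler.
recorder = SpanRecorder()
with recorder.span("db_query"):
    time.sleep(0.02)   # stand-in for a slow query
with recorder.span("render"):
    time.sleep(0.001)  # stand-in for template rendering
print(recorder.slowest(1))
```

Note that nested spans would count the inner time in both the inner and outer totals, so this only ranks sections that don't overlap. The point is the workflow: wrap the suspects, rank them, and only then decide what to optimize or put on a dashboard.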

It's this active search to understand what is going on in _both_ app and infra that shows you the limits of the current architectural designs, guides what you need to do, and validates the architectural and engineering decisions for the future. This active search may involve tools beyond OpenTelemetry or Datadog or Honeycomb -- maybe you have to attach a REPL, or go poking around with a memory profiler.

What you _don't_ do is blindly add these things because having the capability somehow makes things better. Rather, you incrementally improve your capability in order to solve the present scalability and reliability problems with your app and its infra.