| I miss the powerful metrics and logging systems that I used at Amazon. > Processing/streaming logs to get metrics is a terrible waste of time, energy and money.
> Spend that producing high quality metrics directly from the apps Absolutely not. Most application metric systems generate metrics as text strings with a simple format that is parsed by the metric collector. This is also what we call a structured log. Parsing such text strings takes very little CPU. All logs and metrics represent events. A good approach is to prefer numerical values where possible, but only for quantities that are comparable. Metrics are for the "how many?" question. But never forget to log text events, because you need to answer the "what happened?" question. Don't be afraid of generating too many different metrics, but avoid too-frequent datapoints and unnecessary verbosity in logs. Never dump complex objects "just in case". Treat overlogging and underlogging as bugs. Spend time every day reviewing the metric dashboards and improving them constantly. If it takes more than 10 seconds to add a new non-obvious chart (e.g. to calculate a ratio between 2 metrics, a percentile or some other computation), throw away your charting system. Lying with numbers is very easy: always look at distributions, not just instant values. Some metrics must be represented as percentiles, because min/avg/max are meaningless for them. Percentiles are good for ignoring meaningless outliers, but always count the outliers to ensure that you are not ignoring meaningful data, especially during incidents. Metrics and text logs tell a story together. Process, correlate and visualize them together as much as possible. |
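To make the point about structured logs and percentiles concrete, here is a minimal sketch in Python. It is not any particular collector's format or API: the `key=value` line layout, the field names (`ts`, `metric`, `value`) and all function names are assumptions made up for illustration. It parses metric lines out of text, then reports min/avg/max next to nearest-rank percentiles and counts the datapoints above the chosen percentile instead of silently dropping them.

```python
import math
import statistics

# Hypothetical structured log lines: one event per line, key=value fields.
SAMPLE_LINES = [
    f"ts={i} metric=latency_ms value={v}"
    for i, v in enumerate([12, 15, 11, 14, 13, 16, 12, 15, 14, 950])
]


def parse_line(line: str) -> dict[str, str]:
    """Split a 'key=value key=value' line into a dict of strings."""
    return dict(field.split("=", 1) for field in line.split())


def metric_values(lines: list[str], name: str) -> list[float]:
    """Collect the numeric values of every event for one metric name."""
    events = (parse_line(line) for line in lines)
    return [float(e["value"]) for e in events if e.get("metric") == name]


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("no datapoints")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values: list[float], p: int = 90) -> dict[str, float]:
    """Report min/avg/max, the median, the p-th percentile and the outlier count."""
    cutoff = percentile(values, p)
    return {
        "min": min(values),
        "avg": statistics.fmean(values),
        "max": max(values),
        "p50": percentile(values, 50),
        f"p{p}": cutoff,
        # Count what the percentile hides rather than ignoring it.
        "outliers": sum(1 for v in values if v > cutoff),
    }


if __name__ == "__main__":
    print(summarize(metric_values(SAMPLE_LINES, "latency_ms")))
```

On this sample the average comes out at 107.2 while the median is 14, and exactly one datapoint (950) sits above the p90 cutoff of 16. The average alone tells a misleading story, and the outlier count tells you there is something worth looking up in the text logs.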