Full disclosure, I work at Foxglove right now. Before joining, I spent over seven years consulting and had more than 50 clients during that period. Here are some thoughts:

* Combing through the syslogs to find issues is an absolute nightmare, even more so if you are told that the machine broke at some point last night
* Even if you find the error, it's not necessarily when something broke; it could have happened way before, but you just discovered it because the system hit a state that required it
* If combing through syslog is hard, try rummaging through multiple mcap files by hand to see where a fault happened
* The hardware failing silently is a big PITA - this is especially true for things that read analog signals (think PLCs)

Many of the above issues can be solved with the right architecture or tooling, but often the teams I joined didn't have it, and lacked the capacity to develop it.

At Foxglove, we make it easy to aggregate and visualize the data and have some helper features (e.g., events, data loaders) that can speed up workflows. However, I would say that having good architecture, procedures, and an aligned team goes a long way in smoothing out troubleshooting, regardless of the tools.
• Correlates syslogs with mcap/bag file anomalies automatically
• Flags when a hardware failure might have begun (not just when it manifests)
• Surfaces probable root causes instead of leaving teams to manually chase timestamps
From your experience across 50+ clients, which do you think is the bigger time sink: triaging data across multiple logs/files, or interpreting what the signals actually mean once you’ve found them?
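
To make the first bullet a bit more concrete, here is a rough sketch of what "correlating" could mean at its simplest: for each anomaly timestamp, pull the syslog entries that fall within a small time window around it. Everything here is an assumption for illustration: the `correlate` function, the 5-second default window, and the idea that the syslog lines are already parsed into `(timestamp, message)` pairs and that the anomaly timestamps have already been extracted from the mcap/bag files by some other step. It is not how Foxglove or any existing tool does it.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def correlate(syslog_entries, anomaly_times, window=timedelta(seconds=5)):
    """Map each anomaly time to the syslog entries within +/- window of it.

    syslog_entries: iterable of (datetime, message) pairs (hypothetical,
                    assumed to be parsed from syslog beforehand).
    anomaly_times:  iterable of datetimes (hypothetical, assumed to come
                    from an earlier pass over the mcap/bag data).
    """
    # Sort once so each lookup is a binary search rather than a full scan.
    entries = sorted(syslog_entries, key=lambda e: e[0])
    stamps = [ts for ts, _ in entries]

    matches = {}
    for t in anomaly_times:
        lo = bisect_left(stamps, t - window)
        hi = bisect_right(stamps, t + window)
        matches[t] = entries[lo:hi]
    return matches


if __name__ == "__main__":
    # Made-up sample data, purely for demonstration.
    syslog = [
        (datetime(2024, 1, 1, 3, 0, 0), "usb 1-1: USB disconnect"),
        (datetime(2024, 1, 1, 3, 0, 4), "driver: read timeout"),
        (datetime(2024, 1, 1, 5, 0, 0), "cron: job started"),
    ]
    anomalies = [datetime(2024, 1, 1, 3, 0, 2)]
    for when, nearby in correlate(syslog, anomalies).items():
        print(when, [msg for _, msg in nearby])
```

A fixed window like this only catches log lines close to where the fault shows up. As the comment above points out, the actual breakage can happen well before it is discovered, so a real tool would need something smarter than a symmetric window.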