|
From 2003 to 2005: Co-created our database release process, integrated it with our code release process, developed our production problem management process, and built many of our production monitoring tools. Much of that has been superseded in the intervening 15 years, but it still pleases me to walk by the NOC and see tools I wrote 10-15 years ago still running (now maintained by others). One of the most useful and longest-lived is also one of the simplest (I literally built the essence of it in 4 hours, 6-10 PM one evening). The first version drew a timeline and plotted every order on it: 1 second per pixel in X, logarithmic dollar value in Y. It has since evolved to show per-minute summary data on screen (AOV, CR%, errors/info/warnings/404s, total bookings, paid vs. unpaid orders, database connections in use, idle connections available, long-running transactions, long-running pages, etc.) and to record to a database so you can "play back" outages or go exploring. It's not the best tool for deep digging, but when you want a fast-reacting quick check that the entire site is working post-release or post-outage, it shows unambiguously whether people are getting all the way through checkout. You might be surprised what you can learn from such a simple tool. |
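
To make the plotting idea concrete, here is a minimal sketch of how one order could be mapped to a screen position: 1 second per pixel in X, logarithmic dollar value in Y. This is not the original tool's code. The function name `order_to_pixel`, the plot height, and the $1 to $10,000 range are all assumptions chosen for illustration.

```python
import math


def order_to_pixel(order_ts, amount, window_start_ts,
                   height=401, min_amount=1.0, max_amount=10_000.0):
    """Map one order to (x, y) pixel coordinates on the timeline.

    All names and defaults here are hypothetical, not taken from the
    original tool.
    order_ts / window_start_ts: Unix timestamps in seconds.
    amount: order value in dollars.
    """
    # X axis: one pixel per elapsed second since the left edge of the window.
    x = math.floor(order_ts - window_start_ts)

    # Clamp so that very small or very large orders stay on screen.
    clamped = min(max(amount, min_amount), max_amount)

    # Y axis: position of the amount on a log10 scale, as a 0..1 fraction.
    log_lo = math.log10(min_amount)
    log_hi = math.log10(max_amount)
    frac = (math.log10(clamped) - log_lo) / (log_hi - log_lo)

    # Row 0 is the top of the screen, so larger orders get smaller y values.
    y = round((height - 1) * (1.0 - frac))
    return x, y


# Example: a $100 order placed 30 seconds into the window.
print(order_to_pixel(1030, 100.0, 1000))
```

On a log scale a $10 order and a $1,000 order are both clearly visible on the same screen, which is part of why a gap in the dots stands out so quickly.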