| There's a difficult distinction here, you're right. Technically, even a single server running LAMP as root while taking frontend traffic meets the definition of "in production", but I think we all recognise that it's not the right idea. What I'm referring to is this: if the disk starts to have issues, what does Prometheus do? If the scrapers start to stall due to connection timeouts, what does Prometheus do? If you are doing linear interpolation of data and you have massive gaps because you're polling opportunistically, what does Prometheus do? I'm all about boring technology, but Prometheus assumes too much of the happy path. It assumes that a single node is enough for time series data that is used for alerting. Which it is, at very small scale and with best-effort reliability. It's not acceptable as soon as lost data could be critically important for diagnosing major issues in billing systems, for actually billing users, or for inferring issues that need to be correlated across multiple systems. |