by throwaway458864 1713 days ago
Assuming you mean the connections between the components - a hodge-podge of different models, tools, techniques. There is no one way to do it, partly because of how different any given system can be from another. Even within software engineering, it really depends on the industry you're in, the application of the software, the stakeholders, the risks.

But generally speaking, most people only track the connections at design time, as an artifact of overall architecture. And this isn't great, because as the system changes (modern software systems change constantly) the entire system development lifecycle is not being re-assessed every time some component changes.

So in the best case, with a Waterfall model, you have very well defined connections in design, and you have to pray that your SDLC validates that design. But most people prefer Agile, which in practice means "I don't need a well defined system! #YOLOEngineering". So everything is built ad-hoc and nobody even attempts to figure out the entire picture. And in that case, Operations may be told to figure it out (they're the ones running it all, so they have the best vantage), and they tend to implement monitoring and distributed tracing that enables cobbling together a picture of how things are actually working. But that's not fed back into teams' designs, it's just used for addressing problems after the fact.

To be specific: you might use ADRs and manually crafted diagrams to map out the connections, or UML, or some other systems diagramming tool/standard. But often that's created only at a certain level of the system, and doesn't dive deep into component interfaces or tolerances/limits or availability. So the full picture can never be seen from one view, and it's almost never the teams themselves mapping it out.


That's exactly what I meant. For standardization, does Kubernetes help in that regard? For example, when using network rules to whitelist which component is allowed to communicate with which service? I imagine extracting the current rules and building a graph would make discovery easier. No tolerance/limits/throughput or availability data is included, though. The approach is also limited to the cluster level and excludes out-of-cluster communication, while putting everything in the cluster may not be that secure.
You're spot on, it would provide limited information. In fact, it may be better to use a network monitor to trace network connections and graph that. Old network rules stick around, so a graph of just the rules would show you connections that may no longer exist. And network rules are often made of CIDRs or port ranges, so they don't tell you which actual nodes are receiving traffic. If the CIDR and port range include multiple networks with multiple components each, you don't really know what's connected to what. Distributed tracing gives you basically the same picture from the application layer (and includes network calls).
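
A rough sketch of that gap, assuming made-up rule and flow records: the component names, addresses, and field names below are hypothetical and don't correspond to the Kubernetes NetworkPolicy schema or any particular monitor's export format. It expands each CIDR-based rule into every component it could cover, then diffs that against the flows actually observed.

```python
import ipaddress

# Hypothetical inventory: component name -> IP address (illustrative only).
COMPONENTS = {
    "web": "10.0.1.10",
    "api": "10.0.2.10",
    "db": "10.0.2.20",
    "legacy-batch": "10.0.2.30",
}

# Hypothetical allow rules: source component and destination CIDR.
# Ports are left out to keep the sketch short.
RULES = [
    {"src": "web", "dst_cidr": "10.0.2.0/24"},
    {"src": "api", "dst_cidr": "10.0.2.20/32"},
]

# Hypothetical flows captured by a network monitor: (src, dst, port).
OBSERVED = [
    ("web", "api", 8080),
    ("api", "db", 5432),
]


def edges_allowed_by_rules(rules, components):
    """Every (src, dst) pair the rules would permit, whether or not it is used."""
    edges = set()
    for rule in rules:
        network = ipaddress.ip_network(rule["dst_cidr"])
        for name, addr in components.items():
            if name != rule["src"] and ipaddress.ip_address(addr) in network:
                edges.add((rule["src"], name))
    return edges


def edges_observed(flows):
    """Collapse observed flows to (src, dst) pairs, ignoring the port."""
    return {(src, dst) for src, dst, _port in flows}


allowed = edges_allowed_by_rules(RULES, COMPONENTS)
seen = edges_observed(OBSERVED)

# Edges the rules permit but no traffic ever used: candidates for stale
# or overly broad rules, not real connections.
never_used = sorted(allowed - seen)
print(never_used)
```

With this toy data the /24 rule makes it look like web talks to db and legacy-batch, which the observed flows don't support. That's the kind of phantom edge a rules-only graph would draw.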

Like yourapostasy says, this kind of post-hoc system design can lead to fallacies, and doesn't contribute to the initial design of the system. If you have nothing else to go on, it helps. But your time is probably better spent investing in formal specifications, and then developing components, connections, and all the operational aspects as implementations and validations of the specification.

Many papers have been published about this, spanning from the 70s to the late 90s, talking about the evolution of software systems engineering. From the 2000s on, software engineering became more art than science, once the Agile Manifesto (2001) gave everyone an excuse to stop caring about rigor.

Oh, ho ho. It is so much more than network dependencies. K8s helps somewhat by pointing in a possible direction, but this is truly an Alice in Wonderland, "just how deep into the rabbit hole do you want to go?" problem space. Note the following is from the big-org perspective; small organizations don't have this problem nearly as badly, but might start seeing it more as we all move into the cloud.

IMHO, the declarative configuration management folks have their heart in the right place, but at their level we've already lost a lot of information and are just shoving around peas on the plate. Post hoc systems information capture is always a lossy, imprecise, empirically-driven affair. Service registries are only scratching the surface.

Everyone is afraid to bite the bullet and start Encoding All The Things, because down that path lies religious wars over what to encode and how to express the encoding. Even with a service registry, I lack information on SLOs, SLAs, RTOs, RPOs, planned outages, A/B (and C/D/E/...) state, ownerships of all kinds, responsibilities of all kinds, architecture, deps of all kinds, onboarding steps and constraints, governance gates, decomm steps and constraints, change approval gates, the timing of each of those, and so on. That's just capturing the information; now imagine the insanity of walking that nightmare graph to seek impossible interlocks (which we humans accept by overriding with outages, for example), or to figure out just how long it should take to accomplish a given set of related goals.
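
To make "walking the graph" concrete, here is a toy sketch with only two of those attributes encoded (owner and RTO) plus hard dependencies. The `Service` record, the registry entries, and the rule "a service can't promise faster recovery than something it depends on" are all assumptions for illustration; a real registry would carry far more and the interlock rules would be messier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    name: str
    owner: str
    rto_minutes: int  # recovery time objective this service promises
    deps: List[str] = field(default_factory=list)  # hard dependencies only


# Hypothetical registry entries (names and numbers are made up).
REGISTRY = {
    s.name: s
    for s in [
        Service("checkout", "team-pay", rto_minutes=15, deps=["orders", "auth"]),
        Service("orders", "team-orders", rto_minutes=30, deps=["db"]),
        Service("auth", "team-id", rto_minutes=10),
        Service("db", "team-data", rto_minutes=60),
    ]
}


def transitive_deps(name, registry):
    """All services reachable from `name` via deps (depth-first, cycle-safe)."""
    seen = set()
    stack = list(registry[name].deps)
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(registry[dep].deps)
    return seen


def rto_conflicts(registry):
    """(service, dependency) pairs where the dependency recovers more slowly
    than the service itself has promised to."""
    conflicts = []
    for svc in registry.values():
        for dep in transitive_deps(svc.name, registry):
            if registry[dep].rto_minutes > svc.rto_minutes:
                conflicts.append((svc.name, dep))
    return sorted(conflicts)


print(rto_conflicts(REGISTRY))
```

Even this one check over four services turns up three conflicts. Multiply that by every attribute in the list above and every team that disagrees about what the fields mean.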

We currently handle this as an industry through blunt force trauma on the problem space itself, while contorting ourselves as Matrix-like as possible to sustain as little trauma in return as we can, through a hodge-podge of techniques, tools, processes, and exasperation. At this point, I'm not exactly certain we'll fully address this space without a Culture Mind-level AI (said tongue in cheek; I really do think there is some promising work being done in this field, it is just a grind).