HN Mirror


by trelliscoded 1017 days ago

Big tech products are decoupled from the physical world and ultimately not critical for society to function, so the workloads are highly variable. Mainframes are a good fit when you have workloads that don't vary by several orders of magnitude on a daily basis, and you need an ecosystem that treats reliability like a religion.

If you haven't used z/OS or worked in a z/OS shop, you will never understand how fundamentally different mainframe environments are. Sure, JCL is weird and ISPF doesn't make it clear how much raw power and how many OS features are available under the hood, but once you've worked on a real production application maintained by a good team, none of the syntax or UX stuff matters anymore. The abstract representation of the entire application and everything that supports it starts to live in your head and you start to realize how incredibly backwards a lot of modern cloud scale stuff is.

The dirty little secret of a lot of z/OS shops is that they are already using modern cloud environments on their existing hardware. All the productivity gains of GitHub Copilot + Visual Studio Code, Python, and Rails are available on the same high speed, high reliability, high capacity frame where you already keep all your data sets. You can even spin up a Linux LPAR if you don't want to run on top of z/OS directly. The Node.js people can whip out Next prototypes and slurp in 500 megs of questionable NPM modules on top of either OS, too.

While "modern" environments focus on bolting APIs together using web technologies, a typical mainframe environment simply hosts everything in the same place and uses the data sets on disk or the database as the interface between applications. This allows you to do all kinds of stuff that's fundamentally impossible in a cloud environment. You can instantly give an application a 100% consistent view of a checkpoint of a 200 terabyte data set while it's being used by 3 other applications pounding out a million IOPS, and still remain failover ready to the other hosts in your metro sysplex. Many sites are sized well enough that when the LTO robot backs it up while all this other stuff is going on, there's not even a blip in your transaction latency numbers.

There is no cloud vendor, and probably never will be a cloud vendor, capable of providing support for all of these features, even if they had them in the first place. I've worked on a service that's white box resold by all three major cloud vendors, and believe me when I tell you that there is no way you can get everyone involved in every component that might be contributing to, say, a weird block storage problem on a conference call. You just can't. You'll get escalated to someone who might have a chance at maybe doing some initial isolation to narrow down the scope of what's causing it, but that's ridiculous if you're a bank or a nationwide grocery store. Your EBS call is returning an HTTP 302 today? There's no documentation for that, not at the level you're going to want when it's breaking production at a factory that employs thousands of people. None of those industries use any of these fad technologies for anything important for exactly that reason.

Mainframes, on the other hand, have these kinds of support considerations built into nearly every component of the environment. The error message culture alone eliminates all the frantic googling that most non-mainframe engineers are used to doing when they get error messages. The OS and applications are expected to emit error and warning codes, which translate to actual English text that tells you what happened, and what to do about it. If necessary, IBM support can remotely JTAG any component in the frame after it's been automatically taken out of service.

Your SAN vendor can't point fingers at your server vendor, because they're the same vendor. You can't even call the network vendor for your ToR switch, because it doesn't exist; the frame has multiple internal PCIe, InfiniBand, and IBM coupling adapter backbones, and you don't need a switch because you can simply add 96 WAN-capable 10 gig ports to each frame. There's no separate cluster interconnect transceiver vendor for the optics; that's IBM too.

If any of this detailed documentation refers to some operating system data structure you've never heard of, well, those are all documented in the MVS data area manuals, volumes 1-4, with convenient "eyecatcher" human-readable four character strings so you can quickly identify them in a hexdump of the operating system's memory. Nearly every single thing the OS outputs is also documented in the system message manuals, volumes 1-10. Don't even need to bookmark it, everything's immediately available via the master documentation index at ibm.com/docs/zos.
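To make the eyecatcher idea concrete, here's a minimal Python sketch of scanning a raw dump for those four-character strings. Everything in it is an assumption for illustration: the function name `find_eyecatchers` is made up, the dump is taken to be EBCDIC (Python's `cp037` codec), and "ASCB" and "ASXB" are just sample names; the data area manuals are the authority on which control blocks carry which eyecatcher and at what offset.

```python
def find_eyecatchers(dump: bytes, names, encoding="cp037"):
    """Return sorted (offset, name) pairs for each eyecatcher found in dump.

    Assumes the dump holds EBCDIC text (cp037) and that eyecatchers are
    four characters, blank-padded on the right. Both are assumptions made
    for this sketch, not something taken from IBM documentation.
    """
    hits = []
    for name in names:
        # Pad to four characters and convert to the dump's encoding.
        needle = name.ljust(4).encode(encoding)
        start = 0
        while True:
            idx = dump.find(needle, start)
            if idx == -1:
                break
            hits.append((idx, name))
            start = idx + 1  # keep scanning past this hit
    return sorted(hits)


if __name__ == "__main__":
    # Synthetic "dump": zero padding with two eyecatchers embedded in it.
    sample = (
        b"\x00" * 8
        + "ASCB".encode("cp037")
        + b"\x00" * 4
        + "ASXB".encode("cp037")
    )
    for offset, name in find_eyecatchers(sample, ["ASCB", "ASXB"]):
        print(f"{offset:#06x}  {name}")
```

A real dump tool does far more than this, but the point stands: because the strings are documented and human-readable, a plain byte search is enough to get your bearings.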

There is absolutely no comparable equivalent to any of these supportability-oriented resources in any other supported software ecosystem that exists today. Just go look at the manual for DFSORT, which is the z/OS equivalent to the Unix sort utility. Seriously, go look at it:

https://www.ibm.com/docs/en/SSLTBW_2.5.0/pdf/icem100_v2r5.pd...

There's over a hundred pages explaining what to do for any of the runtime messages the utility can output. There's no handwavy "undefined behavior" like you get with C and Unix. IBM has thought through every possible thing that can go wrong with a sort utility, with paragraphs of supplemental information and advice for many of them. If you use the online facilities to look it up on the mainframe itself, it will conveniently ask you if you want to print out the relevant documentation on the printer closest to you (yes, it knows where it is). Most of the third-party application software is the same way; this level of documentation and programmer attention to detail is simply part of the culture.

People say IBM's mainframe support is really good, and in my experience it usually has been, especially when you're hard down and they get their best people on the phone immediately, but no one ever mentions that the best thing about their support is that they do all this behind the scenes work so you can support yourself using the documentation, without having to call them. Meanwhile, things which are extremely basic, fundamental, excruciatingly well-documented operations in z/OS continue to be totally impossible in every single other operating system ecosystem, like "is this file in the page cache, and if not, why?" Imagine trying to answer that question without having to break out bpftrace or WinDbg in kernel debugging mode, and even then you'll have to go read the source code or load ntoskrnl into Ghidra to figure out where the data structures are. The worst thing about all this is that a lot of people in the industry think that it's normal and they're patting themselves on the back for being oh so clever at knowing how to do that.

Once you've been exposed to the mainframe way of doing application development, it ruins you for how most of the rest of the industry prioritizes feature velocity, how their focus is growing DAUs for people looking at cat pictures, or how they run PR campaigns about how it wasn't actually lying, technically speaking, when their other PR campaigns said the car can drive itself, and none of that had anything to do with killing those drivers when they took it at face value. Oh, by the way, the car can fart now, did you know?

All of this narcissistic Silicon Valley tech industry bullshit starts to feel, quite frankly, kind of disrespectful and like you're in a business relationship with juvenile clowns if you're trying to run a business that people are depending on for something that actually matters. Building things for support and for uptime isn't sexy, is bad for growth, and doesn't require a team of rockstar SREs who dress and act like fighter pilots. There is no "module of the week" for CICS. There will never be an "oh-my-TSO" package with emoji and themeable color schemes. If you go watch "For All Mankind," the 1970s-era TSO prompts in the TV show look exactly the same as they do when you fire up your 3270 emulator on your iPhone today and remote into your frame. Everything about the mainframe ecosystem is exactly the kind of boring you want when your applications are supporting real people, spending real hard-earned money, on real products and services. That's why the parts of the tech industry that are constantly in the news don't use any of this technology: it works.

2 comments

jason2323 1017 days ago

I've heard similar arguments several times, and every single time these kinds of arguments go up in flames when costs are brought up. Even discounting the "elasticity" that building applications on the cloud gives you, it's simply far cheaper to build on cloud. Folks fail to mention that the kind of all-round support from IBM that you mention comes at a significant cost that is unaffordable to most businesses. It's probably cheaper to debug esoteric Linux issues than to call in IBM support. In fact, IBM knows this and has moved a significant chunk of their support team abroad to reduce costs and remain affordable. In my experience, these support teams are usually of poorer quality.


deterministic 1015 days ago

What an excellent answer. Nailed it!
