Hacker News
by stcroixx 1288 days ago
I worked on a system deployed on AIX for a few years. We used it to distribute batch workloads in parallel across a cluster of machines - no other software and OS did such a thing, maybe still doesn't. The machines themselves were PowerPC RS/6000s, which only ran AIX. The company already had a close relationship with IBM because they'd been running mainframes for decades. They made tons of money, so saving money on licence costs was not important.
2 comments

> distribute batch workloads in parallel across a cluster of machines - no other software and OS did such a thing, maybe still doesn't

What am I missing?

One of my college courses ages ago was about working with MPI which we got to run on the hpc cluster.

The last time I needed to run N copies of something I asked kubernetes to do it. The time before that, I asked $cloud_vendor for N identical VMs with the same cloud-init script.

Supposedly Google's in-house stuff, which kubernetes and map-reduce (the product, not the concept) are public versions of, is all about running stuff well on huge groups of machines.

I can't speak for OP or their particular use case (because I don't know about it), but kubernetes is not a panacea for large scale batch processing.

For example, large banks need to securely, reliably, and very efficiently process an unfathomable number of transactions[1]. In this case, kubernetes would be a giant waste of resources and would add needless complexity. The former hampers throughput; the latter means security and reliability suffer.

For people not familiar with it (me included, actually), it can be mind-boggling what throughput is achieved, and what mechanisms for reliability are in place. Not just in software, in actual hardware; this goes way beyond ECC memory.

[1] Transactions in the bank sense, not in the computer sense, because I don't want to confuse matters more. In the mainframe world for example, there is a difference between "batch processing" and "online transaction processing", but both could be applied to bank transactions. Note that I'm not advocating for the mainframe world here.

Interesting, can you give a rough order of magnitude for the txn/s throughput achieved? I would also be really interested in more info or pointers on the hardware reliability mechanisms!
MPP was the acronym for the pattern. It let us take a binary executable, which would normally just be a single PID on one node, and distribute that process and all its resources, including files, across however many nodes you had configured. The software that managed this was heavily dependent on RISC processor architecture, which basically meant only AIX on RS/6000. Google's solution is designed for commodity processors. The modern equivalents let you do something sort of similar, but you have to design for that, and as far as I know you’re not sharing resources like open file handles across the cluster. We took C programs and ran them as is.
Hey there! It depends on exactly how transparent you're talking, but I rewrote some Fortran and Honeywell assembly to 'C' on lowest-cost-bidder unix workstations on behalf of NASA in 91 or 92? We used NQS [1] to distribute the workload across the nodes and it worked pretty well. Well enough for NASA to retire the mainframe. The idea is hardly unique or novel - I believe DEC's clustering software allowed something similar? [2]

[1] https://gnqs.sourceforge.net/docs/papers/mnqs_papers/origina...

[2] https://www.parsec.com/wwwDocuments/ClusterLoadBalancing.pdf

A popular cross-platform proprietary tool for this is Control-M, though I haven't worked with it personally. There are now also batch systems with executors on k8s or Mesos, and, in a simplified sense, Apache Airflow is often used for this.