by vergessenmir 1971 days ago
It may be an HPC problem, but I'm not sure the available solutions come close to k8s in terms of functionality, and I'm not talking about scheduling.

I used to work in HPC/Grid. It's been a while, but I do remember Condor being clunky, even though it had its uses.

And the commercial grid offerings couldn't scale to almost 10k nodes back then (I'm not sure about now, or whether they even exist anymore).

1 comment

Condor is clunky, but it is still in use in high-energy physics, for example (LHC CMS detector data processing).

For greenfield deployments, I would recommend HashiCorp's Nomad before Kubernetes or Condor if you intend to run ~1 container per server (bare metal with a light hypervisor for orchestration), but I would still steer you to Kubernetes for microservices and web-based cookie-cutter apps. (I know many finance shops using Nomad, but Cloudflare uses it with Consul, so there are no hard and fast rules.)

Disclosure: I worked in the HPC space managing a cluster for high-energy physics. I also use Nomad (the free version) for personal cluster workload scheduling.

I admit that Nomad is a fair middle ground due to its clean DSL and also because of the homogeneity of their workloads.

The team at OpenAI used the k8s API to build extensions around multi-tenancy (across teams) to saturate available allocations, along with task-specific scheduling modifications that were not supported by the k8s scheduler.

I don't know if Nomad has this extensibility. When I last looked, its plugins were limited to device plugins and task drivers.