
That’s interesting. We’ve developed a Kubernetes-based scheduler that we’ve found better accounts for our custom job priority needs, allows stricter data isolation between tenants, and gives us a production-grade control plane, though the core scheduling could certainly be implemented in something like HTCondor.

My first instinct was to use Slurm or AWS Batch, but we started having problems once we tried to go multi-cloud. We're also optimizing for being able to onboard an arbitrary codebase as fast as possible, so building a custom structure natively compatible with our containers (which are now automatically built from Linux machines with the relevant models deployed) has been helpful.