by washedDeveloper
168 days ago
The org I work for develops HTCondor. We have a lot of scientists who end up running AlphaFold and other bio-related models on our pool of GPUs and CPUs. I am curious to know how and why your team implemented yet another job scheduler. HTCondor is agnostic to the software being run, so maybe there is more clever scheduling you can come up with. That being said, HTCondor also has pretty high flexibility with regard to policy.

Originally, my first instinct was to use Slurm or AWS Batch, but we started running into problems once we tried to go multi-cloud. We're also optimizing for being able to onboard an arbitrary codebase as fast as possible, so building a custom structure natively compatible with our containers (which are now automatically made from Linux machines with the relevant models deployed) has been helpful.