And the idea here is that you have a large parallelizeable problem but you don't need the Hadoop ecosystem? To put it another way, why not just use EMR?
This. As someone who works on bioinformatics workflows one of the more difficult aspects of my job is to try to explain to other software folks why we do what we do. While scheduling is a solved problem the issue is that you're dealing with a ton of command line tools expecting POSIX filesystems and often not involving parallelization
Quite a lot of life science work is stand-alone programs that are domain-specific and read and write flat-files.