| > At one point we attempted to migrate to Heroku Shield to address some of these issues, but we found that it wasn’t a good fit for our application. This part seems very hand-wavy, given that Heroku Shield would've solved many (all?) of their problems. > We were also running into limitations with Heroku on the compute side: some of our newer automation-based features involve running a large number of short-lived batch jobs, which doesn’t work well on Heroku (due to the relatively high cost of computing resources). How much memory did their batch jobs actually need? If they're using Rails, then I'm assuming they're just running a bunch of Sidekiq jobs that are querying PG. I'm surprised that they'd need that much in terms of compute resources. They should be able to get very, very far by making PG do a lot of the work, or by streaming data from PG and not holding a lot of data in memory. Even if they did need all this, the following two options seem WAY easier to manage: 1) Use dokku to run your super-intense Sidekiq batch jobs on beefy EC2 instances. You can still schedule them in your Rails app in Heroku, no big deal. Many engineering teams have to do this type of split-up anyway when it comes to Application Engineers and Data Engineers; this is just a simpler way to do it. 2) Similar to 1), use a different language runtime for the batch jobs. If you really need to run CPU-intensive jobs, why are you using Ruby? If the jobs aren't intense enough to justify maintaining two languages (fwiw, not that hard), why will moving to k8s solve the issue? Personally, I'm not sold on their decision to move to Kubernetes, and I use Kubernetes for my job. |
Author here; I don’t want to go into too much detail, but we tried Shield early on and had a negative experience that made us wary about using the platform (it seems to use a different tech stack under the hood from “normal” Heroku and lacks a lot of the things that make Heroku great). Also, it’s very expensive compared to VPC-based solutions on AWS and GCP.
W.R.T. the batch jobs, I think I didn’t explain this very well: we are using a different language and runtime from our “normal” background processing jobs (which use worker queues in Rails). It’s just that Heroku isn’t very well suited for the use case (which is basically FaaS-like but with long-lived jobs).
The “split” workflow you described is basically what we were doing (but with AWS Batch instead of Dokku); it’s just that it’s more cost-efficient to consolidate everything into one cluster (especially with preemptible GKE nodes), and it’s also better to have a common set of tooling for the Ops team.
To be fair, we haven’t yet completed the move from Batch to k8s, so it’s possible that part of the plan won’t pan out as expected.