by thisisnotmyname 5122 days ago
From the description, this sounds quite over-engineered. You need a system to queue jobs and to automatically restart jobs that fail? I must be missing something here, can you help me figure out what it is?

This was my initial idea. However, consider this scenario. I need to boot a web stack (db master, pool of db slaves, webservers and a load balancer). I boot the master, the job succeeds; then I boot the slaves, the job succeeds. I boot the webservers, but by this time the master has shut down or become unavailable. With a job-based approach, the controller will never detect and fix this. The chosen approach detects the state of the environment, determines that the master is missing, and either corrects the situation or aborts the process altogether. Hope this makes sense!
Fundamentally, each of the tasks is a separate job. Why not just order the jobs in an array and start them in order? So if a job at some index needed to be restarted/started for some reason, you would first go through and start everything before it.

More generally, it looks like you just form a DAG to map out the dependencies and use it to figure out what to do. The daemon could then periodically traverse the DAG from the root to each leaf, starting jobs as required. Could you explain why this kind of approach was infeasible in your scenario?
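
To make the proposal concrete, here is a rough sketch of that periodic DAG walk. The role names come from the web stack example above; the `DEPENDENCIES` map and the `roles_to_start` helper are made up for illustration and are not from the project being discussed.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each role lists the roles it depends on.
DEPENDENCIES = {
    "db_master": [],
    "db_slave": ["db_master"],
    "webserver": ["db_slave"],
    "load_balancer": ["webserver"],
}


def roles_to_start(dependencies, running):
    """Return the roles that are not running, dependencies first."""
    # static_order() yields every node after all of its dependencies.
    order = TopologicalSorter(dependencies).static_order()
    return [role for role in order if role not in running]
```

A daemon would call `roles_to_start` on every tick with whatever it currently observes as running, and start the returned roles in order.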

Well, success of a job is transient, not permanent, in this case. Like I mentioned before, a successful boot of a dependency doesn't mean that the dependency still exists by the time we get to boot a host. You need to continuously check the status of your group to determine which actions can be taken for the current state. A stateful system that marks a job as complete upon successful execution wouldn't work. DAG computation does happen, but inside the MoreHostsCanBeBooted condition that is a prerequisite for the BootMoreHosts action. If what you're proposing is to re-run the same job until the target state transition has been achieved, and to code the job in an idempotent way, then this is essentially what's being done with the current approach, except there are no jobs, and idempotency is a side effect of not tracking progress. I hope my explanations make sense :)
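
A minimal sketch of the idea, not the actual code: only the names MoreHostsCanBeBooted and BootMoreHosts come from the description above. The `DEPENDS_ON` map, the `observe` and `boot` callbacks, and `reconcile_once` are assumptions made for illustration. Note that nothing is ever recorded as "done"; every pass starts from the observed state.

```python
# Hypothetical dependency map for the web stack example.
DEPENDS_ON = {
    "db_master": [],
    "db_slave": ["db_master"],
    "webserver": ["db_slave"],
    "load_balancer": ["webserver"],
}


def bootable_roles(observed):
    """Roles that are missing and whose dependencies are all up right now."""
    return [
        role
        for role, deps in DEPENDS_ON.items()
        if role not in observed and all(dep in observed for dep in deps)
    ]


def more_hosts_can_be_booted(observed):
    """Condition: is there anything we are allowed to boot in this state?"""
    return bool(bootable_roles(observed))


def boot_more_hosts(observed, boot):
    """Action: boot every role that the current state allows."""
    for role in bootable_roles(observed):
        boot(role)


def reconcile_once(observe, boot):
    """One controller pass: observe, check the condition, maybe act."""
    observed = observe()  # fresh look at the environment, no stored progress
    if more_hosts_can_be_booted(observed):
        boot_more_hosts(observed, boot)
        return True
    return False
```

If the master disappears after the webservers are up, the next pass simply sees that it is missing and boots it again, because no earlier success was remembered.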