by elijahbenizzy 1188 days ago
Congratulations! Really excited for you!

I love how you found a niche, valuable problem, built a framework, and are seeing a lot of success. A question (and I'm far from an expert so let me know if the assumptions are wrong):

It seems to me that the federated users have to be coordinated around timing for this to work. Otherwise it could take weeks and lots of Slack messages for a single model to train. E.g., one team is having infra issues and doesn't get a job started, the other team is ready but then their lead goes on vacation, etc. In the internal-to-an-organization case this is probably fine (e.g., a hospital where the data has to be separated by patient/cohort), but if different teams manage the data, then (a) have you seen this problem and (b) do you have tooling to fix it?


Thanks, we're excited too!

Flower tries to automate this as much as it can. In cases where multiple organizations are involved, the workload can run in a fully automated manner if that's fine for all organizations. If a review step is required, that can be integrated (either on the client side or on the server side) - the availability of reviewers will then become the bottleneck for end-to-end latency.

In the long run, we will evolve the permissioning system to allow workloads to be executed automatically if they fall within pre-approved boundaries, or to require manual review if they don't. Pre-approved boundaries could, for example, be used to configure a particular combination of models and hyperparameter ranges that are OK to run without additional (manual) approvals.
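
To make the idea of pre-approved boundaries concrete, here is a rough sketch of what such a check could look like. It is purely illustrative: `ApprovalPolicy`, `needs_manual_review`, the model name, and the hyperparameter names are all hypothetical and not part of Flower's API. The assumption is that a policy lists allowed models plus an inclusive range per hyperparameter, and that anything outside those boundaries falls back to manual review.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ApprovalPolicy:
    """Hypothetical pre-approved boundaries (not a Flower API)."""

    allowed_models: frozenset[str]
    # Maps a hyperparameter name to an inclusive (low, high) range.
    hyperparameter_ranges: dict[str, tuple[float, float]] = field(default_factory=dict)


def needs_manual_review(
    policy: ApprovalPolicy, model: str, hyperparameters: dict[str, float]
) -> bool:
    """Return True if the workload falls outside the pre-approved boundaries."""
    if model not in policy.allowed_models:
        return True
    for name, value in hyperparameters.items():
        bounds = policy.hyperparameter_ranges.get(name)
        if bounds is None:
            # Unknown hyperparameters are never auto-approved.
            return True
        low, high = bounds
        if not (low <= value <= high):
            return True
    return False


# Example policy an organization might agree to in advance (made-up values).
policy = ApprovalPolicy(
    allowed_models=frozenset({"resnet18"}),
    hyperparameter_ranges={"learning_rate": (1e-4, 1e-2), "local_epochs": (1, 5)},
)

# Inside the boundaries: can run without a reviewer.
print(needs_manual_review(policy, "resnet18", {"learning_rate": 1e-3, "local_epochs": 3}))
# Learning rate outside the agreed range: goes to manual review.
print(needs_manual_review(policy, "resnet18", {"learning_rate": 0.5, "local_epochs": 3}))
```

The sketch defaults to review whenever something is not explicitly covered, so an unlisted model or hyperparameter never runs unattended. A real system would likely also need to cover things like data access scope and who signed off on the policy.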

Awesome! Makes sense. I think the challenge is going to be coordinating with the various orchestration systems -- timeouts, etc. Excited to see how you pull it off!