HN Mirror


by elijahbenizzy 1188 days ago

Congratulations! Really excited for you!

I love how you found a niche, valuable problem, built a framework, and are seeing a lot of success. A question (and I'm far from an expert so let me know if the assumptions are wrong):

It seems to me that the federated users have to coordinate on timing for this to work. Otherwise it could take weeks and lots of Slack messages to train a single model, e.g. one team is having infra issues and doesn't get a job started, the other team is ready but then their lead goes on vacation, etc. In the internal-to-an-organization case this is probably fine (e.g. a hospital where the data has to be separated by patient/cohort), but if different teams manage the data, then (a) have you seen this problem and (b) do you have tooling to fix it?

1 comment

danieljanes 1187 days ago

Thanks, we're excited too!

Flower tries to automate this as much as it can. In cases where multiple organizations are involved, the workload can run in a fully automated manner if that's fine for all organizations. If a review step is required, that can be integrated (either on the client side or on the server side) - the availability of reviewers will then become the bottleneck for end-to-end latency.

In the long run, we will evolve the permissioning system to allow workloads to be automatically executed if they fall within pre-approved boundaries, or require manual review if they don't. Pre-approved boundaries could, for example, be used to configure a particular combination of models and hyperparameter ranges that are OK to run without additional (manual) approvals.
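
To make the pre-approved-boundaries idea concrete, here is a minimal sketch in Python. It is not Flower's API: `ApprovalPolicy`, `Workload`, and `decide` are hypothetical names invented for illustration, and the rule that any unknown model or hyperparameter falls back to manual review is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalPolicy:
    """Hypothetical pre-approved boundaries (not part of Flower's API)."""

    allowed_models: frozenset[str]
    # Inclusive (low, high) range for each pre-approved hyperparameter.
    hyperparameter_ranges: dict[str, tuple[float, float]]


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a requested training workload."""

    model: str
    hyperparameters: dict[str, float]


def decide(policy: ApprovalPolicy, workload: Workload) -> str:
    """Return "auto_approve" if the workload is inside the boundaries,
    otherwise "manual_review"."""
    if workload.model not in policy.allowed_models:
        return "manual_review"
    for name, value in workload.hyperparameters.items():
        bounds = policy.hyperparameter_ranges.get(name)
        # Assumption: a hyperparameter with no approved range needs review.
        if bounds is None:
            return "manual_review"
        low, high = bounds
        if not (low <= value <= high):
            return "manual_review"
    return "auto_approve"
```

With a policy that allows one model and a learning-rate range, a request inside the range would be approved automatically, while a request with an out-of-range value or an unlisted model would be routed to a reviewer, which is where the reviewer-availability bottleneck mentioned above would come in.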

elijahbenizzy 1187 days ago

Awesome! Makes sense. I think the challenge is going to be coordinating with the various orchestration systems -- timeouts, etc. Excited to see how you pull it off!