by aniketpanjwani 2396 days ago
This looks exciting! I'll play around with the tutorial and try to set up the AWS environment this weekend. I have several questions.

1. At what sort of scale does Metaflow become useful? Would you expect Metaflow to augment the productivity of a lone data scientist? Or is it more likely that you would need 3, 10, 25, or more data scientists before Metaflow becomes useful?

2. When you move to a new text editor, there are some initial frictions while you're trying to wrap your head around how things work. So, it can take some time before you become productive. Analogously, I imagine there are initial frictions when moving to Metaflow. In your experience, after Metaflow's environment has already been established, how long does it take for data scientists to get back to their initial productivity? It would be useful to have a sense of this for the data scientist who would want to sell their organization on adopting Metaflow.

3. Many data scientists work in organizations which have far less mature data infrastructure than Netflix, and/or data science needs of a much smaller scale than Netflix. In particular, I may not even have batch processing needs (e.g. a social scientist working on datasets which can be held entirely in memory). In that case, is Metaflow useful?

4. What's the closest open-source alternative to Metaflow on the market? Off the top of my head, I can't think of anything which quite matches.


1. Metaflow helps most when there is an element of collaboration, so a small to medium-sized team of data scientists. Collaborating with yourself is another scenario where Metaflow can be useful, since it takes care of versioning and archiving various artifacts.

2. Keeping the language Pythonic, with no need to learn an additional DSL, has definitely been key to Metaflow's adoption internally. That said, this is something we are open to hearing feedback on, especially with this OSS launch.

3. Yes, I definitely think so. Personally, my favorite part is the local prototyping experience, when everything fits in memory and is blazing fast. There is also an open issue for fast data access, which you can upvote if you are interested in seeing it open-sourced.

4. We don't think there is an exact equivalent either. :)

Re 4, aren't Kubeflow and Lyft's recently open-sourced "Flyte" pretty similar?

If you don't consider them basically equivalent, what would you say are the key differences?

Thanks for pinging on this.

re: Kubeflow - imho it is quite coupled to Kubernetes. We don’t intend to be tied to a specific compute substrate even though the first launch is with AWS. We do follow a plugin architecture - so I’m hoping Kube happens sometime.

re: Flyte - I’m less informed on this but happy to educate myself and get back.

There is a good overview of Flyte here: https://www.youtube.com/watch?v=KdUJGSP1h9U It does appear to be quite similar, though it has native k8s integration and a central web-based UI for monitoring jobs. Flyte asks the user to turn on caching. I like that Metaflow does that for you by default.

That's true of Kubeflow. I'm not sure that project will be as keen on being "compute substrate" agnostic as Metaflow is, given its connection with Google.

If you feel inclined, jump into the Flyte Slack and share your thoughts :). At my company we're on Kubeflow/Argo now, but things are developing quickly in this space, so I'm keen not to be myopic.

Thanks for sharing the context. Hopefully we can have a (fast) follow-up with Kube integration, depending on demand.

Can you specifically compare Metaflow to DVC and Databricks MLflow? Those seem to be some of the popular tools in this space right now.

re: 3. We have an optimized S3 client as part of this release - https://docs.metaflow.org/metaflow/data#data-in-s-3-metaflow...