by armitron 2511 days ago
Absolutely terrible. I can see why data science types who use / produce a lot of (typically throwaway) Python code might be enamored with this, but as far as the Python language goes, functorflow will be another nail in its coffin if it ever gains mass-acceptance. It will turn the already messy and inconsistent ecosystem into shambles that no sane person will want to build anything robust on top of.

With the mass migration of Python developers to Go, this is the last thing that Python needs.


> I can see why data science types who use / produce a lot of (typically throwaway) Python code might be enamored with this

I think that's honestly its killer app. I don't do data science, but I often use Jupyter to write short prototypes, make plots, and such.

I've been looking for something that lets me:

1. Set up a Jupyter notebook.

2. Run some tests or experiments, pulling in libraries as needed.

3. Ignore it for a few years.

4. Come back and run the same notebook without everything breaking.

If this could patch into Python's native import mechanism and let me specify the repo version as today's date, it'd be great.

    import functorflow
    functorflow.repo("2019-08-04")

    from ff.package_name.whatever import foo

If that gets you a replicable set of versions, it'd be pretty great.

Hmmm, that feels more like a Jupyter extension than a standalone module. The extension could intercept imports automatically without the need for a prefix. It might require a custom kernel that does some magical per-notebook lib mapping. Maybe it installs libs in version-specific central folders and then uses symlinks to build a notebook-specific lib folder based on the date specified. Feels more seamless but also more complicated.
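For what it's worth, the prefix-style import from the snippet above can be sketched with a meta path finder. This is only a rough illustration under loud assumptions: the `ff` prefix comes from the hypothetical example, `PrefixFinder` is a made-up name, and the sketch simply aliases `ff.<package>` to whatever is already installed. It does no date-based version resolution, which is the hard part.

```python
import importlib
import importlib.abc
import importlib.util
import sys

PREFIX = "ff"  # hypothetical namespace, taken from the example import above


class PrefixFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Hypothetical finder/loader: resolves `ff.<name>` to the installed `<name>`."""

    def find_spec(self, fullname, path=None, target=None):
        # Only claim the prefix itself and names underneath it.
        if fullname == PREFIX or fullname.startswith(PREFIX + "."):
            return importlib.util.spec_from_loader(fullname, self, is_package=True)
        return None

    def create_module(self, spec):
        return None  # let the import system create a plain placeholder module

    def exec_module(self, module):
        name = module.__name__
        if name == PREFIX:
            return  # the bare prefix is just an empty container package
        # Import the real module and put it in sys.modules under the prefixed
        # name; the import system returns whatever sys.modules holds afterwards.
        real = importlib.import_module(name[len(PREFIX) + 1:])
        sys.modules[name] = real


# Install the finder ahead of the default ones so it sees `ff.*` first.
sys.meta_path.insert(0, PrefixFinder())
```

With the finder installed, `from ff.json import dumps` hands back the standard library's `json.dumps`. A real tool would replace the `import_module` call with a lookup into a date-pinned package store.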
I'd actually implement it as a transparent venv and use pinned versions in a requirements.txt, with hashes, of course.

That'd make sense. You'd need to pin all transitive dependencies as well and cache the venv. Not sure if you'd need to find older versions of new direct dependencies to avoid conflicts. For example, you run the sheet in 2015 and then again in 2019 but with a new import. That new import's latest version has a transitive dependency that you already pinned, but it requires a different version than the one you pinned. I wonder if you can embed the requirements.txt inside the sheet itself.
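On embedding the pins in the sheet: a notebook file is JSON with a top-level `metadata` object that tolerates extra keys, so one possible sketch is to stash the pinned lines there. The key name `pinned_requirements` and both helper functions are assumptions made up for illustration, not anything Jupyter defines.

```python
import json

# Hypothetical metadata key; Jupyter itself attaches no meaning to it.
METADATA_KEY = "pinned_requirements"


def embed_requirements(notebook: dict, requirements: list) -> dict:
    """Return a copy of the notebook dict with pinned requirement lines stored."""
    updated = dict(notebook)
    metadata = dict(updated.get("metadata", {}))
    metadata[METADATA_KEY] = sorted(requirements)
    updated["metadata"] = metadata
    return updated


def read_requirements(notebook: dict) -> list:
    """Return the stored requirement lines, or an empty list if none exist."""
    return list(notebook.get("metadata", {}).get(METADATA_KEY, []))


# Usage on a minimal in-memory notebook; the package names are placeholders.
notebook = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 2}
pinned = embed_requirements(notebook, ["pkg_b==2.4", "pkg_a==1.2"])
round_tripped = json.loads(json.dumps(pinned))  # survives a save/load cycle
```

The same lines could be written out to a temporary requirements.txt before the kernel starts, so pip still does the actual installing.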

Only issue is that it'd be really space inefficient. My DS venvs clock in at 400+ MB each, so having one per sheet will probably become unusable quickly. Which is why I thought of some sort of smart system-wide caching akin to Maven/Ivy. But I'd forgotten how complicated Python dependencies (binaries, C code, etc.) were and how little API support pip had.

> You'd need to pin all transitive dependencies as well and cache the venv.

I think to do this stuff, you need hooks in Jupyter to set up and tear down the venv around the kernel run. (And generally Jupyter would want to clean up unused venvs to mitigate the teardown hook not firing.)
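As a stand-in for such hooks, the setup/teardown pairing can be sketched as a context manager around the standard library's `venv` module. The function name `temporary_venv` is hypothetical, and nothing here is wired into Jupyter; it only shows the lifecycle.

```python
import contextlib
import shutil
import tempfile
import venv
from pathlib import Path


@contextlib.contextmanager
def temporary_venv(with_pip=False):
    """Create a throwaway venv, yield its directory, and always remove it."""
    env_dir = Path(tempfile.mkdtemp(prefix="notebook-venv-"))
    try:
        # Setup hook: build the environment (without pip by default, to stay fast).
        venv.create(env_dir, with_pip=with_pip)
        yield env_dir
    finally:
        # Teardown hook: runs even if the body raises.
        shutil.rmtree(env_dir, ignore_errors=True)
```

If the process is killed outright, the `finally` block never runs, which is exactly the case where a separate sweep for stale directories would be needed.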

> Only issue is that it'd be really space inefficient

A venv is overkill. You can just run the kernel in a regular directory and `pip install --target kernel_dir foo-bar-pkg` to put packages directly in it. As long as the interpreter finds that directory on its import path, third-party libraries will work. This technique is used in the serverless project[2] to bundle dependencies for use on AWS Lambda.
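A minimal sketch of that approach, assuming the kernel process itself puts the directory on `sys.path`. The helper names are invented for illustration, and `install_into` needs network access, so it is defined but not called here.

```python
import subprocess
import sys
from pathlib import Path


def pip_target_command(target_dir, requirements):
    """Build the `pip install --target` command line for the given directory."""
    return [sys.executable, "-m", "pip", "install",
            "--target", str(target_dir), *requirements]


def install_into(target_dir, requirements):
    """Run pip so the packages land directly in target_dir (no venv involved)."""
    subprocess.run(pip_target_command(target_dir, requirements), check=True)


def activate(target_dir):
    """Put target_dir at the front of sys.path so its packages import first."""
    path = str(Path(target_dir).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    return path
```

Usage would be `install_into("kernel_dir", ["foo-bar-pkg"])` followed by `activate("kernel_dir")` before the notebook's own imports run.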

> Not sure if you'd need to find older versions of new direct dependencies to avoid conflicts.

Curation is a solution to this. Stackage[1] is popular in the Haskell community; they build a consistent version set of everything every night and curate stable releases periodically.

With curation, a date and a set of top-level packages is enough to pin your dependencies.
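To make that concrete, here is a small sketch of the lookup, assuming a curated index that maps snapshot dates to one consistent version set each. The `SNAPSHOTS` table, the package names, and the version numbers are all invented placeholders, not real Stackage or PyPI data.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical curated index: each snapshot is a mutually consistent version set.
SNAPSHOTS = {
    date(2019, 1, 1): {"pkg_a": "1.0", "pkg_b": "2.3", "pkg_c": "0.9"},
    date(2019, 7, 1): {"pkg_a": "1.2", "pkg_b": "2.4", "pkg_c": "0.9"},
}


def resolve(on, top_level):
    """Pin each top-level package using the latest snapshot on or before `on`."""
    days = sorted(SNAPSHOTS)
    index = bisect_right(days, on)
    if index == 0:
        raise LookupError(f"no snapshot on or before {on}")
    snapshot = SNAPSHOTS[days[index - 1]]
    # A KeyError here means the package was never part of the curated set.
    return [f"{name}=={snapshot[name]}" for name in top_level]


pins = resolve(date(2019, 8, 4), ["pkg_a", "pkg_b"])
```

Because every snapshot is consistent as a whole, transitive dependencies can be pinned from the same snapshot without a separate resolution step.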

[1]: https://github.com/commercialhaskell/stackage#frequently-ask...
[2]: https://github.com/UnitedIncome/serverless-python-requiremen...

Brother, those "data science types" are also people. Be respectful of other people.

Their comment seemed respectful to me. It was pointing out that some users of the platform have priorities other than maintainability and reproducibility.

I mean, data science is really the thing that people should be using Python for.

What? Python is arguably the go-to language in data (and other) science. With a massive ecosystem of mature, powerful libraries like NumPy and SciPy, you immediately and freely have access to a range of features, managed almost trivially with pip/virtualenv/conda. I don't think any other language has a larger number of performant ML libraries like TensorFlow and PyTorch, and more importantly, if you want source from the latest ML papers on arXiv, it will almost always be implemented in Python.

Then the language itself offers trivial iteration, possibly in real time with tools like Jupyter, and requires substantially less bootstrapping knowledge than something like C/C++. That creates a low barrier to entry for non-programmers, who are less likely to shoot themselves in the foot with memory management and such. And if you're concerned about performance, most of the standard data science packages are just wrapped C/C++ anyway.

Sure, it's not perfect, but it's practically the lingua franca of data science right now, and it fits the role quite well.

"I mean, data science is really the thing that people should be using Python for."

Did you read my comment?

I think your comment is coming across as sarcastic.