| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ben509 2511 days ago

> I can see why data science types who use / produce a lot of (typically throwaway) Python code might be enamored with this

I think that's honestly its killer app. I don't do data science, but I often use jupyter to write short prototypes, and make plots and such.

I've been looking for something that lets me:

1. Set up a jupyter notebook.

2. Run some tests or experiments, pulling in libraries as needed.

3. Ignore it for a few years.

4. Come back and run the same notebook without everything breaking.

If this could patch into Python's native import mechanism and let me specify the repo version as todays date, it'd be great.

    import functorflow
    functorflow.repo("2019-08-04")

    from ff.package_name.whatever import foo

If that gets you a replicable set of versions, it'd be pretty great.

1 comments

marcinzm 2511 days ago

Hmmm, that feels more like a jupyter extension than a standalone module. The extension could intercept imports automatically without the need for a prefix. Might require a custom kernel that does some magical per notebook lib mapping. Maybe it installs libs in version specific central folders and then does some symlinks to build a notebook specific lib folder based on the date specified. Feels more seamless but also more complicated.

link

0az 2511 days ago

I'd actually implement it as a transparent venv and use pinned versions in a requirements.txt – with hashes, of course.

link

marcinzm 2511 days ago

That'd make sense. You'd need to pin all transitive dependencies as well and cache the venv. Not sure if you'd need to find older versions of new direct dependencies to avoid conflicts. For example, you run the sheet in 2015 and then again in 2019 but with a new import. That new import's latest version has a transitive dependency that you already pinned but has different version requirements than what you pinned. I wonder if you can embed the requirements.txt inside the sheet itself.

Only issue is that it'd be really space inefficient, my DS venvs clock in at 400+mb each so having one per sheet will probably quickly become unusable. Which is why I thought of some sort of smart system wide caching akin to maven/ivy. But I'd forgotten how complicated python dependencies (binaries, c code, etc.) were and how little api support pip had.

link

ben509 2510 days ago

> You'd need to pin all transitive dependencies as well and cache the venv.

I think to do this stuff, you need hooks in Jupyter to setup and teardown the venv before it runs the kernel. (And generally Jupyter would want to clean up unused venvs to mitigate the teardown hook not firing.)

> Only issue is that it'd be really space inefficient

A venv is overkill. You can just run the kernel in a regular directory, and `pip install --target kernel_dir foo-bar-pkg` to put packages directly in it. As long as the linker sees it, third-party libraries will work. This technique is used in the serverless project[1] to bundle dependencies for use on AWS Lambda.

> Not sure if you'd need to find older versions of new direct dependencies to avoid conflicts.

Curation is a solution to this. Stackage[1] is popular in the Haskell community; they build a consistent version set of everything every night and curate stable releases periodically.

With curation, a date and a set of top-level packages is enough to pin your dependencies.

[1]: https://github.com/commercialhaskell/stackage#frequently-ask... [2]: https://github.com/UnitedIncome/serverless-python-requiremen...

link