by vasinov 2142 days ago
This looks cool! A couple of questions:

1. Currently, if I install something in the notebook, does it get re-installed every time the pipeline is run? Is there any way to "snapshot" the state of the container?

2. Where is the data stored between the steps?

3. How well-integrated is it with AWS cloud primitives such as EC2 instances, EFS, and S3?

1 comments

Thanks!

1. Right now, additional dependencies for the container need to be re-installed whenever you run the pipeline. During a Jupyter kernel session, though, the container state and thus any installed dependencies remain available. We're working on supporting either container snapshots or custom container images (with the desired dependencies pre-installed). We'll likely go with snapshots, as they'll be easier from an end-user perspective.

2. During step execution, data is stored either inside the pipeline directory (which contains, for example, the .ipynb/.py/.R/.sh files) or in any of the mounted directories (through data sources).

When you run the pipeline as part of an experiment, a copy is created, so that any state generated by the steps inside the pipeline directory is isolated from the 'working copy' of the pipeline.
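To illustrate the isolation idea, here is a minimal sketch in plain Python. It is not Orchest's actual code: it assumes the thing being copied is the whole pipeline directory, and `run_isolated`, `demo`, and the file names are made up for the example.

```python
import shutil
import tempfile
from pathlib import Path


def run_isolated(pipeline_dir, step):
    """Copy pipeline_dir to a fresh location and run `step` against the copy.

    `step` is a hypothetical callable that receives the directory it may
    write to; it stands in for executing the pipeline's steps.
    """
    run_root = Path(tempfile.mkdtemp(prefix="experiment-run-"))
    run_dir = run_root / Path(pipeline_dir).name
    shutil.copytree(pipeline_dir, run_dir)  # snapshot of the working copy
    step(run_dir)                           # all writes land in the copy
    return run_dir


def demo():
    # Build a tiny "working copy" with a single step file.
    working = Path(tempfile.mkdtemp(prefix="working-copy-")) / "pipeline"
    working.mkdir()
    (working / "step.py").write_text("print('hello')\n")

    def write_output(directory):
        # A step that leaves state behind in its pipeline directory.
        (directory / "output.csv").write_text("a,b\n1,2\n")

    run_dir = run_isolated(working, write_output)
    return working, run_dir
```

After `demo()`, `output.csv` exists only in the run's copy, and the working copy still contains just `step.py`.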

Edit: forgot to mention that we support memory-based data transfer between steps, which is faster and doesn't "pollute" your pipeline directory. It does require your data to fit in memory, though. We use Apache Arrow's Plasma for this.
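For a feel of what memory-based transfer means, here is a rough sketch that uses Python's standard-library `multiprocessing.shared_memory` as a stand-in. It is not Plasma and not the Orchest SDK API: `put_output` and `get_input` are hypothetical names, and it assumes a POSIX system, where a shared memory segment outlives the handle that created it until it is unlinked.

```python
import pickle
from multiprocessing import shared_memory


def put_output(obj):
    """Producer step: serialize obj into a new shared memory segment.

    Returns (name, size) so a later step can locate and read the payload.
    """
    payload = pickle.dumps(obj)
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    shm.close()  # drop our handle; on POSIX the segment stays until unlink()
    return shm.name, len(payload)


def get_input(name, size):
    """Consumer step: read the payload back, then free the segment."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        # The segment may be rounded up to a page size, so slice to `size`.
        return pickle.loads(bytes(shm.buf[:size]))
    finally:
        shm.close()
        shm.unlink()  # release the memory once the data has been consumed
```

Nothing is written to the pipeline directory here: the only things passed between the steps are the segment name and the payload size, and the payload itself has to fit in RAM.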

3. AWS S3 and AWS Redshift are currently supported as data sources. Some light docs at https://orchest-sdk.readthedocs.io/en/latest/python.html#dat... (to be improved!) and the relevant SDK source (https://github.com/orchest/orchest-sdk/blob/master/python/or...). We should look into EFS. Do you have a use case in mind?