by westurner 2415 days ago
This looks excellent. The ability to generate the Python code for the pandas dataframe transformations looks to be more useful than OpenRefine, TBH.

How much work would it be to use Dask (and Dask-ML) as a backend?

I see the OneHotEncoder button. Have you considered integration with Yellowbrick? They've probably already implemented a few of your near-future and someday roadmap items involving hyperparameter selection, model selection, and visualization. https://www.scikit-yb.org/en/latest/

This video shows more of the advanced bamboolib features: https://youtu.be/I0a58h1OCcg

The live histogram rebinning looks useful. Recently I read about a 'shadowgram' / ~KDE approach with very many possible bin widths translucently overlaid in one chart. https://stats.stackexchange.com/questions/68999/how-to-smear...
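
To make the "many bin widths overlaid" idea concrete, here is a rough stdlib-only sketch. It is not the method from the linked answer and not anything bamboolib or Yellowbrick ships: it just computes a fixed-width histogram density for each of several bin widths and averages them on a fine grid, which is roughly what stacking translucent histograms shows visually. The function names, the sample data, and the bin widths are all made up for illustration.

```python
import math


def histogram_density(values, bin_width, lo, hi):
    """Density per bin for a fixed-width histogram whose first bin starts at lo."""
    n_bins = max(1, math.ceil((hi - lo) / bin_width))
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - lo) / bin_width), n_bins - 1)
        counts[idx] += 1
    return [c / (len(values) * bin_width) for c in counts]


def overlaid_density(values, bin_widths, grid_size=200):
    """Average the histogram densities of several bin widths on a fine grid."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / grid_size
    grid = [lo + (i + 0.5) * step for i in range(grid_size)]
    averaged = [0.0] * grid_size
    for width in bin_widths:
        dens = histogram_density(values, width, lo, hi)
        for i, x in enumerate(grid):
            idx = min(int((x - lo) / width), len(dens) - 1)
            averaged[i] += dens[idx] / len(bin_widths)
    return grid, averaged


# Hypothetical sample: a flat background plus a denser cluster between 3 and 5.
values = [x / 10 for x in range(100)] + [3 + x / 50 for x in range(100)]
grid, density = overlaid_density(values, bin_widths=[0.25, 0.5, 1.0, 2.0])
```

Plotting `density` against `grid` gives a stepped but smoother curve than any single histogram; for the actual translucent look you would instead draw each histogram with a low alpha in your plotting library of choice.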

Yellowbrick also has a bin width optimization visualization in yellowbrick.target.binning.BalancedBinningReference: https://www.scikit-yb.org/en/latest/api/target/binning.html

Great work.

3 comments

Thank you for your feedback and support :) Are you currently using OpenRefine?

We are currently thinking about supporting other dataframe libraries such as Dask or PySpark. However, we are unsure how to confirm that there is user demand before we implement it. It would not be a complete rewrite, but it would require some additional abstractions at some points in the library, and we would need to check whether some features would no longer be available. Would Dask support be a reason for you to buy?

Great hint about Yellowbrick, and yes, we are considering some of those features as well, if there is a useful place for them in the library.

In general, we are also thinking about ways for you to extend the library yourself, so that you can add your own analyses/charts of choice and have them come up again at the right point in time, in case that is useful.

In the past, I've looked at OpenRefine and Jupyter integration. Once I've learned to do data transformation with pandas and sklearn with code, I'll report back to you.

Pandas-profiling has a number of cool descriptive statistics features as well. https://github.com/pandas-profiling/pandas-profiling

There's a new IterativeImputer in Scikit-learn 0.22 that it'd be cool to see visualizations of. https://twitter.com/TedPetrou/status/1197150813707108352 https://scikit-learn.org/stable/modules/impute.html

A plugin model would be cool; though configuring the container every time wouldn't be fun. Some ideas about how we could create a desktop version of binderhub in order to launch REES-compatible environments on our own resources: https://github.com/westurner/nbhandler/issues/1

The UI is heavily inspired by the one from Trifacta/Cloud Dataprep. (i.e. histograms when selecting columns, brushing to start a transformation...)

I guess that makes it easy to get started with pandas (and learn about the pandas API). I wonder what some advanced transforms such as join/union/pivot will look like?

Yes, we looked at many different tools for inspiration and Trifacta was among them.

For join in action, you can watch this video: https://www.youtube.com/watch?v=r59Q19oCMr8&t=3s

We also support pivot and melt. About union: what do you have in mind here?
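
For readers who have not used these transforms in plain pandas, here is a minimal sketch of pivot, melt, and a row-wise "union". This is not bamboolib's generated code, the data is invented, and reading "union" as appending rows with `pd.concat` is an assumption about what the question meant.

```python
import pandas as pd

# Hypothetical example data: one row per (city, year) observation.
long_df = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Paris", "Paris"],
    "year": [2018, 2019, 2018, 2019],
    "sales": [10, 12, 7, 9],
})

# pivot: long -> wide, one row per city and one column per year.
wide_df = long_df.pivot(index="city", columns="year", values="sales")

# melt: wide -> long again, turning the year columns back into rows.
melted = wide_df.reset_index().melt(
    id_vars="city", var_name="year", value_name="sales"
)

# "union" (assumed meaning): append the rows of a second frame with the same columns.
more_rows = pd.DataFrame({"city": ["Rome"], "year": [2019], "sales": [5]})
union_df = pd.concat([long_df, more_rows], ignore_index=True)
```

If "union" is meant in the SQL sense of dropping duplicate rows, chaining `.drop_duplicates()` after the `pd.concat` call would be the usual follow-up.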

Dask has only a subset of Pandas available.

Could you send me a link to the docs where they say which parts of Pandas are not included? Would love to take a closer look at this.

The set difference and/or intersection of dir(pd.DataFrame) and dir(dask.dataframe.DataFrame), annotated with inspect.signature and inspect.getdoc, would be a useful document for either or both projects.
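
A rough sketch of that kind of introspected comparison, using only the standard library. `compare_api` and `describe` are hypothetical helper names, and the demo compares the built-in `list` and `tuple` as stand-ins so it runs without pandas or Dask installed.

```python
import inspect


def compare_api(left, right):
    """Set difference and intersection of the public names of two classes or modules."""
    left_names = {n for n in dir(left) if not n.startswith("_")}
    right_names = {n for n in dir(right) if not n.startswith("_")}
    return {
        "only_left": sorted(left_names - right_names),
        "only_right": sorted(right_names - left_names),
        "both": sorted(left_names & right_names),
    }


def describe(obj, name):
    """Return (signature, first docstring line) for obj.name, tolerating failures."""
    attr = getattr(obj, name, None)
    try:
        signature = str(inspect.signature(attr))
    except (TypeError, ValueError):
        # Properties and some C-implemented callables expose no signature.
        signature = "(signature unavailable)"
    doc = inspect.getdoc(attr) or ""
    first_line = doc.splitlines()[0] if doc else ""
    return signature, first_line


# Stand-in demo with built-ins; swap in the two DataFrame classes for the real comparison.
report = compare_api(list, tuple)
for name in report["only_left"]:
    signature, summary = describe(list, name)
    print(f"{name}{signature}: {summary}")
```

Assuming both libraries are installed, `compare_api(pd.DataFrame, dask.dataframe.DataFrame)` would give the raw name lists. A method being present in both only means the name exists; parameters can still differ, which is where the signature comparison comes in.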

pyfilemods generates a ReStructuredText document with introspected API comparisons. "Identify and compare Python file functions/methods and attributes from os, os.path, shutil, pathlib, and path.py" https://github.com/westurner/pyfilemods