Yep. Some similarity to TFDV too, but the UI here looks to be more or less lifted directly from Trifacta/Cloud Dataprep.
pro:
- Trifacta can be slow, and part of that might be the way it stores the data (I'm assuming JS data structures); if so, Pandas/Bamboolib could improve that.
con:
- Trifacta/Cloud Dataprep is directly integrated with Cloud Dataflow and can handle jobs that would crash Pandas.
Thank you for pointing out TFDV (TensorFlow Data Validation); I had not seen it before.
And yes, as I say in the video, we used the free version of Trifacta Wrangler to illustrate the vision of what we aspire to build. In the end it will of course look different, and we have some ideas for places where we imagine a completely different user interface. Whether that will be better or worse remains to be seen.
Thank you also for the comparison of Trifacta and pandas. I agree that pandas won't be able to handle arbitrarily large datasets. However, I wonder whether the manageable dataset size can be increased if we also work in the cloud on machines with more RAM. Or maybe we could even export Dask code instead of pandas code.
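To make the Dask idea a bit more concrete, here is a minimal sketch of how recorded UI steps could be rendered as either pandas or Dask source code. This is not how bamboolib works today; `export_code`, the step dictionaries, and the file names are all hypothetical and only illustrate the idea. The emitted calls (`read_csv`, `dropna`, `groupby(...).sum()`, `compute`) are ones that exist in both libraries or, in the case of `compute`, in Dask.

```python
# Hypothetical sketch: none of these names come from bamboolib.
# Recorded UI steps are rendered as either pandas or Dask source code.

HEADERS = {
    "pandas": ("import pandas as pd", "pd"),
    "dask": ("import dask.dataframe as dd", "dd"),
}


def export_code(path, steps, backend="pandas"):
    """Render the recorded steps as source code for the chosen backend."""
    import_line, alias = HEADERS[backend]
    lines = [import_line, f"df = {alias}.read_csv({path!r})"]
    for step in steps:
        if step["op"] == "drop_missing":
            lines.append(f"df = df.dropna(subset={step['columns']!r})")
        elif step["op"] == "sum_by":
            lines.append(
                f"df = df.groupby({step['by']!r})[{step['column']!r}].sum()"
            )
        else:
            raise ValueError(f"unknown step: {step['op']}")
    if backend == "dask":
        # Dask is lazy: nothing is read or computed until .compute() is called.
        lines.append("df = df.compute()")
    return "\n".join(lines)


# Assumed example steps and file names, purely for illustration.
steps = [
    {"op": "drop_missing", "columns": ["amount"]},
    {"op": "sum_by", "by": "city", "column": "amount"},
]
pandas_code = export_code("sales.csv", steps, backend="pandas")
dask_code = export_code("sales-*.csv", steps, backend="dask")
print(dask_code)
```

Because the two APIs overlap so much for simple steps, the same recorded steps produce nearly identical code. The Dask version differs mainly in the import, the option of reading many files through a glob pattern, and the final `compute()` call.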
So, you seem to have experience working with Trifacta Wrangler. Is there something that you don't love about their solution?