by thadguidry 2284 days ago
Our current architecture is here: https://github.com/OpenRefine/OpenRefine/wiki/Architecture

YES! We also want users to be able to process much larger data sets! We have started experimenting with Apache Spark on the backend, in the hope that it will let users work with much larger datasets. This work is funded by CZI, and you can read the grant proposal here: http://openrefine.org/blog/2019/11/14/czi-eoss.html

2 comments

A long time ago I built an extension that lets you take OpenRefine mappings and run them on a Hadoop cluster. The idea is that you run all your transformations on a small dataset on your local machine and, once you are satisfied with the mappings, deploy the same mappings on a Hadoop cluster. I have tested this on large datasets, and it works. Let me know how I can help. You can check my GitHub: https://github.com/rmalla1/OpenRefine-HD
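To make the "tune locally, then run the same mappings at scale" idea concrete, here is a rough Python sketch. It is not the extension's code and not OpenRefine's operation format: the `(column, function)` mapping list and the `apply_mappings` helper are hypothetical names made up for illustration. The point is only that the mappings are plain data applied row by row, so the same list can be run over a small local sample or streamed over a much larger input.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, str]
# Hypothetical, simplified mapping format: (column name, transform function).
Mapping = Tuple[str, Callable[[str], str]]


def apply_mappings(rows: Iterable[Row], mappings: List[Mapping]) -> Iterator[Row]:
    """Lazily apply every mapping, in order, to each row."""
    for row in rows:
        out = dict(row)  # copy so the input row is left untouched
        for column, transform in mappings:
            if column in out:
                out[column] = transform(out[column])
        yield out


# Mappings worked out interactively on a small sample.
mappings: List[Mapping] = [
    ("name", str.strip),
    ("name", str.title),
    ("country", str.upper),
]

sample = [{"name": "  ada lovelace ", "country": "uk"}]
result = list(apply_mappings(sample, mappings))
```

Because `apply_mappings` is a generator over any iterable of rows, the same call could sit inside a mapper running on each partition of a large dataset, assuming the transforms have no shared state.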
Before you go down the Spark route, consider that Perl/Unix tools may do this kind of thing quite well: https://livefreeordichotomize.com/2019/06/04/using_awk_and_r...
That author did not have Spark tuned well for the use case, which is a common issue with Spark. Since OpenRefine is commonly used with strings, we plan to optimize for that in many areas, such as a few mentioned here: https://databricks.com/glossary/spark-tuning But in general, there are always tradeoffs when trying to provide immediate feedback for interactions. Since OpenRefine has many interactive features, some will need to support batching and advise the user in the interface that things will take longer ("do you want to send to batch?"). Some of the tradeoffs and ways we plan to address them are mentioned in our general OpenRefine on Spark issue here: https://github.com/OpenRefine/OpenRefine/issues/1433
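As a rough illustration of the interactive-versus-batch tradeoff, here is a minimal Python sketch. It is not how OpenRefine decides anything: the row-count threshold, the `choose_mode` function, and the prompt text are all assumptions made up for the example.

```python
# Assumed cutoff above which an operation is too slow to feel interactive.
INTERACTIVE_ROW_LIMIT = 100_000


def choose_mode(row_count: int, limit: int = INTERACTIVE_ROW_LIMIT) -> str:
    """Return "interactive" for small jobs and "batch" for large ones."""
    return "interactive" if row_count <= limit else "batch"


def user_message(row_count: int) -> str:
    """Build the (hypothetical) text shown to the user before running."""
    if choose_mode(row_count) == "batch":
        return f"This will touch {row_count} rows and take longer. Send to batch?"
    return "Running now."
```

A real implementation would likely estimate cost from more than row count (operation type, column sizes, cluster state), but the shape of the decision is the same: estimate first, then ask the user before committing to a slow run.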