| HN Mirror



	by thadguidry 2284 days ago
	Our current architecture is described here: https://github.com/OpenRefine/OpenRefine/wiki/Architecture YES! We want users to be able to process much larger datasets too! We have started experimenting with Apache Spark on the backend, in the hope that it will help users with much larger datasets. This work is funded by CZI, and you can read the grant proposal here: http://openrefine.org/blog/2019/11/14/czi-eoss.html

2 comments

ratnakar007 2284 days ago

A long time ago, I built an extension that lets you take OpenRefine mappings and run them on a Hadoop cluster. The idea is that you run all the transformations on a small dataset on your local machine. Once you are satisfied with the mappings, you deploy the same mappings on a Hadoop cluster. I have tested this on large datasets, and it works. Let me know how I can help. You can check my GitHub: https://github.com/rmalla1/OpenRefine-HD
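
A minimal sketch of that workflow in Python, not the extension's actual code. The mapping format (a list of column/function pairs) and the names `MAPPINGS` and `apply_mappings` are hypothetical, chosen only to illustrate working out transformations on a small sample and then replaying them unchanged over a larger stream of rows:

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, str]
Mapping = Tuple[str, Callable[[str], str]]

# Hypothetical mappings, worked out interactively on a small local sample.
MAPPINGS: List[Mapping] = [
    ("name", str.strip),
    ("name", str.title),
    ("country", str.upper),
]


def apply_mappings(rows: Iterable[Row], mappings: List[Mapping]) -> Iterator[Row]:
    """Lazily apply each mapping, in order, to every row."""
    for row in rows:
        out = dict(row)  # copy so the input row is left untouched
        for column, fn in mappings:
            if column in out:
                out[column] = fn(out[column])
        yield out


# Step 1: check the mappings on a small sample.
sample = [{"name": "  ada lovelace ", "country": "gb"}]
cleaned_sample = list(apply_mappings(sample, MAPPINGS))
```

Because `apply_mappings` handles one row at a time and keeps no state between rows, the same function could in principle serve as the per-record step of a cluster job. How the real extension does this is not shown here.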


Chris2048 2284 days ago

Before you go down the Spark route, consider that Perl/Unix tools may do this kind of thing quite well: https://livefreeordichotomize.com/2019/06/04/using_awk_and_r...


thadguidry 2284 days ago

That author did not have Spark tuned well for the use case, which is a common issue with Spark. Since OpenRefine is commonly used with strings, we plan to optimize for that in many areas, such as a few of those mentioned here: https://databricks.com/glossary/spark-tuning But in general, there are always tradeoffs when trying to provide immediate feedback for interactions. Since OpenRefine has many interactive features, some will need to support batching and advise the user in the interface that things will take longer ("Do you want to send this to batch?"). Some of the tradeoffs, and the ways we plan to address them, are mentioned in our general OpenRefine on Spark issue here: https://github.com/OpenRefine/OpenRefine/issues/1433
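
One way to picture the interactive-versus-batch tradeoff is a simple dispatch on estimated cost. This is only an illustrative sketch: the row-count heuristic, the threshold value, and the names `Plan` and `choose_mode` are assumptions, not anything taken from the OpenRefine codebase or the linked issue.

```python
from dataclasses import dataclass

# Hypothetical cutoff: above this many rows, stop promising immediate feedback.
INTERACTIVE_ROW_LIMIT = 1_000_000


@dataclass
class Plan:
    mode: str     # "interactive" or "batch"
    message: str  # what the interface would tell the user


def choose_mode(estimated_rows: int, limit: int = INTERACTIVE_ROW_LIMIT) -> Plan:
    """Pick an execution mode from a rough estimate of the rows affected."""
    if estimated_rows <= limit:
        return Plan("interactive", "Running now.")
    return Plan("batch", "This will take longer. Send to batch?")
```

A real implementation would likely estimate cost from more than row count (for example, the kind of operation), but the shape of the decision would be similar.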
