by chime
3581 days ago
I've been working on Databricks for a month now and it has been a unique and satisfying learning experience. I generated a billion n-grams (1 to 6 words each) from 300k docs, and thinking in RDDs and DataFrames reminds me of how I felt when I first learned functional programming and GPU coding. RDD flatMap solves problems I didn't even know I had. I ran a 'small' cluster of 24 nodes (300GB+ of RAM), and while I know I could have done the same thing with super-optimized code on a single machine, it is refreshing to think in pipelines instead of worrying about each node's performance. My goal was not to use or build an optimal tokenizer or stop-word remover, but to make sure every part of the chain could be done in parallel. My biggest complaint? Repeatedly having to restart the cluster because nodes stop responding or throw arcane errors. If these reliability fixes had been in place last week, I would easily have been 20-25% more efficient with my time. Can't wait until the Databricks team deploys this for their users.
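
A minimal sketch of the flatMap idea mentioned above, written in plain Python so it runs without a cluster. This is not the code from the job described in the comment: the names `ngrams` and `flat_map`, the whitespace tokenizer, and the two sample documents are all assumptions made for illustration.

```python
from itertools import chain


def ngrams(tokens, min_n=1, max_n=6):
    """Yield every contiguous n-gram, joined by spaces, for min_n <= n <= max_n."""
    for n in range(min_n, max_n + 1):
        # When n exceeds len(tokens) the range is empty, so nothing is emitted.
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def flat_map(func, items):
    """Local stand-in for RDD.flatMap: apply func to each item, then flatten."""
    return chain.from_iterable(func(item) for item in items)


# Hypothetical sample documents; the real input was roughly 300k docs.
docs = ["spark makes pipelines easy", "flatMap emits many rows"]

# Naive lowercase + whitespace tokenizer (an assumption, deliberately not optimal).
all_ngrams = list(flat_map(lambda doc: ngrams(doc.lower().split()), docs))
```

Each document expands into many output records independently of every other document, which is why this step parallelizes cleanly. On Spark, the same `ngrams` function could be passed to `RDD.flatMap`, for example `sc.textFile(path).flatMap(lambda line: ngrams(line.lower().split()))`, where `path` is a placeholder.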