|
|
|
|
|
by ehsantn
1675 days ago
|
|
Dask's architecture is unable to handle large scale data processing tasks (e.g. ETL, dataframe operations) because it's just a distributed task scheduler. It works as long as tasks have very little communication across them. But data processing tasks like table join need heavy communication (e.g. shuffle) which requires a true parallel architecture like MPI to be done efficiently. Almost all of Dask's problems mentioned here go back this issue. Bodo is a new compute engine that brings true parallel computing with MPI to data processing. Bodo is over 100x faster than Dask for large-scale data processing. Forget about speed, are you willing to pay AWS/Azure/GCP 100 times more than necessary (also increasing your carbon footprint)? Bodo uses a new inferential JIT compiler technology which requires getting used to but it handles actual Pandas APIs (not "Pandas-like"). Disclaimer: I work for Bodo. |
|