| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ehsantn 1675 days ago

Dask's architecture is unable to handle large scale data processing tasks (e.g. ETL, dataframe operations) because it's just a distributed task scheduler. It works as long as tasks have very little communication across them. But data processing tasks like table join need heavy communication (e.g. shuffle) which requires a true parallel architecture like MPI to be done efficiently. Almost all of Dask's problems mentioned here go back this issue.

Bodo is a new compute engine that brings true parallel computing with MPI to data processing. Bodo is over 100x faster than Dask for large-scale data processing. Forget about speed, are you willing to pay AWS/Azure/GCP 100 times more than necessary (also increasing your carbon footprint)?

Bodo uses a new inferential JIT compiler technology which requires getting used to but it handles actual Pandas APIs (not "Pandas-like").

Disclaimer: I work for Bodo.