by westurner 462 days ago
Fair benchmarks would justify merging aiopandas into pandas. Benchmark grid axes: aiopandas, pandas with dtype_backend="pyarrow", dask-cudf.
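A minimal sketch of what such a benchmark grid could look like, using only the standard library. The backend labels, row counts, and `placeholder_workload` are hypothetical stand-ins, not real aiopandas, pandas, or dask-cudf calls; a fair comparison would swap in an equivalent real operation for each backend.

```python
import itertools
import statistics
import time

# Hypothetical axis values. Each label would map to a real implementation
# (aiopandas, pandas with dtype_backend="pyarrow", dask-cudf) in practice.
BACKENDS = ["aiopandas", "pandas-pyarrow", "dask-cudf"]
ROW_COUNTS = [1_000, 10_000]


def placeholder_workload(backend: str, n_rows: int) -> int:
    """Stand-in workload that ignores the backend; replace with a real operation."""
    return sum(range(n_rows))


def run_grid(workload=placeholder_workload, repeats=3):
    """Time the workload for every (backend, n_rows) cell; return median seconds."""
    results = {}
    for backend, n_rows in itertools.product(BACKENDS, ROW_COUNTS):
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            workload(backend, n_rows)
            timings.append(time.perf_counter() - start)
        results[(backend, n_rows)] = statistics.median(timings)
    return results


if __name__ == "__main__":
    for (backend, n_rows), seconds in run_grid().items():
        print(f"{backend:>15} rows={n_rows:<7} median={seconds:.6f}s")
```

Reporting the median over several repeats reduces the effect of one-off noise such as warm-up or GC pauses; GPU backends would also need explicit synchronization before stopping the timer.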

pandas pyarrow docs: https://pandas.pydata.org/docs/dev/user_guide/pyarrow.html

/? async pyarrow: https://www.google.com/search?q=async+pyarrow

/? repo:apache/arrow async language:Python : https://github.com/search?q=repo%3Aapache%2Farrow+async+lang... :

test_flight_async.py https://github.com/apache/arrow/blob/main/python/pyarrow/tes...

pyarrow/src/arrow/python/async.h: https://github.com/apache/arrow/blob/main/python/pyarrow/src... : "Bind a Python callback to an arrow::Future."
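For intuition only, here is a pure-Python analogy of binding a callback to a future so that it can be awaited. This is not pyarrow's API and says nothing about how async.h is implemented; it uses `concurrent.futures.Future` as an assumed stand-in for arrow::Future, and `bind_callback` is a hypothetical helper name.

```python
import asyncio
import concurrent.futures


def bind_callback(cf_future, loop):
    """Return an asyncio.Future that resolves when cf_future completes."""
    aio_future = loop.create_future()

    def on_done(done):
        # May run in a worker thread, so hop back onto the event loop thread.
        def transfer():
            if aio_future.cancelled():
                return
            if done.cancelled():
                aio_future.cancel()
                return
            exc = done.exception()
            if exc is not None:
                aio_future.set_exception(exc)
            else:
                aio_future.set_result(done.result())

        loop.call_soon_threadsafe(transfer)

    cf_future.add_done_callback(on_done)
    return aio_future


async def main():
    loop = asyncio.get_running_loop()
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        cf_future = pool.submit(sum, range(10))  # blocking work off the loop
        return await bind_callback(cf_future, loop)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The standard library already provides `asyncio.wrap_future` for this exact bridge; the hand-written version just makes the callback step visible.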

--

dask-cudf: https://docs.rapids.ai/api/dask-cudf/stable/ :

> Neither Dask cuDF nor Dask DataFrame provide support for multi-GPU or multi-node execution on their own. You must also deploy a dask.distributed cluster to leverage multiple GPUs. We strongly recommend using Dask-CUDA to simplify the setup of the cluster, taking advantage of all features of the GPU and networking hardware.

cudf.pandas > FAQ > "When should I use cudf.pandas vs using the cuDF library directly?" https://docs.rapids.ai/api/cudf/stable/cudf_pandas/faq/#when... :

> cuDF implements a subset of the pandas API, while cudf.pandas will fall back automatically to pandas as needed.

> Can I use cudf.pandas with Dask or PySpark?

> [Not at this time, though you can switch the Dask DataFrame backend to e.g. cudf, which does not implement the full pandas DataFrame API]

--

dask.distributed docs > Asynchronous Operation; re Tornado or asyncio: https://distributed.dask.org/en/latest/asynchronous.html#asy...

--

tqdm.dask, tqdm.notebook: https://github.com/tqdm/tqdm#ipythonjupyter-integration

  import time
  from tqdm.notebook import trange

  for n in trange(10):  # renders a progress bar widget in the notebook
      time.sleep(1)

--

But then there are TPUs, instead of or in addition to async GPUs:

TensorFlow TPU docs: https://www.tensorflow.org/guide/tpu