Hacker News new | ask | show | jobs
by nerdponx 898 days ago
Not OP but R data.table + dplyr is an unbeatable combo for data processing. I handily worked with 1bn record time series data on a 2015 MBP.

The rest of the tidyverse stuff is OK (like forcats), but the overall ecosystem is a little weird. The focus on "tidy" data itself is nice up to a point, but sometimes you want to just move data around in imperative style without trying to figure out which "tidy verb" to use, or trying to learn yet another symbol interpolation / macro / nonstandard eval system, because they seem to have a new one every time I look.

Pandas is a real workhorse overall. Data.table is like a fast sports car with a very complicated engine, and Pandas is like a work van. It's a little of everything and not particularly excellent at anything and that's ok. Also its index/multiindex system is unique and powerful. But data.table always smoked it for single-process in-memory performance.

Until DuckDB and Polars, there was no Python equivalent of data.table at all. They're great when you want high performance, native Arrow (read: Parquet) support, and/or an interface that feels more like a programming library than a data processing tool. If you're coming from a programming background, or if you need to do some data processing or analytics inside of production system, those might be good choices. The Polars API will also feel very familiar to users of Spark SQL.

For geospatial data, Pandas is by far superior to all options due to GeoPandas and now SpatialPandas. There is an alpha-stage GeoPolars library but I have no idea who's working on it or how productive they will be.

If you had to learn one and only one, Pandas might still be the best option. Python is a much better general-purpose language than R, as much as I love R. And Pandas is probably the most flexible option. Its index system is idiosyncratic among its peers, but it's quite powerful once you get used to using it, and it enables some interesting performance optimization opportunities that help it scale up to data sets it otherwise wouldn't be able to handle. Pandas also has pretty good support for time series data, e.g. aggregating on monthly intervals. Pandas also has the most extensibility/customizability, with support for things like custom array back ends and custom data types. And its plotting methods can help make Matplotlib less verbose.

I've never gotten past "hello world" with Julia, not for lack of interest, but mostly for lack of time and need. I would be interested to hear about that comparison as well.

2 comments

At a previous job, I regularly worked with dfs of millions to hundreds of millions of rows, and many columns. It was not uncommon for the objects I was working with to use 100+ GB ram. I coded initially in Python, but moved to Julia when the performance issues became to painful (10+ minute operations in Python that took < 10s in Julia).

DataFrames.jl, DataFramesMeta.jl, and the rest of the ecosystem are outstanding. Very similar to pandas, and much ... much faster. If you are dealing with small (obviously subjective as to the definition of small) dfs of around 1000-10000 rows, sticking with pandas and python is fine. If you are dealing with large amounts of real world time series data, with missing values, with a need for data cleanup as well as analytics, it is very hard to beat Julia.

FWIW, I'm amazed by DuckDB, and have played with it. The DuckDB Julia connector gives you the best of both worlds. I don't need DuckDB at the moment (though I can see this changing), and use Julia for my large scale analytics. Python's regex support is fairly crappy, so my data extraction is done using Perl. Python is left for small scripts that don't need to process lots of information, and can fit within a single terminal window (due to its semantic space handicap).

That's a nice endorsement, I've always liked the idea of Julia as an R replacement. I'll definitely give that a shot when I have a chance.

Is there any kind of decent support for plotting with data frames? Or does Plots.jl work with it out of the box?

Plots.jl can work. You may also want to try https://makie.org/ and https://github.com/TidierOrg/TidierPlots.jl
Ha I like your description of pandas as a work van. I totally have that same feel for it. It's great because it works, not because it's great :)