Hacker News (HN Mirror)


	by kremi 662 days ago
	Pandas has been working fine for me. The feature that keeps me on it is the multi-index (hierarchical index) [1], which can be used for columns too. I'm not sure how the cool new kids like polars or ibis would fare in that category. [1] https://pandas.pydata.org/docs/user_guide/advanced.html#adva...
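
For readers unfamiliar with the feature, here is a minimal sketch of a frame with hierarchical indexes on both rows and columns. All labels (`key`, `n`, `kind`, `field`, `raw`, `scaled`) are made up for illustration and do not come from the post.

```python
import numpy as np
import pandas as pd

# Hypothetical labels, chosen only for illustration.
rows = pd.MultiIndex.from_product([["x", "y"], [0, 1]], names=["key", "n"])
cols = pd.MultiIndex.from_product(
    [["raw", "scaled"], ["a", "b"]], names=["kind", "field"]
)
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Selecting a top-level column label returns the sub-frame beneath it.
raw = df["raw"]

# A single cell is addressed by one tuple per axis.
one = df.loc[("y", 1), ("scaled", "b")]
```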

3 comments

cpcloud 662 days ago

Multi-indexes definitely have their place. In fact, I got involved in pandas development in 2013 as part of some work I was doing in graduate school, and I was a heavy user of multi-indexed columns. I loved them.

Over time, and after working on a variety of use cases, I personally have come to believe the baggage introduced by these data structures wasn't worth it. Take a look at the indexing code in pandas, and the staggering complexity of what's possible to put inside square brackets and how to decipher its meaning. The maintenance cost alone is quite high.

We don't plan to ever support multi-indexed rows or columns in Ibis. I don't think we'd fare well _at all_ there, intentionally so.


kremi 662 days ago

> Take a look at the indexing code in pandas

As the end-user, not quite my concern.

> and the staggering complexity of what's possible to put inside square brackets and how to decipher its meaning

I might not be aware of everything that's possible -- my usage of it doesn't give me an impression of staggering complexity. In fact I've found the functionality quite basic, and have been using pd.MultiIndex.from_* quite extensively for anything slightly more advanced than selecting a bunch of values at some level of the index.
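
As an illustration of the usage described here, the sketch below builds an index with `pd.MultiIndex.from_product` (one of the `from_*` constructors) and selects values at one level. The level names `group` and `step` and the data are assumptions for the example, not the commenter's actual code.

```python
import numpy as np
import pandas as pd

# Hypothetical level names and labels.
index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2, 3]], names=["group", "step"]
)
df = pd.DataFrame({"value": np.arange(6)}, index=index)

# Select a bunch of values at one level of the index.
subset = df[df.index.get_level_values("step").isin([1, 3])]

# Cross-section: every row for one label at a given level.
# The selected level is dropped from the result by default.
group_b = df.xs("b", level="group")
```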


hansvm 662 days ago

> As the end-user, not quite my concern.

Complicated code is (probabilistically) slow, buggy, infrequently updated code. By all means, if it looks like a good enough tool for the job (especially if the alternatives don't) then use it anyway, but that's slightly different from it not being your concern.

I've seen enough projects need "surprise" major revisions because some team tried to sneak a dataframe into a 10M QPS service that my default is keeping pandas far away from anything close to a user-facing product.

I've also seen costs balloon as the data's scale grows beyond what pandas can handle, but basically all the alternatives suck for myriad reasons, so I don't try to push "not pandas" in the data backend. People can figure out what works for themselves, and I kind of like just writing it from scratch in a performant language when I personally hit that bottleneck.


jononor 662 days ago

I work a lot with IoT data, where basically everything is multi-variate time-series from multiple devices (at different physical locations and logical groupings). Pandas multi index is very nice for this, at least having time+space in the index.
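
A rough sketch of what a time+space index could look like for this kind of data. The device names, column names, and readings are invented for illustration; real IoT data would rarely be this regular.

```python
import numpy as np
import pandas as pd

# Hypothetical devices and hourly timestamps.
devices = ["dev-1", "dev-2"]
times = pd.date_range("2024-01-01", periods=4, freq=pd.Timedelta(hours=1))
index = pd.MultiIndex.from_product([devices, times], names=["device", "time"])

readings = pd.DataFrame(
    {
        "temperature": np.arange(8, dtype=float),
        "humidity": np.arange(8, dtype=float) * 10,
    },
    index=index,
)

# All readings for one device, indexed by time only.
dev2 = readings.loc["dev-2"]

# Aggregate per device across time.
per_device = readings.groupby(level="device").mean()

# One timestamp across all devices.
at_t = readings.xs(times[1], level="time")
```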


highfrequency 662 days ago

Is your workload mostly single-threaded? If so, is that due to dataset size, or machine core count?


kremi 662 days ago

Sorry, I don't know how to answer. I don't think what I do qualifies as a "workload".

I have a process that generates lots of data. I put it in a huge multi-indexed dataframe that luckily fits in RAM. I then slice out the part I need and pass it on to some computation (at which point the data usually becomes a numpy array or a torch tensor). Core-count is not really a concern as there's not much going on other than slicing in memory.

The main gain I get from this approach is prototyping velocity and flexibility. It is certainly sub-optimal in terms of performance.
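
The workflow described above might look roughly like the following sketch: build one multi-indexed frame in memory, slice out a part, and hand it to numerical code as a NumPy array. The level names (`run`, `split`, `sample`) and the data are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical levels; the real process and its labels are not described.
index = pd.MultiIndex.from_product(
    [["run-1", "run-2"], ["test", "train"], range(3)],
    names=["run", "split", "sample"],
)
df = pd.DataFrame(
    np.arange(24, dtype=float).reshape(12, 2), index=index, columns=["x", "y"]
)

# Sorting the index keeps label-based slicing fast and well defined.
df = df.sort_index()

# Slice out the part that is needed...
part = df.loc[("run-2", "train"), :]

# ...and pass it on as a plain array (torch.from_numpy(arr) would be the
# analogous step for a tensor, if torch is in use).
arr = part.to_numpy()
```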
