by almostgotcaught 754 days ago
> Usually, I'd cast my arrays into a pandas DF

I promise I mean no offense by this, but this is so comically absurd. Like, you know it's not a cast, right? I.e., that you're constructing pandas dataframes.

> How should I reason about the tradeoff of using something like this vs pandas/numpy ?

For small sizes, operations on native types will be faster than the construction of complex objects.


Also, my gripe with DFs is that they aren't typed (typing module) by column. Maybe that's changed though? It's been a while.

The only way to understand what's going on with DF code is to step it in a debugger. I know they can be much faster, but man you pay a maintainability price!

This is incorrect: each column in a pandas DF can have a separate type (what you're asking for is compatibility with Python's type-hinting on a per-column basis, though, which is different), and you can understand the code without needing a debugger: I use pandas regularly and I've never needed to use a debugger on pandas.
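
To make the dtype vs. type-hint distinction concrete, here is a minimal sketch. The column names and the `total_score` function are made up for illustration; only the pandas calls are real.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical example data: two columns with different runtime dtypes.
df = pd.DataFrame({"count": [1, 2, 3], "score": [0.5, 1.5, 2.5]})

# Each column is a Series carrying its own dtype at runtime.
print(df.dtypes)
print(is_integer_dtype(df["count"]), is_float_dtype(df["score"]))


def total_score(frame: pd.DataFrame) -> float:
    # The annotation only promises "some DataFrame". A static type checker
    # cannot tell from it whether a "score" column exists or what dtype it has.
    return float(frame["score"].sum())


print(total_score(df))
```

So the columns are typed, but that typing lives in the data at runtime, not in anything a type checker sees from the function signature.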

(Sure, it's easy to write obfuscated pandas, and it sometimes has version-specific bugs or deprecations which need to be hacked around in a way that compromises readability, and sometimes the API has active changes/namings that are non-trivial. But that's miles from "only way to understand is with a debugger". If you want to claim otherwise, post a counterexample on SO (or Codidact) and post the link here.)

Yeah, that's what I meant. I would like per column type-hinting so that data frames are type-checked along with the rest of our stuff and everything is explicit.

I don't have anything I can show because the stuff I was working on was commercial and I don't code Pandas for fun at home ;)

The code I was maintaining / updating had long pipelines, had lots of folding, and would drift in and out of numpy quite a bit.

Part of the issue was my unfamiliarity with Pandas, for sure. But if I just picked a random function in the code, I would have no idea as to the shape of the data flowing in and out, without reading up and down the callstack to see what columns are in play.

Breakpoint and then look at the data, every time!
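
One low-tech way to make the shape visible at function boundaries, short of per-column type hints, is to check the expected columns on entry. This is only a sketch: `require_columns` and the column names are hypothetical, and it uses NumPy's one-letter dtype kind codes (`"i"`/`"u"` for integers, `"f"` for floats).

```python
import pandas as pd


def require_columns(df, expected):
    """Return df unchanged, or raise if a column is missing or has the wrong dtype kind.

    `expected` maps column name -> string of accepted dtype kind codes.
    """
    for name, kinds in expected.items():
        if name not in df.columns:
            raise KeyError(f"missing column: {name}")
        kind = df[name].dtype.kind
        if kind not in kinds:
            raise TypeError(f"column {name!r} has dtype kind {kind!r}, expected one of {kinds!r}")
    return df


def mean_score(df):
    # The contract is stated where the function starts, not somewhere up the call stack.
    df = require_columns(df, {"count": "iu", "score": "f"})
    return float(df["score"].mean())


frame = pd.DataFrame({"count": [1, 2], "score": [1.0, 3.0]})
print(mean_score(frame))
```

It doesn't give you static checking, but a reader picking a random function can at least see which columns are in play.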

For type-hinting on dataframe-like objects, people recommend pandera [0].

> The code I was maintaining / updating had long pipelines, had lots of folding, and would drift in and out of numpy quite a bit.

(Protein folding?)

Anyway, yeah, if your codebase is a large proprietary pipeline that thunks to and from pandas-numpy, then now I understand you. But that's your very specific use case. The claim "The only way to understand what's going on with DF code is to step it in a debugger" is, in general, an overstatement.

[0]: https://pandera.readthedocs.io/en/stable/

They effectively are since each column is a series, which is typed.
I happen to know a book or two that might help with Pandas.

(Disclaimer: I wrote three of them and spend a good deal of my time helping others level up their Pandas. Spent this morning helping a medical AI company with Pandas.)

No offense taken.

My tasks aren't usually bottlenecked by the df creation operation. To me, the convenience offered by dfs outstrips the compute hit. However, if this is an order of magnitude difference, then it would push me to adopt the more-itertools formulation.

> However, if this is an order of magnitude difference, then it would push me to adopt the more-itertools formulation.

My friend, it's much worse than a single order of magnitude for small inputs:

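    import time
    import pandas as pd

    ls = list(range(10))

    # Pure Python: filter the odd values with a list comprehension.
    b = time.monotonic_ns()
    odds = [v for v in ls if v % 2]
    e = time.monotonic_ns() - b
    print(f"{e=}")

    # pandas: construct the DataFrame, then mask it.
    # Note: indexing with a boolean DataFrame keeps the shape and puts NaN
    # where the mask is False; it does not drop the even rows.
    bb = time.monotonic_ns()
    df = pd.DataFrame(ls)
    odds = df[df % 2 == 1]
    ee = time.monotonic_ns() - bb
    print(f"{ee=}")
    print("ratio", ee/e)

    >>> e=1166
    >>> ee=656792
    >>> ratio 563.2864493996569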
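
A single-shot timing like the one above also picks up first-call overhead, so here is a rough sketch of the same comparison using `timeit` with repeats. Assumptions: it uses a `Series` instead of a `DataFrame` so the result really is the filtered odds, `best_ns` is a made-up helper, and pandas is imported defensively so the pure-Python half still runs without it. Numbers will vary by machine and pandas version.

```python
import timeit

try:
    import pandas as pd
except ImportError:  # the list-comprehension timing still runs without pandas
    pd = None


def best_ns(fn, number=1_000, repeat=5):
    """Best per-call time in nanoseconds over `repeat` batches of `number` calls."""
    totals = timeit.Timer(fn).repeat(repeat=repeat, number=number)
    return min(totals) / number * 1e9


ls = list(range(10))


def odds_list():
    return [v for v in ls if v % 2]


def odds_pandas():
    s = pd.Series(ls)  # construction cost is included on purpose
    return s[s % 2 == 1].tolist()


list_ns = best_ns(odds_list)
print(f"list comprehension: {list_ns:.0f} ns per call")

if pd is not None:
    pandas_ns = best_ns(odds_pandas, number=200)
    print(f"pandas Series:      {pandas_ns:.0f} ns per call")
    print(f"ratio: {pandas_ns / list_ns:.0f}x")
```

Taking the minimum over several batches filters out most of the noise; the gap for a 10-element input should still be large, since the fixed cost of building the pandas object dominates.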

My experience is also that numpy and pandas can add 1-2 seconds to Python startup time (which is terrible for the testing experience).
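
If you want to check the startup cost on your own machine, a rough sketch is to time the import in a fresh interpreter and subtract the bare interpreter startup. `import_seconds` is a hypothetical helper; the demo uses the stdlib `json` module so it runs anywhere, and you would swap in `"pandas"` or `"numpy"` if they are installed.

```python
import subprocess
import sys
import time


def import_seconds(module):
    """Wall-clock seconds for a fresh interpreter to start and import `module`."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
    return time.perf_counter() - start


# "sys" is always already loaded, so this approximates bare interpreter startup.
baseline = import_seconds("sys")
with_json = import_seconds("json")  # replace "json" with "pandas" to test the claim

print(f"interpreter startup: {baseline:.3f}s")
print(f"startup + import:    {with_json:.3f}s")
print(f"approx. import cost: {with_json - baseline:.3f}s")
```

For a per-module breakdown, `python -X importtime -c "import pandas"` prints the time spent in each imported module to stderr.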