
by nas 3117 days ago

Pandas is useful, and I don't want to bad-mouth it, as people obviously get value from it. However, it has a complicated API and contains about 200k lines of code. So it is no surprise that documentation is a challenge and that there are a lot of Stack Overflow questions. For example, figuring out which methods return copies of the data and which return views is hard.
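To make the copy vs. view point concrete, here is a small sketch at the numpy level (the array library Pandas is built on). It only shows the underlying distinction; it is not a statement about what any particular Pandas method does. The variable names are just for illustration.

```python
import numpy as np

a = np.arange(6)        # array([0, 1, 2, 3, 4, 5])

basic = a[1:4]          # basic slicing: a view onto a's buffer
fancy = a[[1, 2, 3]]    # integer-array ("fancy") indexing: a fresh copy

# Both results hold the same values, but only one shares memory with a.
view_shares = np.shares_memory(a, basic)
copy_shares = np.shares_memory(a, fancy)

basic[0] = 100          # writes through to a
fancy[1] = -1           # leaves a untouched

print(view_shares, copy_shares)
print(a)
```

Two indexing expressions that look nearly identical behave differently on write, which is the kind of thing that is hard to guess from the API alone.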

Compare with dplyr. It solves a similar problem to Pandas but has a vastly simpler API. To be fair, Pandas does do more, but dplyr is also more flexible. I looked at implementing something like dplyr in Python, but you really need language support for lazy evaluation. dplyr makes extensive use of this feature of R. On the downside, it can be very confusing to new users, as this lazily evaluated code is hard to debug.
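For a sense of what that gap looks like: R can hand a function its argument expression unevaluated, while Python evaluates arguments before the call. One common workaround, sketched below, is to build the expression as an object through operator overloading and evaluate it later. This is only an approximation of the idea, not how R or any existing library does it; `Expr`, `col`, and `filter_rows` are hypothetical names made up for this example.

```python
import operator


class Expr:
    """A deferred expression: records what to compute and evaluates it later."""

    def __init__(self, fn, text):
        self._fn = fn      # callable taking a row (dict) and returning a value
        self.text = text   # readable form, handy when debugging

    def evaluate(self, row):
        return self._fn(row)

    def _binary(self, other, op, symbol):
        # Wrap plain Python values so both sides are Expr objects.
        if not isinstance(other, Expr):
            other = Expr(lambda row, v=other: v, repr(other))
        return Expr(
            lambda row: op(self.evaluate(row), other.evaluate(row)),
            f"({self.text} {symbol} {other.text})",
        )

    def __gt__(self, other):
        return self._binary(other, operator.gt, ">")

    def __lt__(self, other):
        return self._binary(other, operator.lt, "<")

    def __and__(self, other):
        return self._binary(other, operator.and_, "&")


def col(name):
    """Placeholder for a column; nothing is looked up until evaluate()."""
    return Expr(lambda row: row[name], name)


def filter_rows(rows, predicate):
    """Keep the rows for which the deferred predicate is true."""
    return [row for row in rows if predicate.evaluate(row)]


rows = [{"x": 1, "y": 10}, {"x": 3, "y": 20}, {"x": 5, "y": 30}]
pred = (col("x") > 2) & (col("y") < 30)   # built now, evaluated per row later
kept = filter_rows(rows, pred)
print(pred.text, kept)
```

Even in this tiny form the cost is visible: every operator has to be overloaded by hand, and an error surfaces inside `evaluate` rather than on the line where the expression was written.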

Rather than adopting Pandas to build our product, I built a very minimal version of it (on top of numpy) that only does what we need. That was some extra work, but I'm happy I did it, as we avoid this huge dependency. I understand quite well what my little minimal version does; it is only about 1000 lines of Python code and some tiny C extensions.