by foehrenwald 906 days ago
related: https://github.com/johnkerl/miller

I am wondering who really uses these tools, and for what, given that R and Python data science tools are available?

4 comments

For simple analyses (i.e. what most people do most of the time), doing this on the command line gets you there faster. I use vnlog (https://github.com/dkogan/vnlog/). By the time you have fired up your editor to write your Python code, I already have analyses and plots ready.
I write Python every day, but still use miller here and there. If I am doing a "simple" operation (eye of the beholder), being able to pipe it on the command line is great.

To do a comparable amount of manipulation in Python takes a lot more boilerplate (imports, command line arguments, deity-can-we-default-to-Int64 already?, etc.), plus you have to ensure you have a virtual environment with the correct dependencies. That is more or less the standard numpy+pandas, but a single executable tool to do some data workup is always appreciated.

I am never performance constrained. I have been told that miller is one of the slower tools in this space, but I still reach for it due to its wide format support.

Out-of-core computations. While your Python and R scripts will choke after reading a few hundred megs, my compiled binary CLI will keep streaming through many such files with memory usage sitting somewhere near zero.
That’s just the effect of streaming IO vs reading the whole file into memory all at once. It has nothing to do with the language you use, only with how you process the data.

I keep multiple little Python scripts around to do things like sum lists of numbers (think extracting a column with awk, then calculating a sum). Compiled vs an interpreted script really doesn’t matter. What matters is using the right algorithm for the job. R and Python data science libraries like to read in all of the data at once into one single data structure. That’s the anti-pattern to avoid if at all possible.

(But they are very handy for small datasets or complex calculations that require the entire dataset in memory.)
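
To make the streaming point concrete, here is a minimal sketch of the kind of little column-summing script described above. It is not the commenter's actual script: the function name `sum_column` and its `column` and `sep` parameters are made up for illustration. The only thing it is meant to show is that the input is consumed one line at a time, so memory use stays flat regardless of file size.

```python
from typing import Iterable, Optional


def sum_column(lines: Iterable[str], column: int = 0, sep: Optional[str] = None) -> float:
    """Sum one delimited column while holding only one line in memory.

    `lines` can be any iterable of strings, e.g. sys.stdin or an open file.
    `sep=None` splits on runs of whitespace, like awk's default.
    (Name and parameters are hypothetical, chosen for this sketch.)
    """
    total = 0.0
    for line in lines:  # one line at a time, never the whole input
        fields = line.split(sep)
        if column >= len(fields):
            continue  # blank or short line: nothing to add
        try:
            total += float(fields[column])
        except ValueError:
            continue  # header or other non-numeric cell: skip it
    return total
```

In a script you would pass `sys.stdin` as `lines` and print the result, so that something like `awk '{print $3}' data.txt | python sumcol.py` works (the file names here are placeholders). The same loop in any language gives the same near-constant memory profile, which is the point being made above.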

qsv is a fork of xsv — the latter hasn't been maintained in a while.