Hacker News
by xaybey 3361 days ago
I usually use csvkit (https://csvkit.readthedocs.io/en/1.0.1/index.html). There are commands to list the columns, filter, browse the data in a somewhat formatted way with less. Typing the commands is a huge pain though, and I would be very interested in a tool that could instantly pop open and let me peruse. Excel can take minutes to load and eagerly does a lot of unhelpful formatting on things like dates and decimals.

That being said, I would never want to (and often legally cannot) ETL my data to some third party. That would be terribly slow. But I would happily pay for a nice desktop tool to do it for me, whether command line or GUI.

5 comments

At a former job, our embedded device logs were decoded to CSV. Some of them were too large for Excel.

Pandas handled them without a burp. Pandas in Jupyter (IPython Notebook) was a godsend.

There's a minimal amount of variable setup, but once you've done that once or twice it's easy.

Of course, any analysis or manipulation takes a bit of Python code, but I see that as a feature, not least because you can read it right there in the open instead of having to hunt for formulas in cells.
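
As a rough illustration of the workflow described above, here is a minimal sketch that reads a CSV in chunks with pandas and keeps only the rows of interest. The column names (timestamp, sensor, value) and the inline sample data are assumptions made up for the example; a real log file path would replace the in-memory buffer.

```python
import io
import pandas as pd

# Stand-in for a decoded device log (hypothetical columns);
# a real file path would go here instead.
LOG_CSV = io.StringIO(
    "timestamp,sensor,value\n"
    "2017-01-01 00:00:00,temp,21.5\n"
    "2017-01-01 00:00:01,temp,21.7\n"
    "2017-01-01 00:00:01,volt,3.29\n"
    "2017-01-01 00:00:02,temp,35.2\n"
    "2017-01-01 00:00:02,volt,3.31\n"
)

def load_filtered(source, sensor, chunksize=2):
    """Read the CSV in chunks and keep only the rows for one sensor."""
    kept = []
    for chunk in pd.read_csv(source, parse_dates=["timestamp"], chunksize=chunksize):
        kept.append(chunk[chunk["sensor"] == sensor])
    return pd.concat(kept, ignore_index=True)

temp = load_filtered(LOG_CSV, "temp")
print(temp)
print(temp["value"].describe()[["count", "mean", "max"]])
```

Filtering chunk by chunk means only the matching rows are held in memory at once, which is one way a file too large for Excel can stay manageable.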

Same here. I load the data using pandas or by parsing the rows by hand in Go. It's interesting to watch your RAM fill up while the data is loaded.
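
If the goal is an aggregate rather than the full table, the rows can be streamed so that memory stays roughly constant. The sketch below uses Python's standard csv module; the column names and sample data are hypothetical, and this is not the commenter's own code.

```python
import csv
import io
from collections import defaultdict

def column_means(lines, key_col, value_col):
    """Stream rows once, keeping only a running sum and count per key."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        sums[row[key_col]] += float(row[value_col])
        counts[row[key_col]] += 1
    return {key: sums[key] / counts[key] for key in sums}

# In-memory stand-in for a large file; open("big.csv", newline="") would also work.
sample = io.StringIO("sensor,value\ntemp,20\nvolt,3\ntemp,22\nvolt,5\n")
means = column_means(sample, "sensor", "value")
print(means)
```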

Looks like this tool is for non-programmers. It's interesting to see that there seems to be a market here.

I've been prototyping simple desktop GUI tools on top of dask/pandas and PyQt that let you lazily load large CSVs (and other types supported by pandas) and interactively filter based on smart histograms (the per column histograms are fully interactive and provide crossfiltering across the attributes):

http://imgur.com/a/vfAmV

The idea is to map a lot of the basic functionality of dataframes onto simple GUI interactions (for example, changing column types, stacking and unstacking columns, pivoting) and couple that with an IPython console for more complicated data manipulation. And then maybe even adding Tableau-like charting functionality:

http://imgur.com/a/z8d1w

This is meant for quick throwaway exploration and analysis. It can easily handle about a million rows using just generic pandas and a bit of memory. There are lots of cool database techniques that can also be used on small local data (for example, compressed bitmaps using EWAHBool for interactive filtering).
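
To make the crossfiltering idea concrete, here is a simplified sketch in plain pandas and NumPy, not the tool's actual implementation. It applies range selections to some columns and recomputes every column's histogram from the surviving rows. The column names, bin edges, and data are invented for the example, and a real crossfilter would typically leave each column's own selection out of that column's histogram.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with two numeric attributes.
df = pd.DataFrame({
    "age":    [23, 35, 41, 29, 52, 38],
    "income": [30, 55, 72, 41, 90, 60],
})

def crossfilter(frame, selections, bins):
    """Return per-column histogram counts after applying range selections.

    selections maps column -> (low, high), both inclusive.
    bins maps column -> list of bin edges.
    """
    mask = pd.Series(True, index=frame.index)
    for col, (low, high) in selections.items():
        mask &= frame[col].between(low, high)
    selected = frame[mask]
    return {
        col: np.histogram(selected[col], bins=edges)[0].tolist()
        for col, edges in bins.items()
    }

bins = {"age": [20, 30, 40, 50, 60], "income": [0, 50, 100]}
counts = crossfilter(df, {"age": (30, 45)}, bins)
print(counts)
```

Selecting ages 30 to 45 leaves three rows, so the income histogram now reflects only those rows. An interactive GUI would rerun this step on every brush or drag event.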

Do you plan to release something soon? Even just a way to visualize the rows by loading them lazily would be a huge improvement. I personally use pandas, but some of my colleagues are not familiar with it, and it pains me when they try to inspect a large dataset by opening it in Excel on our small university-provided desktops instead of spending a couple of minutes writing a few lines of Python to extract what they need.
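
The "few lines of Python" could look something like the sketch below, which reads only a small window of rows from the file and never loads the rest. The helper name peek_rows and the sample data are made up for illustration.

```python
import csv
import io
from itertools import islice

def peek_rows(lines, start=0, count=5):
    """Return the header and `count` data rows starting at data row `start`."""
    reader = csv.reader(lines)
    header = next(reader)
    return header, list(islice(reader, start, start + count))

# In-memory stand-in; with a real file use open("data.csv", newline="").
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n4,d\n")
header, rows = peek_rows(sample, start=1, count=2)
print(header)
print(rows)
```

Because the reader is consumed lazily, rows past the requested window are never parsed.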

It sounds like CSV Explorer might work well for them.

I regularly test applications that generate big CSV reports. As I don't always have influence on the input data (we want to test on real datasets pulled from production servers of our partners), I fall back to defining constraints that must be satisfied. I ensure that they remain satisfied by analysing the CSV files with PowerShell (of all the tools).

I find the magic of "import-csv bla.csv | where some condition | select some existing or calculated value | group | format-list | out-file output.txt" extremely helpful. It's like SQL for CSV files.

And the ability to query live services and parse JSON, and cross-check with other reports, or merge multiple reports into one... It's indispensable.
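
For readers more at home in Python than PowerShell, the same filter, calculate, group shape can be sketched with pandas. This is an analogue under assumed column names (region, orders, revenue) and invented data, not a translation of any real report. The final assertion stands in for the kind of constraint check described above.

```python
import io
import pandas as pd

# Hypothetical report; a real one would come from pd.read_csv("report.csv").
report = pd.read_csv(io.StringIO(
    "region,orders,revenue\n"
    "north,10,250.0\n"
    "south,0,0.0\n"
    "north,5,100.0\n"
    "east,4,120.0\n"
))

result = (
    report[report["orders"] > 0]                              # where: some condition
    .assign(avg_order=lambda d: d["revenue"] / d["orders"])   # select: calculated value
    .groupby("region", as_index=False)                        # group
    .agg(orders=("orders", "sum"), avg_order=("avg_order", "mean"))
)

# Constraint that must hold for every group.
assert (result["avg_order"] > 0).all()
print(result.to_string(index=False))
```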

shuf -n 10000 foo.csv > foo-sample.csv; open foo-sample.csv

That should give you a 10k-line random sample to play with, one that opens quite quickly in Excel.
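
One caveat with sampling raw lines is that the header row is treated like any other line, so it usually will not end up at the top of the sample. A minimal Python sketch that keeps the header and samples the data rows with reservoir sampling (Algorithm R) in a single pass could look like this. The function name and the generated test data are assumptions for the example.

```python
import csv
import io
import random

def sample_csv(lines, k, seed=None):
    """Return (header, rows) with up to k data rows chosen uniformly at random."""
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader)
    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return header, reservoir

# In-memory stand-in for foo.csv with 100 data rows.
data = io.StringIO("id,value\n" + "".join(f"{i},{i * i}\n" for i in range(100)))
header, rows = sample_csv(data, k=10, seed=42)
print(header)
print(len(rows))
```

The sample can then be written out with csv.writer, header first, to get a small file that opens quickly in a spreadsheet.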

Csvkit is great, and I use it a lot. The target users for CSV Explorer, though, are mostly non-engineers, i.e., people who don't use the command line or code.

Kibana has quite a few rough edges, but it takes non-technical users very far in terms of data exploration. A local setup also alleviates privacy concerns. The bummer is import/export: an 'Upload CSV' feature was almost introduced in the 5.0 release but was removed in the end [1], and CSV export has been asked for many times [2] but is still not there. So a user-friendly fork might be a worthy business idea.

[1]: https://github.com/elastic/kibana/pull/8497
[2]: https://github.com/elastic/kibana/issues/1992