by snidane 914 days ago
This looks great!

Please consider removing any implicit network calls like the initial "Checking GitHub for updates...". That alone will keep people from adopting it or even trying it any further. This is similar to GNU parallel's --citation, which, albeit a small thing, will scare many people off.

Consider adding pivot and unpivot operations. Mlr gets it quite right with syntax, but is unusable since it doesn't work in streaming mode and tries to load everything into memory, despite claiming otherwise.
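
To make the streaming point concrete, here is a minimal sketch of an unpivot (wide to long) that handles one row at a time and never holds the whole file in memory. It is not how qsv or Mlr implement anything; the function name `unpivot` and the output column names `variable` and `value` are made up for illustration.

```python
import csv
import io


def unpivot(rows, id_cols, var_name="variable", value_name="value"):
    """Yield one long-format row per non-id column of each input row.

    Hypothetical helper: works on any iterable of dicts, so it can sit
    directly on top of csv.DictReader and stream arbitrarily large input.
    """
    for row in rows:
        ids = {c: row[c] for c in id_cols}
        for col, val in row.items():
            if col not in id_cols:
                yield {**ids, var_name: col, value_name: val}


sample = "id,jan,feb\n1,10,20\n2,30,40\n"
long_rows = list(unpivot(csv.DictReader(io.StringIO(sample)), ["id"]))
```

Unpivot streams naturally because each output row depends on a single input row. Pivot (long to wide) is the harder direction, since the full set of output columns is generally not known until the input has been scanned.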

Consider adding a basic summing command. Sum is the most common data operation, which could warrant its own special optimized command instead of offloading this to an external math processor like lua or python. Even better if this had a group by (-by) and window by (-over) capability. E.g. 'qsv sum col1,col2 -by col3,col4'. Brimdata's zq utility is the only one I know that does this quite right, but it is quite clunky to use.
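
For illustration, the semantics of the proposed 'qsv sum col1,col2 -by col3' could look roughly like the sketch below. The `qsv sum` command is only a suggestion in this comment, so this is not qsv code; `grouped_sum` is a hypothetical name, and the sketch assumes every summed cell parses as a number.

```python
import csv
import io
from collections import defaultdict


def grouped_sum(rows, sum_cols, by_cols):
    """Sum sum_cols per distinct combination of by_cols in one pass.

    Hypothetical helper: memory use grows with the number of groups,
    not with the number of input rows.
    """
    totals = defaultdict(lambda: [0.0] * len(sum_cols))
    for row in rows:
        key = tuple(row[c] for c in by_cols)
        acc = totals[key]
        for i, col in enumerate(sum_cols):
            acc[i] += float(row[col])  # assumes numeric cells
    return {key: tuple(acc) for key, acc in totals.items()}


sample = "col1,col2,col3\n1,10,a\n2,20,b\n3,30,a\n"
result = grouped_sum(csv.DictReader(io.StringIO(sample)), ["col1", "col2"], ["col3"])
# result == {("a",): (4.0, 40.0), ("b",): (2.0, 20.0)}
```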

Consider adding a laminate command. Essentially adding a new column with a constant. This probably could be achieved by a join with a file with a single row, but why not make this common operation easier to use.
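
As a sketch of what "laminate" means here (the name `laminate` and its parameters are illustrative, not an existing qsv command), adding a constant column is a one-line transformation per row:

```python
import csv
import io


def laminate(rows, column, constant):
    """Yield each row with one extra column holding the same constant value."""
    for row in rows:
        yield {**row, column: constant}


sample = "id,name\n1,x\n2,y\n"
laminated = list(laminate(csv.DictReader(io.StringIO(sample)), "source", "file_a"))
```

A typical use would be tagging every row with the name of the file it came from before concatenating several files.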

Consider the option to concatenate csv files with mismatched headers. cat rows or cat columns complains about the mismatch. One of the most common problems with handling csvs is schema evolution. I and many others would appreciate if we could merge similar csvs together easily.
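
One way to picture the requested behavior is a two-pass merge: first collect the union of all headers in order of first appearance, then write every row against that union and leave missing cells empty. The sketch below is an assumption about the desired semantics, not a description of any qsv command; `union_header` and `concat_rows` are hypothetical names, and inputs are passed as strings only to keep the example self-contained.

```python
import csv
import io


def union_header(texts):
    """Collect column names across all inputs, in order of first appearance."""
    header = []
    for text in texts:
        for name in csv.DictReader(io.StringIO(text)).fieldnames or []:
            if name not in header:
                header.append(name)
    return header


def concat_rows(texts, fill=""):
    """Concatenate CSV texts by column name, filling absent columns with fill."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=union_header(texts), restval=fill, lineterminator="\n"
    )
    writer.writeheader()
    for text in texts:
        writer.writerows(csv.DictReader(io.StringIO(text)))
    return out.getvalue()


old = "id,name\n1,x\n"
new = "id,score\n2,9\n"
merged = concat_rows([old, new])
# merged == "id,name,score\n1,x,\n2,,9\n"
```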

Conversions to and from other standard formats would be appreciated (parquet, ion, fixed-width, avro, etc.). Other compression formats as well, especially zstd.

It would be nice if the tool made it easy to embed the output of external commands. Built-in Lua and Python support is nice, but probably not sufficient. For example, I'd like to be able to run a jq command on a single column and merge the result back as another column.

Inspiration:

  - csvquote: https://news.ycombinator.com/item?id=31351393
  - teip: https://github.com/greymd/teip

You can get quite far by piping to other tools and/or using DSLs. pivoting can almost certainly be done by the luau support in qsv (or `vnl-filter`, for instance). Summing and grouping is something that `datamash` does well (or qsv luau probably, or `vnl-filter --eval`). Adding a column once again can be done with luau or `vnl-filter`.

Would you be more likely to use this tool if it had even more stuff in it requiring reading even more documentation? That's a genuine question.

Thanks for the detailed feedback @snidane!

As maintainer of qsv, here's my reply:

- Given qsv's rapid release cycle (173 releases over three years), the auto-update check is essential at the moment. Once we reach 1.0, I'll turn it off. For now, given your feedback, I've only made it check 10% of the time.

- Pivot is in the backlog and I'll be sure to add unpivot when I implement it. (https://github.com/jqnatividad/qsv/issues/799)

- I'll add a dedicated summing command with the group by (-by) and window by (-over) capability (https://github.com/jqnatividad/qsv/issues/1514). Do note that `stats` has basic sum as @ezequiel-garzon pointed out.

- With the `enum` command, qsv can achieve what you proposed with `laminate`. E.g. qsv enum --new-column newcol --constant newconstant mydata.csv --output laminated-data.csv

- With the cat rowskey command, qsv can already concatenate files with mismatched headers.

- As to other file formats, qsv supports parquet, csv, tsv, excel, ods, datapackage, sqlite and more (see https://github.com/jqnatividad/qsv/tree/master#file-formats). Fixed-width format is not supported yet, but it is quite interesting, so I have added it to the backlog (https://github.com/jqnatividad/qsv/issues/1515)

- as to "enable embedding outputs of commands", qsv is composable by design, so you can use standard stdin/stdout redirection/piping techniques to have it work with other CLI tools like jq, awk, etc.

Finally, just released v0.120.0 that already incorporates the less aggressive self-update check. https://github.com/jqnatividad/qsv/releases/tag/0.120.0

I know this is just one thing out of many, but sum is included in stats.
Wait, who is scared off by parallel's --citation?
I refuse to use parallel due to that obnoxiousness.

At minimum, it is not installed by default, so it is already a negative to just using xargs. That it then puts that barrier in my way makes it an easy tool to skip.

I just don't understand what barrier you are talking about. I just checked, it doesn't even whine at you when you use it, the help just notes that you should cite it if you publish a paper where you used it. And... anyone publishing papers knows about citation requirements lol. Anyone else can ignore it. What is this barrier?
It’s just so obnoxious is the main thing. Imagine citing every piece of software one uses for writing a paper — the citation list would be endless. C, Unix, Fortran… the list goes on. Why is this particular utility so different and special? Just such a jerk move and it makes me and many others aggravated at the arrogance of the author.
I am somewhat tickled at the thought of citing everything in a malicious compliance kind of way. Given a Nix environment, it should be possible to pull down a list of every bit of code that was used to construct the OS. Would we have to differentiate between installed vs executed code? My Latex environment probably has thousands of packages, though I might directly only include a handful of them. Even if I include a Latex package, it might not get executed.

The CITATION.cff format[0] is a newish format to solve the machine identification of citable works, but I suspect it is too new to see widespread adoption. It is going to take some backbreaking regexes to extract "How to Cite" sections embedded in READMEs and buried in the source.

[0] https://citation-file-format.github.io/

In addition to being annoying, it raises questions about whether it is free software or not. Some people care a whole lot about that. And some people have higher standards about being nagged. And lots and lots of time was spent discussing solutions, for instance: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=915541
Ah, I see, they have changed it (or possibly the version on my system has had the --will-cite patched out, as discussed in this bug).

Okay, I accept your argument about Free Software. However, I find it interesting that it's a GNU project... they are generally the most hardline Free Software people.

To slippery slope this, what happens if more tools start adopting this behavior? Curl now asks you to buy Daniel Stenberg a coffee on each use. Wget asks you to support Ukraine. Caddy wants you to invest in their startup. Each of which may come with their own `--ignore-annoyance-flag` I need to learn. The best I can do is vote with my feet.

I also do not care for the citation requirement. I utilize tons of tools in my work which go unstated. I do not feel the need to cite Linux, DNS, htop, Make, Diet Coke, my Kinesis keyboard, etc. Sadly, reliable plumbing gets no respect. Especially for a tool which is more or less interchangeable with some shell scripting. Unless I am trying to shore up the references list, I am going to cite directly relevant work.

At some point, you no longer need to note that your work was powered by electricity.

You paid for your Kinesis keyboard and your electricity. "Reliable plumbing gets no respect." And yet you find it offensive that they would ask for some academic cred in exchange for thousands of hours of development on a free tool. Sorry you're annoyed. If you wrote and maintained a popular free software package, you'd see that what it gets you is no respect, only entitled complainers who want you to do even more free work for them.
At some point when you are 11 years old, if you didn't think of it already for yourself, it is some parent or teacher's job to tell you to appreciate all the stuff that people have given to the world, which you now get to use. And that's the only time that lecture is appropriate and that's the only person it's appropriate from.

If it's, we'll say, within someone's rights to nag, it's exactly equally valid to object to it and to point out what kind of failing it is.

When I produce something that is free, it's free. If I want any more credit than simple copyleft already provides, then I'll charge for it.

Nagging the user is not unlike "you suck it up and tolerate the ads because that's how you pay for it instead of cash."

A directive to cite, or even a request especially when it's a nag in the tool rather than just somewhere in the docs or on the web site, is a string, a payment. It shows that the author does not actually understand the rationale behind free software. They want something back for their effort more than what they already got, which is the huge sea of developed software they get to use with no strings themselves.

Free software is a gift. It's about to be xmas. How many of the gifts you will give this weekend will you pair with a reminder to thank you for that awesome thing you gave them? The various recipients' moms have the job to tell them that, not you or me.

Vim has solicited donations for Uganda since forever.
I find it incredibly obnoxious and I refuse to use parallel because of it. To me, it violates the spirit of free software and tarnishes the GNU project. As someone who has released my source to the public for free, I couldn't fathom adding such a flag.

Bonus SO post to enhance your fury:

https://stackoverflow.com/questions/61762189/installing-gnu-...