by xg15 822 days ago
The most effective combination I've found so far is jq + basic shell tools.

I still think jq's syntax and data model is unbelievably elegant and powerful once you get the hang of it - but its "standard library" is unfortunately sorely lacking in many places and has some awkward design choices in others, which means that a lot of practical everyday tasks - such as aggregations or even just set membership - are a lot more complicated than they ought to be.

Luckily, what jq can do really well is bringing data of interest into a line-based text representation, which is ideal for all kinds of standard unix shell tools - so you can just use those to take over the parts of your pipeline that would be hard to do in "pure" jq.
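A minimal sketch of that hand-off, assuming a small made-up project list (the `name` and `stars` fields are hypothetical; only the `license.key` path comes from this thread). jq flattens each record to one tab-separated line, and awk does the per-license sum:

```shell
# Hypothetical sample data standing in for the real project list.
projects='[
  {"name": "a", "stars": 10, "license": {"key": "mit"}},
  {"name": "b", "stars": 5,  "license": {"key": "apache-2.0"}},
  {"name": "c", "stars": 7,  "license": {"key": "mit"}}
]'

# jq -r emits one raw tab-separated line per record;
# awk sums the second column grouped by the first.
stars_per_license=$(
  printf '%s\n' "$projects" \
    | jq -r '.[] | [.license.key, .stars] | @tsv' \
    | awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) printf "%s\t%d\n", k, sum[k] }' \
    | sort
)
printf '%s\n' "$stars_per_license"
```

The final `sort` is only there because awk's `for (k in sum)` iterates in no particular order.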

So I think my solution to the OP's task - get all distinct OSS licenses from the project list and count usages for each one - would be:

curl ... | jq '.[].license.key' | sort | uniq -c

That's it.
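For anyone who wants to try it without the network call, here is the same pipeline run against an inline stand-in for the `curl ...` response. The sample array is made up, including the entry with a null license:

```shell
# Hypothetical stand-in for the `curl ...` response: a JSON array of projects.
response='[
  {"license": {"key": "mit"}},
  {"license": {"key": "apache-2.0"}},
  {"license": {"key": "mit"}},
  {"license": null}
]'

# Same three stages as the one-liner: extract, group identical lines, count.
license_counts=$(
  printf '%s\n' "$response" \
    | jq '.[].license.key' \
    | sort \
    | uniq -c
)
printf '%s\n' "$license_counts"
```

Entries without a license show up as a `null` line, so they get counted as their own group rather than causing an error.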

> I still think jq's syntax and data model is unbelievably elegant and powerful once you get the hang of it - but its "standard library" is unfortunately sorely lacking in many places

After a few years of stalled development, jq has been taken over recently by a new team of maintainers and is rapidly working through a lot of longstanding issues (https://github.com/jqlang/jq), so I'm not sure if this is still the case

Wasn't aware of that, that's great to hear! I think if there is one utility that deserves a great maintainer team, it's this one. And if we saw some actual improvements in the future, that would be awesome!

I have a list of pet peeves that I'd really like to see fixed, so I'm gonna risk a bit of hope.

As an old Unix guy this is exactly how I see jq: a gateway to a fantastic library of text processing tools. I see a lot of complicated things done inside the language, which is a valid approach. But I don’t need it to be a programming language itself, just a transform to meet my next command after the pipe.

If I want logic beyond that, then I skip the shell and write “real” software.

I personally find those both to be more readable and easier to fit in my head than long complex jq expressions. But that’s completely subjective and others may find the jq expression language easier to read than shell or (choose your programming language).

Your comment made me go look up jq (even more than the article did) and the first paragraph of the repo [0] feels like a secret club's secret language.

I'm very interested, but not a Linux person, do you know of any good resources for learning the Linux shell as a programming language?

[0] https://jqlang.github.io/jq/

I’ll say, I did shell scripting for years from copy/paste, cribbing smarter people, and reading online guides. But I didn’t really understand until I read The Unix Programming Environment by Brian Kernighan and Rob Pike.

It’s a very old book and the audience was using dumb terminals. But it made me understand why and how. I think I’ve read every Kernighan book at this point, and most he was involved in, because he is just so amazing at not just conveying facts but teaching how to think idiomatically in the topic.

I also used awk for 2 decades, kind of like how I use jq now. But when I read his memoir I suddenly “got it.” What I make with it now is intentional and not just me banging on the keyboard until it works. A great middle ground for something a little sophisticated, but not worth writing a full program for.

Something else that helped me was to install a minimal distro… actually a base FreeBSD install would be great… and read the man pages for all the commands. I don’t remember the details, but I learned that things existed. I have many man pages that I look at the same options on every few months because I’m not positive I remember right. Heck, I ‘man test’ all the time still. (‘test’ and ‘[‘ are the same thing)
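A quick way to see the `test` / `[` equivalence for yourself (the path is just an example; any path works):

```shell
path=/   # example path that exists on any Unix-like system

# `[` is a command name just like `test`; the closing `]` is simply its last argument.
if test -e "$path"; then via_test=yes; else via_test=no; fi
if [ -e "$path" ]; then via_bracket=yes; else via_bracket=no; fi

printf 'test: %s, [: %s\n' "$via_test" "$via_bracket"
```

Both forms evaluate the same expression, which is why `man test` documents both.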

I also had an advantage of 2 great coworkers. They’d been working on Unix since the 80s and their feedback helped me be more efficient, clean, and avoid “useless use of cat” problems.

I also highly recommend using shellcheck. I sometimes disagree with it when I’m intentionally abusing shell behavior, but it’s a great way to train good habits and prevent bugs that only crop up with bad input, scale, etc. I get new devs to use it and it’s helped them “ramp up” quickly, with me explaining the “why” from time to time.
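One example of the "only crops up with bad input" kind of bug, as a small sketch. The filename and the `count_args` helper are made up for illustration; shellcheck reports the unquoted expansion as SC2086:

```shell
name='my report.txt'   # hypothetical filename containing a space

# Helper that just reports how many arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell splits the value on whitespace, so two arguments arrive.
# shellcheck disable=SC2086
unquoted=$(count_args $name)

# Quoted: the value stays a single argument.
quoted=$(count_args "$name")

printf 'unquoted=%s quoted=%s\n' "$unquoted" "$quoted"
```

With a filename that has no spaces both variants behave the same, which is why this tends to survive testing.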

But yeah. The biggest problem I see is that people think there is more syntax than there really is (like my test and [ comment). And remember it’s all text, processes, and files. Except when we pretend it’s not ;).

Really love your comment, so much that I wanted to check out the books you mentioned.

After searching z-lib for "The UNIX Programming Environment", all I found was a janky and grainy PDF. Then I searched archive.org and discovered this high fidelity PDF version:

https://archive.org/details/UnixProgrammingEnviornment

Note: Sadly, the EPUB version is 10x larger (370MB) and is corrupted, not able to be opened / viewed.

Much thanks for this. I've been struggling to grasp what's happening down in the engine room beneath me as I operate a modern Linux environment and only knew that there have been different waves of evolutions over the decades. I instead only see today's pretty deck up top, outside.

I'm only ten pages in, but that's mostly because the format and approach of this book is quickly yielding me tons of conversation fodder for ChatGPT4, where I've been endlessly asking it to clarify or point out holes in my mental model of everything I'm doing in the terminal, and still working out things such as why Backspace, Ctrl+H, Ctrl+/, Ctrl+?, Ctrl+-, and Ctrl+_ all together seem to have some overlap or differences in what's happening, depending on terminal contexts. I always had working notions of raw versus cooked sessions, the line discipline, etc., but I'm finding ways to play with the machinery with silly exercises that I otherwise couldn't have come up with.

For example, I just opened up three terminal windows:

  1) `man ascii`
  2) `nc -lvp 9001 | xxd -c1`
  3) `stty raw -echo; nc -nv 127.0.0.1 9001`
Then I reproduced the whole "Hex" column of the manpage, in order, using my keystrokes, and observed the multi-byte payloads of some of the other keys. I feel like a kid again :)

> The Unix Programming Environment

How does this compare to The Art of Unix Programming, if you've read both?

I don’t find that book to be very useful at all.

I’m kind of annoyed by the bait and switch of the title. It’s a play on Knuth’s classic but then turns into showing why Unix/Linux is better than Windows, etc.

As a disclaimer: I really don’t respect ESR and his work, and admire Brian Kernighan immensely. Very odd to be in a situation where those names are put side by side. Just want to call out that I do have bias on the people here. Don’t want to get into why as that’s not constructive.

I wasn't aware of the bait and switch at the time I read it, but I did really enjoy the history of how the Unix/Linux ethic came together and evolved over time. Had I heard of The Unix Programming Environment when I read it in 2014 I may have gone with that instead, as I was looking for something more along the lines of a technical handbook rather than a code of ethics.

So, grab yourself a Linux box (I suggest Debian), a large CSV file or JSON lines file you need to slice up, and an hour of time, and start trying out some bash one-liners on your data. Set some goals like "find the Yahoo email addresses in the data and sort by frequency" or "find error messages that look like X" or "find how many times Ben Franklin mentions his wife in his autobiography".

Here's the thing. These tools have been used since the '70s to slice, dice and filter log files, CSVs, or other semi-structured data. They can be chained together with the pipe operator. Sys admins were going through 100MB logs with these tools before CPUs hit the gigahertz range.

These tools are blisteringly fast, and they are basically installed on every Linux machine.
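As a rough sketch of the first goal (Yahoo addresses sorted by frequency), here is one way it could look. The sample rows are invented, and the regex is deliberately simplistic rather than a full email matcher:

```shell
# Hypothetical CSV rows; in practice this would be your large file.
sample='id,email
1,alice@yahoo.com
2,bob@example.org
3,alice@yahoo.com
4,carol@yahoo.com'

# grep -o prints only the matching part of each line;
# sort | uniq -c counts duplicates; sort -rn puts the most frequent first.
yahoo_by_frequency=$(
  printf '%s\n' "$sample" \
    | grep -oE '[[:alnum:]._%+-]+@yahoo\.com' \
    | sort \
    | uniq -c \
    | sort -rn
)
printf '%s\n' "$yahoo_by_frequency"
```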

https://github.com/onceupon/Bash-Oneliner

And for a different play-by-play example:

https://adamdrake.com/command-line-tools-can-be-235x-faster-...

>Your comment made me go look up jq (even more than the article did) and the first paragraph of the repo [0] feels like a secret club's secret language.

Or the open, standard language of one of the oldest and most widespread clubs in computing :)

"jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text."

Translation:

JQ is like a (UNIX/POSIX staple command line text-manipulation tool) but specialized for text structured in JSON format. You can use it to extract parts of a JSON document (slice), keep nodes based on some criteria (filter), transform each element in a list of structured data to get a new list with the transformed versions (map), and do that as easily as you can with the sed (basic command line text manipulation program), awk (command line text manipulation program with a full featured text-processing oriented language), grep (command line program to search for strings), and other assorted unix userland programs.
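To put the three verbs next to actual jq filters, here is a small sketch on made-up data (the `name` and `stars` fields are hypothetical):

```shell
# Hypothetical input: a JSON array of three objects.
data='[{"name":"a","stars":10},{"name":"b","stars":200},{"name":"c","stars":300}]'

# slice: take the first two elements of the array
sliced=$(printf '%s' "$data" | jq -c '.[0:2]')

# filter: keep only the elements that satisfy a condition
filtered=$(printf '%s' "$data" | jq -c 'map(select(.stars > 100))')

# map: transform each element into something new
mapped=$(printf '%s' "$data" | jq -c 'map(.name)')

printf '%s\n' "$sliced" "$filtered" "$mapped"
```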

I do the same thing all the time, lol

but have in the last year or so, tried to start writing things that will be read/used by other people in python or java

why? because most people don't have a clue how bash scripts work, but can read/debug python and java with ease, and both can be run as shell scripts too

the mac has sqlite installed by default as well, so there's a powerful combo available on every mac

but I totally do what you're doing all the time, and it is my default go to if I have to get something done fast :D

Driving that further - I don’t want to have to edit my query for minutes when I am in the shell. I don’t believe that the shell is conducive to complex SQL queries. You could write simpler SQL queries, but then you’re in a space where there are less verbose tools for those simpler tasks.

DuckDB seems stronger for someone who needs to create a scripting library - which also has lots of options and competition - or someone who has a very specific workflow of working with JSON dumps for a huge percentage of their time.

Your command line solution doesn't give quite the same result as OP. The final output in OP is sorted by the count field, but your command line incantation doesn't do that. One might respond that all you need to do is add a second "| sort" at the end, but that doesn't quite do it either. That will use string sorting instead of proper numeric sorting. In this example with only three output rows it's not an issue. But with larger amounts of data it will become a problem.
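The string-versus-numeric difference is easy to reproduce with a few made-up counts:

```shell
# Hypothetical counts, one per line.
counts='10
9
100'

# Plain sort compares character by character, so "100" lands before "9".
string_sorted=$(printf '%s\n' "$counts" | LC_ALL=C sort | paste -sd ' ' -)

# sort -n compares the numeric values instead.
numeric_sorted=$(printf '%s\n' "$counts" | sort -n | paste -sd ' ' -)

printf 'string:  %s\nnumeric: %s\n' "$string_sorted" "$numeric_sorted"
```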

Your fundamental point about the power of basic shell tools is still completely valid. But if I could attempt to summarize OP's point, I think it would be that SQL is more powerful than ad-hoc jq incantations. And in this case, I tend to agree with OP. I've made substantial use of jq and yq over the course of years, as well as other tools for CSVs and other data formats. But every time I reach for them I have to spend a lot of time hunting the docs for just the right syntax to attack my specific problem. I know jq's paradigm draws from functional programming concepts and I have plenty of personal experience with functional programming, but the syntax still feels very ad hoc and clunky.

Modern OLAP DB tools like duckdb, clickhouse, etc that provide really nice ways to get all kinds of data formats into and out of a SQL environment seem dramatically more powerful to me. Then when you add the power of all the basic shell tools on top of that, I think you get a much more powerful combination.

I like this example from the clickhouse-local documentation:

  $ ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' \
      | clickhouse-local --structure "user String, mem Float64" \
          --query "SELECT user, round(sum(mem), 2) as memTotal
            FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"

You can achieve that by appending sort -n, so the whole thing becomes:

curl ... | jq '.[].license.key' | sort | uniq -c | sort -n

You can even turn it back into json by exploiting the fact that when uniq -c gets lines of json as input, its output will be "accidentally" parseable as a sequence of json literals by jq, where every second literal is a count. You can use jq's (very weird) input function to transform each pair of literals into a "proper" json object:

curl ... | jq '.[].license.key' | sort | uniq -c | sort -n | jq '{"count":., "value":input}'
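Here is that last stage run on made-up input, so the pairing of count and value is visible without the `curl` call. The three keys are hypothetical:

```shell
# Hypothetical output of `jq '.[].license.key'`: one JSON string per line.
keys='"mit"
"apache-2.0"
"mit"'

# uniq -c emits "<count> <json string>" per line. jq reads that as a stream of
# alternating literals: `.` is the count, and `input` pulls the value after it.
as_json=$(
  printf '%s\n' "$keys" \
    | sort \
    | uniq -c \
    | sort -n \
    | jq -c '{"count": ., "value": input}'
)
printf '%s\n' "$as_json"
```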

> I think it would be that SQL is more powerful than ad-hoc jq incantations.

I don't disagree, but only because I have a lot of experience with SQL so I "think" more in those terms. The blog post made perfect sense to me, and I have to go to claude/openai/copilot for whatever I need in `jq`, EVERY TIME. Because I don't use it nearly as often, so I don't have its language internalized.

Readability (and I'd posit to the point of this, power) is more a function of the reader than the code, pg's "blub paradox" notwithstanding.

I didn't know anything about duckdb, but I might give this a shot.

For reference, sort(1) has -n, --numeric-sort: compare according to string numerical value.

> I still think jq's syntax and data model is unbelievably elegant and powerful once you get the hang of it [...]

It's basically just functional programming. (Or what you would get from a functional programmer given the task of writing such a tool as jq.)

That's not to diminish jq, it's a great tool. I love it!

found out recently that jq can url-encode values, is there anything it _can't_ do?

Finding out whether . is contained in a given array or not, evidently.

(That's not strictly true - you can do it, you just have to bend over backwards for what is essentially the "in" keyword in python, sql, etc. jq has no less than four functions that look like they should do that - in(), has(), contains() and inside(), yet they all do something slightly different)
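A small sketch of the difference, using one workaround among several: `any` with an explicit equality check for exact membership, next to `contains`, which does substring matching on strings. The list and the `is_allowed` helper are made up for illustration:

```shell
# Hypothetical list to test membership against.
allowed='["mit","apache-2.0"]'

# Exact membership: is some element of the array equal to $x?
is_allowed() {
  jq -n --arg x "$1" --argjson allowed "$allowed" 'any($allowed[]; . == $x)'
}

exact_hit=$(is_allowed "mit")
exact_miss=$(is_allowed "apache")

# contains() compares strings by substring, so a partial key still "matches".
substring_hit=$(jq -n --argjson allowed "$allowed" '$allowed | contains(["apache"])')

printf 'exact_hit=%s exact_miss=%s substring_hit=%s\n' \
  "$exact_hit" "$exact_miss" "$substring_hit"
```

Newer jq versions (1.6 and later) also ship an `IN` builtin that wraps the same `any` pattern.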

The Unix philosophy continues to pass the test of time.

Yes and no. Many UNIX philosophy proponents abhor powerful binaries like jq and awk.

Like any religion, there are zealots and "no true Scotsman" arguments aplenty.