by thekelvinliu 2292 days ago
what do you do instead?
5 comments

Not OP, but I imagine he's bought into the Hadoop and other big data stories. Most data processing of large data sets can probably be done with standard UNIX tools.

There is nothing new under the sun, it's all just rebranded.

If you do "object manipulation" (columns, records, objects, etc.), then at least PowerShell offers some sanity.
Shameless plug, but this is what I wrote murex for. Murex is a "UNIX" shell that keeps enough similarities with POSIX that you can use it like a Bash REPL with (hopefully) minimal disruption, but it breaks from POSIX compatibility where it makes sense. One significant area where it does break compatibility is how it handles structured data formats like JSON, S-expressions, CSVs, and other tabulated data (to name a few).

It's designed from the ground up to support object manipulation while still retaining compatibility with the UNIX pipeline.

I do this with a suite of builtin tools that are aware of structured data formats (primarily because that information is passed down the pipeline as a data type), but it still falls back to a normal pipeline when forking an external executable.

https://github.com/lmorg/murex

Output data as JSON, manipulate and show it with jq
I hate jq with the power of a million suns. The whole point of text through pipelines is that data can be processed by tools that do not understand it. Formatting it in xml or json (there's no difference) breaks this beautiful orthogonality, and forces all the intermediate tools to deal with whatever the stupid markup du jour happens to be.
not all data comes from sources that you control and have chosen how to output.
https://github.com/kellyjonbrazil/jc can come in pretty handy there
oh my god, why?! just why!?

What the world needs is the inverse program of "jc", where an unparseable json string is expanded into a flat list of lines all of the form "field.subfield=value"
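
To make the idea concrete, here is a minimal sketch of such a flattener in Python. The function names `flatten` and `flatten_json` are made up for illustration, and the dotted-path output format is just one reasonable reading of "field.subfield=value"; it is not the output format of any existing tool.

```python
import json


def flatten(value, prefix=""):
    """Yield one 'path=value' line for every leaf of a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten(child, path)
    elif isinstance(value, list):
        # Array elements are addressed by their index (an assumption of this sketch).
        for index, child in enumerate(value):
            path = f"{prefix}.{index}" if prefix else str(index)
            yield from flatten(child, path)
    else:
        # Leaves are re-encoded as JSON so strings stay quoted and
        # null/true/false remain unambiguous.
        yield f"{prefix}={json.dumps(value)}"


def flatten_json(text):
    """Parse a JSON string and return its leaves as a flat list of lines."""
    return list(flatten(json.loads(text)))
```

For example, `flatten_json('{"user": {"name": "ann", "tags": ["a", "b"]}, "ok": true}')` gives the lines `user.name="ann"`, `user.tags.0="a"`, `user.tags.1="b"` and `ok=true`. Empty objects and arrays produce no lines in this sketch, and keys that themselves contain a dot would be ambiguous.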

If you search for "gron", you will find a family of such tools, e.g., https://github.com/tailhook/rust-gron.
Thanks, that's just what I needed!
ok well I can understand not wanting json when all you need is something simpler, but I'm not sure if I understand unparseable - I mean if it is JSON then it is parseable.
You can only parse json easily by using json libraries. Plain text, or "field=value" pairs, you can easily cut(1) or grep(1), or sed(1) to your pleasure. This is what I mean by parseable. Parseable trivially by tools that do not understand the format. I can also sed and awk json files, and I do, but it is extremely painful; and more often than not these files are nothing more than simple lists of variables with values, for which the use of json is a ridiculous overkill.
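
As a rough illustration of what "parseable trivially" means here, the sketch below pulls values out of "field=value" lines with nothing but string splitting, roughly what grep(1) piped into cut(1) would do. The `select` helper and the sample lines are hypothetical.

```python
def select(lines, prefix):
    """Return the values of every 'key=value' line whose key starts with prefix."""
    values = []
    for line in lines:
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.partition("=")
        if sep and key.startswith(prefix):
            values.append(value)
    return values


# Hypothetical sample input: a simple list of variables with values.
lines = [
    "user.name=ann",
    "user.shell=/bin/sh",
    "host.name=example",
    "not a key value line",
]

user_values = select(lines, "user.")
```

Here `user_values` ends up as `["ann", "/bin/sh"]`. No parser, schema, or library is involved, which is the whole point of the flat format.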
thanks!
We use Groovy or Kotlin with KScript.
Groovy is particularly nice I find. Completely cross platform and supports command line args similar to Perl to enable inline / pipe style processing easily, but if you are using Java in your back end you can throw any of your business logic in there too and use those.
Use any modern programming language.
The GNU tools are super fast and do the job. I've seen people spin up their own solutions, which end up being super slow and arguably take more time to develop.

The great thing about the GNU tools is that they will be around for the rest of your life and are usually installed by default on every system.

Yeah, except anything that is not trivial ends up being write-only
Modern programming language features are overrated. I prefer to choose the tools that are the most portable and involve the least social friction (i.e. something that is widely understood by people I work with).

For "fun" projects (stuff I'm not paid for) and workflow optimization, I just care about portability, which means C (with heavy use of the C stream library, a simple collections library of about 200 LOC, and occasionally POSIX syscalls) and shell scripting. After spending a lot of time learning languages as a hobby, I just don't believe the dark corners and warts of C and shell scripting are any worse than other languages.