Hacker News
by rhelz 795 days ago
> Like, no human is going to read 50k rows, much less 10m rows.

Well, it's 2AM, some dork has checked in code which breaks production, and it absolutely positively has to be fixed by 6:00am before the customer comes in.

Your bleary eyes are scanning through log files and data files, trying to find the answer...

... believe me, you will appreciate human-readable formats for both of those. You just want to cat out the entries in the db which the new code can't handle... the last thing you want to do is to have to invoke some other tool or write some other script to make the data human readable.

And when you find the problem, you will want to just be able to edit a text file containing test cases to verify the fix.

You don't want to write some script to generate and insert the data... at 2am, you are likely to write a buggy script which may keep you from realizing that you've already fixed the problem... or worse, indicate that you have fixed the problem when you haven't.

Fewer moving parts is always better.
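
To make the "cat out the entries the new code can't handle" step concrete, here is a minimal sketch, assuming the data is one record per line of plain text. The `parse_record` function is a hypothetical stand-in for "the new code"; the real parser and record layout are not given in this thread.

```python
def unhandled_lines(lines, parse):
    """Yield (line_number, line) for every line that `parse` rejects."""
    for number, line in enumerate(lines, start=1):
        try:
            parse(line)
        except ValueError:
            yield number, line.rstrip("\n")


def parse_record(line):
    # Hypothetical parser: expects exactly "name,integer" on each line.
    # Both a wrong field count and a non-integer count raise ValueError.
    name, count = line.rstrip("\n").split(",")
    return name, int(count)
```

Pointing it at an open text file, for example `for number, line in unhandled_lines(open(path), parse_record): print(number, line)`, lists the offending rows with their line numbers, which can then be fixed or copied into a test file with any editor.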

> Well, it's 2AM, some dork has checked in code which breaks production, and it absolutely positively has to be fixed by 6:00am before the customer comes in.

This is a classic XY problem. The issue isn't the data format, it's the fact that your organizational processes allow random code pushes at 2am that can break the whole thing.

Parquet, used by basically everyone, isn't human readable (and for good reason): it's for big data storage, retrieval, and processing. CSV is human readable (and for good reason): people use that data in Excel or other spreadsheeting software.

> This is a classic XY problem. The issue isn't the data format, it's the fact that your organizational processes allow random code pushes at 2am that can break the whole thing.

I feel like your comment is a nitpick. Crap getting broken for whatever reason happens. Having human readable things can be helpful for development or fixing things.

Of course, this isn't always the top priority: other things, like being able to round-trip non-human readable data, or performance, or data density, may win.

This is a classic "I don't have a catchy term for it" problem, where someone focusses on the details of a given contrived example and thinks the problem and the solution are all solved by addressing the details of that specific contrived example.

The 2am story above is not the problem. Thinking up something that would have avoided that specific story like "have 3 people in shift rotation instead of just yourself, and then you never have a 2am problem", or "don't push code at 2am" is not a solution to the problem.

The value of simple data formats that can be directly read by a human without any special tools is that it makes the data accessible to reading, analyzing, even processing or modifying, in all of the unknown unknowable infinite possible situations.

You don't know ahead of time that you will one day solve a problem deep in the trenches by being able to read or maybe even modify a file or the stdin in a CGI before some crazy untouchable special app picks it up. You don't know ahead of time that the platform will not have a db client you can use to access the data. But you do know that everything can process text: even an obscure CPU with no gcc or git or any of your usual nice toys has some sort of shell and some sort of text editor.

You can't touch the main app which is some legacy mainframe banking thing or something, you don't have and can't install or compile anything, but you can still read the data and see that there is some Unicode character scattered all through it, and you can set up a dirty hack stream edit to convert it to a single-byte ASCII replacement, using nothing but plain POSIX sh or some equivalent, no matter what the platform.

And that ugly hack is nine thousand times more useful to the bosses and to yourself than not being able to see what was wrong with the data, and then only being able to say "the other side is sending us bad data, it will be broken until they fix their end, or until we can modify the crazy untouchable thing on our end, because I'm a helpless useless twat".
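
For illustration only, a rough sketch of that kind of stream edit. It is written in Python rather than plain sh purely to keep the example self-contained, and the replacement table is an assumption: the comment above does not say which character was the culprit.

```python
import sys

# Hypothetical replacement table; the actual offending character is unknown.
REPLACEMENTS = {
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u00a0": " ",  # no-break space
}


def to_ascii(line, fallback="?"):
    """Swap known offenders for ASCII, then replace anything still non-ASCII."""
    out = []
    for ch in line:
        ch = REPLACEMENTS.get(ch, ch)
        out.append(ch if ord(ch) < 128 else fallback)
    return "".join(out)


def filter_stream(src, dst):
    """Copy src to dst line by line, converting each line to ASCII."""
    for line in src:
        dst.write(to_ascii(line))
```

Calling `filter_stream(sys.stdin, sys.stdout)` turns it into a filter that can sit in front of the untouchable app. Every replacement is exactly one character, so line lengths are preserved for fixed-width records.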

You can't predict ahead of time exactly when or why or how you will end up wanting to be able to access the data without the normal proper tools or apps from the happy path. But it's a fact that it happens, and having the option is more useful than not having the option. And being the person who can solve a problem is more useful than being the person who can't do anything any other way except the normal expected way.

No one said this trumps all other considerations for all jobs for all data, just that it's very valuable, a consideration among other considerations, and you can't predict all the specific ways in which it is valuable, and so giving it up has to be necessary, not thoughtless.

I’ve never been frustrated at 2am that my data in sqlite3 or Postgres isn’t in a human readable disk format.

If I’m working with parquet I’ll have duckdb on hand for fiddling parquet files. I’m much better at SQL at 2 am than I am at piping Unix tools together over N files.

I have no idea how I’d drop bad rows from this thing with a bash pipeline anyways, I need to select from one file to find the bad line numbers (grep I guess, I’ll need to look up how to cut just the line number), and then delete those lines from all the files in a zip (??). Sounds a lot harder than a single SELECT WHERE NOT or DELETE WHERE.
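
A minimal sketch of the two routes being compared here, assuming a made-up table with an `id` column and an `amount` column where an empty string marks a bad row. The table name, column names, and data are hypothetical, and SQLite stands in for whatever database is actually in use.

```python
import sqlite3

# Hypothetical sample data; none of these names come from the thread.
ROWS = [(1, "10.50"), (2, ""), (3, "7.25"), (4, "")]


def drop_bad_rows_sql(rows):
    """Database route: a single DELETE ... WHERE removes the bad rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (id INTEGER, amount TEXT)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.execute("DELETE FROM payments WHERE amount = ''")
    remaining = conn.execute(
        "SELECT id, amount FROM payments ORDER BY id"
    ).fetchall()
    conn.close()
    return remaining


def drop_bad_rows_columns(columns, key, is_bad):
    """Text route: find the bad line numbers in one column file, then
    drop those same line numbers from every column file."""
    bad = {i for i, value in enumerate(columns[key]) if is_bad(value)}
    return {
        name: [v for i, v in enumerate(values) if i not in bad]
        for name, values in columns.items()
    }
```

The text route is not much code either, but it only works if every column file stays line-aligned, which is the bookkeeping the SQL version gets for free.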

> I’ve never been frustrated at 2am that my data in sqlite3

What if it's not your data? And you've never used sqlite3 or Postgres? It's 2AM, they couldn't get a hold of the guy who wrote the code because he's on vacation, or he wrote it 20 years ago and retired... so they haul your sorry self out of bed?

Are you really gonna want to be reading sqlite3 tutorials while your boss and your boss's boss's boss are on the video call?

I don't want to overstate my case, and sure there are plenty of reasons to use a database to store your data. I was just trying to answer the question of the person who asked why human-readable formats are preferable. They are not in all circs, of course, but all other things being equal you will appreciate it when the fit hits the shan.

It’s unclear to me that this is actually fewer moving parts. There are already parquet CLI tools. If your data is in Parquet you should know how to use them or at least have them documented in your oncall runbook.

I'm sure parquet is the bee's knees, and I'm sure if it's your code you'll know how to fix it.

But what if it isn't your code? And you've never heard of parquet before? And it's 2AM and they can't get a hold of the guy who wrote it, so they call in you...

Then there’s a serious organizational failing. Parquet is the de facto standard; chances are a random engineer knows how to interact with it and not ZSV.

> Then there’s a serious organizational failing.

Man, if your success is predicated on working at a company with no organizational problems...

A scrappy start-up can't afford to hire multiple, redundant engineers, and what with all of the massive layoffs happening, even at the big companies a lot of engineers are going to find themselves debugging other people's code.

Then embrace industry standards and don't add unknowns. “There are always organizational problems” is a poor justification for creating more problems.