Hacker News
by SahAssar 746 days ago
Ah, so what you actually mean is more performant (for a subset of uses), not simpler?

So if I have a TSV and a CSV containing either pure numbers or complex data (say the contents of each file in a codebase where each row likely contains both commas and tabs), they would be equivalent in performance, right?

If I have a TSV and a CSV containing natural written language, TSV might be more performant since there are likely many more commas than tabs (I'm guessing this is your point?).

Regardless of the input data the encoding/decoding code would be equally simple (since they need to account for the same edge cases), right?


Sorry, I probably chose my words incorrectly. I meant "work" and "time" in relation to human work to produce the parser.

With CSV, it's more likely you'll encounter data where you need to implement the escaping. With TSV, you can get away with the simple parser for much longer, as it's comparatively rare to find data that contains tabs.
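To make the "simple parser" concrete, here is a minimal Python sketch, assuming no escaping or quoting at all (the name `parse_simple` is made up for this example). The same row of ordinary prose survives a tab split but not a comma split:

```python
def parse_simple(text, delimiter):
    """Split rows on line breaks and fields on the delimiter.

    Assumption: no escaping or quoting is handled, so this is not a
    real TSV or CSV parser.
    """
    return [line.split(delimiter) for line in text.splitlines() if line]


# The same data, once tab-separated and once comma-separated.
tsv_rows = parse_simple("greeting\tcount\nHello, world\t42\n", "\t")
csv_rows = parse_simple("greeting,count\nHello, world,42\n", ",")

print(tsv_rows)  # the data row keeps its two fields
print(csv_rows)  # the data row is split into three fields
```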

If you propose to use a TSV parser that does not handle escaping then that sounds very unsafe to me. Do you also want to skip checking for escaped newlines? Or escaped backslashes?

What you are proposing is not using TSV, but a format that completely bans newlines or tabs in any data. There are certainly uses for such a format but to make it non-risky to use you'd need strict validation on the input to the encoder and make it very clear that it is not TSV, since it does not follow the rules of TSV encoding/decoding and will not produce the same data as a proper TSV implementation.
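As a rough Python sketch of what "strict validation on the input to the encoder" could look like for such a restricted format (the names `FORBIDDEN` and `encode_row` are hypothetical, and this is one possible rule set, not a standard):

```python
# Characters the restricted, TSV-like format bans outright (an assumption
# for this sketch; a real spec would need to list them explicitly).
FORBIDDEN = ("\t", "\n", "\r")


def encode_row(fields):
    """Join fields with tabs, refusing any field the format cannot represent."""
    for field in fields:
        if any(ch in field for ch in FORBIDDEN):
            raise ValueError(f"field contains a forbidden character: {field!r}")
    return "\t".join(fields)


print(encode_row(["a", "b c"]))
```

Rejecting bad input at encode time is what makes the split-only decoder safe; without it, a stray tab silently shifts every later column.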

We're coming at this from different angles. I completely agree with you that the proper way to read these files is using a fully standards-compliant parser. You make the distinction that a parser that can't handle tabs in the data doesn't technically parse "TSV", instead a subset of TSV-like files with limitations - sure, that makes sense.

What I'm trying to get at, is that there are situations in which implementing such a limited parser is justifiable (and for the main discussion in this thread, TSV makes this more commonly achievable than CSV).

With the luxury of time, all our parsers would handle delimiter escaping, unicode, control characters, byte order marks, etc, perfectly, and truly parse "TSV" and "CSV". Personally, I work on-call in SRE - if something is broken, we need solutions NOW. If I have a CSV of stuff, I am not going to implement a proper parser, I don't even have time to boot up a programming language with a CSV library, I am going to split by comma in the terminal of whatever box I'm logged into to get what I need. Most of the time it'll work, and to the discussion in the thread, TSV makes it more likely to work because it's less likely for the delimiter to be in the data. Less likely to need those 5-6 extra characters of regex lookbehind.

My main point: as a consumer of these files, I prefer it when people send me TSVs rather than CSVs, because I am more likely to be able to use a simple not-really-TSV/CSV parser to read them. Sometimes the data's really messy and I need a real parser, but TSV makes this less likely.

> My main point: as a consumer of these files, I prefer it when people send me TSVs rather than CSVs, because I am more likely to be able to use a simple not-really-TSV/CSV parser to read them

My point is that you are not really talking about CSV/TSV since your parser does not handle CSV/TSV. You are using a custom data format. Which is fine and perfectly reasonable, and it's probably specified to avoid all those issues.

But it is not CSV or TSV. When you say "a simple not-really-TSV/CSV parser to read them" you mean you are not using CSV or TSV. That's fine for non-CSV and non-TSV usage. Just be clear about what format you are actually using and specify it. It clearly isn't TSV or CSV.

Thanks for the explanation. Ah, I think I see where our difference is.

A website produces a file with a normal CSV exporter. This is a fully standards compliant proper CSV. I call this a CSV. They provide this file for download, I download it unchanged. By this point, I still call the file a CSV.

Next, I parse the CSV file with my non-CSV parser. Here's our point of contention: I still think the original file is a CSV; I have operated upon it with a non-CSV parser, but for my way of thinking, the file itself is still a CSV. You disagree here, because in order for my use of the parser to be correct, I can't possibly have operated upon a CSV file, I must have operated on a CSV-like file.

I was thinking from the perspective of the file itself and where it came from, so using an incorrect parser doesn't change it. You were thinking in terms of the grammar accepted by the parser I'm using - assuming the parser is appropriate, it's impossible for me to be reading a CSV, it must be something else CSV-like.

I think we are both right, and I think we both understand where the other is coming from.

> I have operated upon it with a non-CSV parser, but for my way of thinking, the file itself is still a CSV. You disagree here, because in order for my use of the parser to be correct, I can't possibly have operated upon a CSV file, I must have operated on a CSV-like file.

Not quite my opinion. The file is still a CSV file, but IMO the parser is not a CSV parser unless it supports the full spec. The file is still CSV, and it happens to be compatible with the incomplete parser because it does not use any "harder" CSV features.
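A small Python sketch of that compatibility, assuming Python's standard `csv` module as the "full" parser (the helper names `naive_rows` and `full_rows` are made up for this example):

```python
import csv
import io


def naive_rows(text):
    """Incomplete parser: split on commas only, with no quote handling."""
    return [line.split(",") for line in text.splitlines()]


def full_rows(text):
    """Parse with the standard library's csv module, which handles quoting."""
    return list(csv.reader(io.StringIO(text, newline="")))


plain = "name,note\nAda,tea\n"
quoted = 'name,note\nAda,"likes tea, coffee"\n'

# Both inputs are valid CSV, but only the first stays within the
# subset the naive parser understands.
print(naive_rows(plain) == full_rows(plain))
print(naive_rows(quoted) == full_rows(quoted))
```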

Let's say we have a website that uses UTF-8 (declared via the Content-Type charset and similar). Some pages on this website only use ASCII, some use higher codepoints within UTF-8.

I can parse some of these pages with an ASCII decoder, but that does not mean that my ASCII decoder is a UTF-8 decoder since it only handles a very small subset of UTF-8 that aligns with ASCII. In this example your CSV-lite would be like ASCII and CSV would be UTF-8.
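The analogy is easy to check in Python. A minimal sketch (the helper `try_ascii` is hypothetical): an ASCII decoder accepts UTF-8 bytes only while they stay in the ASCII range.

```python
def try_ascii(data):
    """Decode bytes as ASCII, returning None when the bytes fall outside ASCII."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        return None


ascii_only = "plain text".encode("utf-8")
with_accent = "café".encode("utf-8")

print(try_ascii(ascii_only))        # decodes, because these bytes are also valid ASCII
print(try_ascii(with_accent))       # fails: "é" is a multi-byte sequence in UTF-8
print(with_accent.decode("utf-8"))  # a real UTF-8 decoder handles both
```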

It is simply easier to take basic written human text and put it into a TSV than a CSV as humans use commas far more than tabs. You can replace a TAB with spaces and keep things legible but replacing a comma in a sentence can literally change the meaning.

Maybe you could wrap everything in quotation marks but that is ugly.

It's not easier or simpler since you need the exact same checks and steps, just with a different delimiter. It doesn't matter if you need to do them fewer times for certain inputs since you need the same checks, the same encoding/decoding steps and so on.

Do you think you would have an easier time writing a TSV parser than a CSV one? If so, why?

And wrapping in quotes does not solve anything since now you need to both check for escaped quotes and tabs/commas. It's the same but one level deeper.
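To illustrate with Python's standard `csv` module (one quoting convention among several; other tools use backslash escapes instead, and `quote_row` is a made-up helper name): once a field is wrapped in quotes, any quote inside it has to be escaped too, and that holds for tabs exactly as for commas.

```python
import csv
import io


def quote_row(fields, delimiter):
    """Encode one row with the csv module's default quoting rules."""
    buf = io.StringIO(newline="")
    csv.writer(buf, delimiter=delimiter).writerow(fields)
    return buf.getvalue()


# A field containing both the delimiter and a quote character.
csv_line = quote_row(["1", 'say "hi", ok'], ",")
tsv_line = quote_row(["1", 'say "hi"\tok'], "\t")

print(repr(csv_line))
print(repr(tsv_line))
```

In both outputs the field is wrapped in quotes and the inner quotes are doubled, so the decoder needs the same logic whichever delimiter is used.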

I think what they're saying is that with some minor control over the data in your dataset, you don't need to care about escaping _in your parser_ at all. The same might be said of CSV but I would argue that in the majority of situations tabs are less semantically meaningful than commas and newlines, so it is generally fine just to strip them out.

Obviously this is not a robust solution, but in cases I've seen, it works adequately. If one were to be doing it "the right way" then I agree with you wholeheartedly.
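A rough Python sketch of that approach, assuming it is acceptable to lose the original whitespace (the names `sanitize_field` and `encode_row_lossy` are made up for illustration):

```python
import re


def sanitize_field(value):
    """Replace runs of tabs and line breaks with a single space (lossy)."""
    return re.sub(r"[\t\r\n]+", " ", value)


def encode_row_lossy(fields):
    """Join sanitized fields with tabs so a plain split("\\t") can read them back."""
    return "\t".join(sanitize_field(field) for field in fields)


line = encode_row_lossy(["a\tb", "line1\nline2"])
print(repr(line))
print(line.split("\t"))
```

The round trip does not return the original data, which is why this only works when the producer controls the dataset and accepts the loss.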

I get what you are saying, but my point is that this is not CSV or TSV. It's a homemade format with its own rules that just happens to be inspired by TSV or CSV.