| HN Mirror



	by willf 746 days ago
	Because commas are much more common than tabs, so less escaping is needed.

2 comments

SahAssar 746 days ago

But you always need to have the code for escaping and you always need to check for it. I don't see how an implementation can be simpler without dropping support for tabs within columns, which would make it non-conformant to the spec.


RadiozRadioz 746 days ago

Because, while you always _should_ implement the proper escaping, that takes work. Not a large amount of work, but more than zero. In many cases your data doesn't contain commas or tabs, so you can do it the super simple way and get back that time. There are more cases where data is tabless than commaless, so using TSV affords you more opportunities to get this quick and dirty timesave when you need a fast solution.
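The "super simple way" described above might look like the following minimal sketch. The function name `parse_simple_tsv` is hypothetical, and the code assumes no field ever contains a tab or a line break, which is exactly the shortcut under discussion.

```python
def parse_simple_tsv(text):
    """Split tab-separated text into rows of fields, with no escaping support.

    Assumption: no field contains a tab or a line break.
    """
    return [line.split("\t") for line in text.splitlines() if line]


rows = parse_simple_tsv("name\tcity\nAda\tLondon\n")
print(rows)  # [['name', 'city'], ['Ada', 'London']]
```

If a field does contain a tab, this silently produces an extra column rather than raising an error.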


SahAssar 746 days ago

Ah, so what you actually mean is more performant (for a subset of uses), not simpler?

So if I have a TSV and a CSV containing either pure numbers or complex data (say the contents of each file in a codebase, where each row likely contains both commas and tabs), they would perform equivalently in both cases, right?

If I have a TSV and a CSV containing natural written language, TSV might be more performant since there are likely many more commas than tabs (I'm guessing this is your point?).

Regardless of the input data the encoding/decoding code would be equally simple (since they need to account for the same edge cases), right?


RadiozRadioz 746 days ago

Sorry, I probably chose my words incorrectly. I meant "work" and "time" in relation to human work to produce the parser.

With CSV, it's more likely you'll encounter data where you need to implement the escaping. With TSV, you can get away with the simple parser for much longer, as it's comparatively rare to find data that contains tabs.


SahAssar 746 days ago

If you propose to use a TSV parser that does not handle escaping then that sounds very unsafe to me. Do you also want to skip checking for escaped newlines? Or escaped backslashes?

What you are proposing is not using TSV, but a format that completely bans newlines or tabs in any data. There are certainly uses for such a format, but to make it non-risky to use you'd need strict validation on the input to the encoder, and you'd need to make it very clear that it is not TSV, since it does not follow the rules of TSV encoding/decoding and will not produce the same data as a proper TSV implementation.
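A rough sketch of what "strict validation on the input to the encoder" could mean for such a tab-free format. The name `encode_row_strict` is made up for illustration; the idea is simply to refuse any field that a real TSV implementation would have to escape.

```python
def encode_row_strict(fields):
    """Join fields with tabs, rejecting any field that would need escaping.

    This is deliberately NOT a full TSV encoder: tabs and line breaks
    inside a field are treated as errors instead of being escaped.
    """
    for field in fields:
        if any(ch in field for ch in ("\t", "\n", "\r")):
            raise ValueError(f"field contains a tab or line break: {field!r}")
    return "\t".join(fields)


print(encode_row_strict(["web-01", "Dublin, IE"]))
```

Failing loudly at encode time is what keeps a no-escaping reader on the other end safe.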


RadiozRadioz 746 days ago

We're coming at this from different angles. I completely agree with you that the proper way to read these files is using a fully standards-compliant parser. You make the distinction that a parser that can't handle tabs in the data doesn't technically parse "TSV", instead a subset of TSV-like files with limitations - sure, that makes sense.

What I'm trying to get at is that there are situations in which implementing such a limited parser is justifiable (and for the main discussion in this thread, TSV makes this more commonly achievable than CSV).

With the luxury of time, all our parsers would handle delimiter escaping, Unicode, control characters, byte order marks, etc., perfectly, and truly parse "TSV" and "CSV". Personally, I work on-call in SRE: if something is broken, we need solutions NOW. If I have a CSV of stuff, I am not going to implement a proper parser. I don't even have time to boot up a programming language with a CSV library; I am going to split by comma in the terminal of whatever box I'm logged into to get what I need. Most of the time it'll work, and, to the discussion in the thread, TSV makes it more likely to work because it's less likely for the delimiter to be in the data. Less likely to need those 5-6 extra characters of regex lookbehind.

My main point: as a consumer of these files, I prefer it when people send me TSVs rather than CSVs, because I am more likely to be able to use a simple not-really-TSV/CSV parser to read them. Sometimes the data's really messy and I need a real parser, but TSV makes this less likely.
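To make the trade-off concrete, here is a small sketch comparing a naive split with Python's standard `csv` module. The sample line is invented for illustration.

```python
import csv
import io

# Invented sample: the middle field contains a comma, so it is quoted.
csv_line = 'web-01,"Dublin, IE",up'
tsv_line = "web-01\tDublin, IE\tup"

naive_csv = csv_line.split(",")
proper_csv = next(csv.reader(io.StringIO(csv_line)))
naive_tsv = tsv_line.split("\t")

print(naive_csv)   # ['web-01', '"Dublin', ' IE"', 'up']
print(proper_csv)  # ['web-01', 'Dublin, IE', 'up']
print(naive_tsv)   # ['web-01', 'Dublin, IE', 'up']
```

The naive comma split breaks on this row, while the naive tab split happens to give the right answer because the data contains no tabs.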


galleywest200 746 days ago

It is simply easier to take basic written human text and put it into a TSV than a CSV as humans use commas far more than tabs. You can replace a TAB with spaces and keep things legible but replacing a comma in a sentence can literally change the meaning.

Maybe you could wrap everything in quotation marks but that is ugly.


SahAssar 746 days ago

It's not easier or simpler, since you need the exact same checks and steps, just with a different delimiter. It doesn't matter if you need to do them fewer times for certain inputs, since you still need the same checks, the same encoding/decoding steps and so on.

Do you think you would have an easier time writing a TSV parser than a CSV one? If so, why?

And wrapping in quotes does not solve anything since now you need to both check for escaped quotes and tabs/commas. It's the same but one level deeper.
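One way to illustrate the "same checks, different delimiter" argument is an encoder/decoder pair that takes the delimiter as a parameter. This is only a sketch: the backslash-escaping convention and the names `encode_row` and `decode_row` are assumptions for illustration, not taken from any spec cited in the thread, and the delimiter is assumed to be neither a backslash nor the letter `n`.

```python
def encode_row(fields, delimiter):
    """Escape backslashes, newlines, and the delimiter, then join the fields."""
    def escape(field):
        return (field.replace("\\", "\\\\")      # backslash -> two backslashes
                     .replace("\n", "\\n")       # newline   -> backslash + 'n'
                     .replace(delimiter, "\\" + delimiter))
    return delimiter.join(escape(f) for f in fields)


def decode_row(line, delimiter):
    """Split on unescaped delimiters and undo the escaping from encode_row."""
    fields, current, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            nxt = line[i + 1]
            current.append("\n" if nxt == "n" else nxt)
            i += 2
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


fields = ["a,b", "c\td", "e\\f", "g\nh"]
for delim in (",", "\t"):
    assert decode_row(encode_row(fields, delim), delim) == fields
```

The code path is identical for commas and tabs; only how often the escape branch fires depends on the data.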


dafelst 746 days ago

I think what they're saying is that with some minor control over the data in your dataset, you don't need to care about escaping _in your parser_ at all. The same might be said of CSV but I would argue that in the majority of situations tabs are less semantically meaningful than commas and newlines, so it is generally fine just to strip them out.

Obviously this is not a robust solution, but in the cases I've seen, it works adequately. If one were to be doing it "the right way" then I agree with you wholeheartedly.
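A sketch of the "strip them out" approach on the writing side, assuming it is acceptable to lose the original whitespace. The helper names `sanitize_field` and `write_row` are hypothetical.

```python
def sanitize_field(field):
    """Replace tabs and line breaks with spaces so no escaping is needed.

    This is lossy: the original tabs and line breaks cannot be recovered.
    """
    return field.replace("\t", " ").replace("\r", " ").replace("\n", " ")


def write_row(fields):
    """Join sanitized fields with tabs into one output line."""
    return "\t".join(sanitize_field(f) for f in fields)


print(repr(write_row(["note with\ta tab", "second\nline"])))
# 'note with a tab\tsecond line'
```

A reader of this output can then split on tabs without any escaping logic, at the cost of not round-tripping the data exactly.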


kjkjadksj 746 days ago

Only if you anticipate your data having commas in fields, which is somewhat rare in my experience (as in I’ve never seen it at all).


setopt 746 days ago

Comma is the decimal separator in many European languages. It’s also not uncommon in strings.


kjkjadksj 746 days ago

Pandas can handle that with a flag


setopt 745 days ago

Sure. Another option is semicolon-delimited files which are also in use in Europe, and Pandas handles that fine too.

I was responding to your comment that you had never seen commas in data fields in CSV files, and wanted to point out that this is a quite common issue in Europe.

(It also often wreaks havoc with Excel files btw, as Excel will then only cast strings to decimal numbers when a file is opened in some locales...)
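For reference, the Pandas flag mentioned above is presumably the `decimal` parameter of `pandas.read_csv`, which can be combined with `sep=";"`. The sketch below shows the same idea using only Python's standard library, with invented sample data. It assumes the numbers have no thousands separators.

```python
import csv
import io

# Invented sample: semicolon-delimited, with comma as the decimal separator.
data = "item;price\nbread;2,50\nmilk;1,20\n"

reader = csv.DictReader(io.StringIO(data), delimiter=";")
prices = {row["item"]: float(row["price"].replace(",", ".")) for row in reader}

print(prices)  # {'bread': 2.5, 'milk': 1.2}
```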
