| HN Mirror



	by SahAssar 746 days ago
	If you propose to use a TSV parser that does not handle escaping then that sounds very unsafe to me. Do you also want to skip checking for escaped newlines? Or escaped backslashes? What you are proposing is not using TSV, but a format that completely bans newlines or tabs in any data. There are certainly uses for such a format but to make it non-risky to use you'd need strict validation on the input to the encoder and make it very clear that it is not TSV, since it does not follow the rules of TSV encoding/decoding and will not produce the same data as a proper TSV implementation.
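The "strict validation on the input to the encoder" mentioned above could look roughly like the sketch below. It assumes a made-up TSV-like format that bans tabs and newlines outright instead of escaping them; the names `encode_row` and `decode_row` are hypothetical and not taken from any library.

```python
# Hypothetical encoder/decoder for a "TSV-like" format that forbids
# tabs and newlines inside field values rather than escaping them.
FORBIDDEN = ("\t", "\n", "\r")


def encode_row(fields):
    """Join fields with tabs, rejecting values the format cannot represent."""
    for value in fields:
        if any(ch in value for ch in FORBIDDEN):
            raise ValueError(f"field contains a tab or newline: {value!r}")
    return "\t".join(fields) + "\n"


def decode_row(line):
    """Plain split; only safe because encode_row enforces the ban above."""
    if line.endswith("\n"):
        line = line[:-1]
    return line.split("\t")
```

With the ban enforced at encode time, `decode_row(encode_row(fields))` returns the original fields, which is the round-trip guarantee a split-only reader silently relies on.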


RadiozRadioz 746 days ago

We're coming at this from different angles. I completely agree with you that the proper way to read these files is using a fully standards-compliant parser. You make the distinction that a parser that can't handle tabs in the data doesn't technically parse "TSV", but instead a subset of TSV-like files with limitations. Sure, that makes sense.

What I'm trying to get at, is that there are situations in which implementing such a limited parser is justifiable (and for the main discussion in this thread, TSV makes this more commonly achievable than CSV).

With the luxury of time, all our parsers would handle delimiter escaping, Unicode, control characters, byte order marks, etc., perfectly, and truly parse "TSV" and "CSV". Personally, I work on-call in SRE: if something is broken, we need solutions NOW. If I have a CSV of stuff, I am not going to implement a proper parser. I don't even have time to boot up a programming language with a CSV library; I am going to split by comma in the terminal of whatever box I'm logged into to get what I need. Most of the time it'll work, and to the discussion in the thread, TSV makes it more likely to work because it's less likely for the delimiter to be in the data. Less likely to need those 5-6 extra characters of regex lookbehind.

My main point: as a consumer of these files, I prefer it when people send me TSVs rather than CSVs, because I am more likely to be able to use a simple not-really-TSV/CSV parser to read them. Sometimes the data's really messy and I need a real parser, but TSV makes this less likely.
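To illustrate the point about delimiters showing up in the data, here is a small sketch comparing a split-only reader with Python's standard `csv` module. The sample record is invented for the example.

```python
import csv
import io

# Invented sample record: the middle field contains a comma, so a
# compliant CSV writer has to quote it.
csv_line = 'web-01,"Portland, OR",up\n'

# Quick-and-dirty: split on the delimiter and ignore quoting rules.
naive = csv_line.rstrip("\n").split(",")

# The standard-library csv module honours the quotes.
proper = next(csv.reader(io.StringIO(csv_line)))

# The same record as TSV needs no quoting, because no field contains a tab.
tsv_line = "web-01\tPortland, OR\tup\n"
naive_tsv = tsv_line.rstrip("\n").split("\t")

print(naive)      # four fragments, quotes left in place
print(proper)     # three fields
print(naive_tsv)  # three fields, same as the real parser's result
```

The split-only approach breaks on the CSV line but happens to agree with the real parser on the TSV line, which is the "more likely to work" effect described above, not a guarantee.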


SahAssar 746 days ago

> My main point: as a consumer of these files, I prefer it when people send me TSVs rather than CSVs, because I am more likely to be able to use a simple not-really-TSV/CSV parser to read them

My point is that you are not really talking about CSV/TSV since your parser does not handle CSV/TSV. You are using a custom data format. Which is fine and perfectly reasonable, and it's probably specified to avoid all those issues.

But it is not CSV or TSV. When you say "a simple not-really-TSV/CSV parser to read them" you mean you are not using CSV or TSV. That's fine for non-CSV and non-TSV usage. Just be clear about what format you are actually using and specify it. It clearly isn't TSV or CSV.


RadiozRadioz 745 days ago

Thanks for the explanation. Ah, I think I see where our difference is.

A website produces a file with a normal CSV exporter. This is a fully standards compliant proper CSV. I call this a CSV. They provide this file for download, I download it unchanged. By this point, I still call the file a CSV.

Next, I parse the CSV file with my non-CSV parser. Here's our point of contention: I still think the original file is a CSV; I have operated upon it with a non-CSV parser, but for my way of thinking, the file itself is still a CSV. You disagree here, because in order for my use of the parser to be correct, I can't possibly have operated upon a CSV file, I must have operated on a CSV-like file.

I was thinking from the perspective of the file itself and where it came from, so using an incorrect parser doesn't change it. You were thinking in terms of the grammar accepted by the parser I'm using - assuming the parser is appropriate, it's impossible for me to be reading a CSV, it must be something else CSV-like.

I think we are both right, and I think we both understand where the other is coming from.


SahAssar 744 days ago

> I have operated upon it with a non-CSV parser, but for my way of thinking, the file itself is still a CSV. You disagree here, because in order for my use of the parser to be correct, I can't possibly have operated upon a CSV file, I must have operated on a CSV-like file.

Not quite my opinion. The file is still a CSV file, but IMO the parser is not a CSV parser unless it supports the full spec. The file is still CSV, and it happens to be compatible with the incomplete parser because it does not use any "harder" CSV features.

Let's say we have a website that uses UTF-8 (declared via the Content-Type charset and similar). Some pages on this website only use ASCII, some use higher codepoints within UTF-8.

I can parse some of these pages with an ASCII decoder, but that does not mean that my ASCII decoder is a UTF-8 decoder since it only handles a very small subset of UTF-8 that aligns with ASCII. In this example your CSV-lite would be like ASCII and CSV would be UTF-8.
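The analogy can be checked directly with Python's built-in codecs. This is only a sketch; the two byte strings stand in for hypothetical pages, and `try_ascii` is a helper invented for the example.

```python
# Two stand-in "pages", both served as UTF-8 bytes.
ascii_only_page = "plain text".encode("utf-8")
non_ascii_page = "naïve café".encode("utf-8")


def try_ascii(data):
    """Decode with the ASCII codec, returning None when it cannot cope."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        return None


# The ASCII decoder agrees with the UTF-8 decoder on the ASCII-only page...
print(try_ascii(ascii_only_page) == ascii_only_page.decode("utf-8"))

# ...but gives up on the page that uses codepoints outside ASCII.
print(try_ascii(non_ascii_page))
```

The ASCII codec at least fails loudly here. A split-only CSV reader usually does not, which is part of why the distinction matters.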


RadiozRadioz 743 days ago

I completely understand the concept, I really do. I'm just struggling to work out where the original disagreement came from, I think it's completely my fault for not articulating myself properly, thank you for your patience. I'm going to annotate my original comment here with clarification on what I originally meant:

Because, while you always _should_ implement the proper escaping [in order to extract the information you need from CSV/TSV files that you have received from an external source that produces correctly-formatted CSV/TSV files], that takes [human effort]. Not a large amount of [human effort], but more than zero. In many cases [the data stored in CSV/TSV representation] doesn't contain commas or tabs, so you can [extract the data from the file] the super simple way [by implementing a naive CSV/TSV-like parser that just happens to work for a subset of CSV files that don't contain escaping] and get back that time. [In doing this, you have extracted the information you need from the file, but you have not done the work to implement a real CSV/TSV parser. You have implemented a parser for a mystery format, misused it on a CSV/TSV file, but it happened to work and you got the data you needed]. There are more cases where data [in the CSV/TSV file you got from an external source] is tabless than commaless, so [if the external source happens to provide you a TSV file instead of a CSV file, this] affords you more opportunities to [be able to misuse your TSV-like parser on the TSV file and still get the data you need, giving you a] quick and dirty timesave when you need [an immediate] solution [where you lack the time to get a real CSV/TSV parser and can tolerate the inherent lack of safety in using a CSV/TSV-like parser on a CSV/TSV].
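One way to keep the quick-and-dirty route a little less risky is to have the naive reader refuse input that looks like it uses the "harder" features. The sketch below is an assumption about how that might be done, not something proposed in the thread; `naive_tsv_rows` is a hypothetical name, and the checks are heuristics that can reject valid data and cannot catch every problem.

```python
def naive_tsv_rows(text, delimiter="\t"):
    """Split-only reader that bails out when the input looks escaped or ragged."""
    # splitlines() splits on \n, \r\n and other line boundaries.
    lines = text.splitlines()
    if not lines:
        return []
    rows = [line.split(delimiter) for line in lines]
    width = len(rows[0])
    for number, (line, row) in enumerate(zip(lines, rows), start=1):
        # Heuristic: quotes or backslashes suggest quoting or escaping is in use.
        if '"' in line or "\\" in line:
            raise ValueError(f"line {number} may use quoting or escaping")
        # Heuristic: a differing field count suggests a delimiter inside a field.
        if len(row) != width:
            raise ValueError(f"line {number} has {len(row)} fields, expected {width}")
    return rows
```

When either check fires, that is the signal to stop and reach for a real parser.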


SahAssar 743 days ago

> I think it's completely my fault for not articulating myself properly, thank you for your patience.

Not at all, and thank you for your patience and for engaging this deeply!

I think the disagreement came from different people. I initially tried to question why /u/guidedlight thought a certain delimiter would be easier/simpler just because it was less common and then we went down a rabbit hole of "what is a CSV/TSV".

I agree with you and I think there are many use-cases for not-as-full CSV/TSV parsers/encoders. My main objection was calling a CSV/TSV implementation a CSV/TSV implementation when it clearly skipped a lot of parts (while it can of course parse a lot of files without those parts). I'd like to call those simpler formats something other than CSV/TSV, but that ship has sailed.

So I think it seems like we are on the same page.
