Hacker News
by xamuel 805 days ago
Linebreaks can be escaped in CSV, so splitting a file into rows is actually ~1/3 the complexity of parsing a whole row.

See: https://github.com/semitrivial/csv_parser/blob/master/split....

Though I suppose that's the naive approach. You could combine the two into a single file by, like you say, wrapping the row-parser in a (clever, non-trivial) outer loop, and it probably wouldn't take anywhere near 1000 characters to do that...
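To illustrate the row-splitting step on its own, here is a minimal sketch. It is not the code from the linked repo, and it assumes the RFC 4180 convention where a literal quote inside a quoted field is written as two quotes. The name `split_rows` is made up for this example.

```python
def split_rows(text: str) -> list[str]:
    """Split CSV text into raw rows, ignoring line breaks inside quoted fields."""
    rows = []
    current = []
    in_quotes = False

    def flush():
        row = "".join(current)
        if row.endswith("\r"):  # drop the CR of a CRLF line ending
            row = row[:-1]
        rows.append(row)
        current.clear()

    for ch in text:
        if ch == '"':
            # A doubled quote ("") toggles the flag twice, so the state
            # ends up unchanged and no lookahead is needed.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == "\n" and not in_quotes:
            flush()  # a line break outside quotes ends the row
        else:
            current.append(ch)

    if current:  # last row without a trailing line break
        flush()
    return rows
```

The rows come back unparsed, with quotes still in place, so a separate row parser would still have to split each one into fields.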

2 comments

> Linebreaks can be escaped in CSV

In some variants of CSV. There isn’t agreement on the format. For example, https://www.ietf.org/rfc/rfc4180.txt says

“While there are various specifications and implementations for the CSV format (for ex. [4], [5], [6] and [7]), there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files.”

That RFC doesn’t even agree with itself, saying

“1. Each record is located on a separate line, delimited by a line break (CRLF).”

but then following that up with:

“6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes”

There is no contradiction, as (1) does not say that each record is located on exactly one line. It could be phrased better, though, e.g. "two consecutive records are separated by...".

This is the "mathematical a", which does not mean "exactly one" but "at least one, and we don't care how many, we already did the interesting work". Like in "this problem has a solution".
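To make the reading of rules 1 and 6 concrete, here is a small sketch using Python's standard `csv` module, which accepts quoted line breaks. The sample data is invented for illustration.

```python
import csv
import io

# Two records; the first has a field containing a CRLF inside double quotes.
data = 'a,"b\r\nc",d\r\ne,f,g\r\n'

# Physical lines, as a naive line-based splitter would see them.
physical_lines = data.splitlines()

# Logical records, as a quote-aware reader sees them. newline="" keeps the
# embedded line break untranslated, as the csv documentation recommends.
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(physical_lines), "physical lines")
print(len(records), "records")
print(records[0])
```

The first record spans two physical lines, yet the two records are still separated by a line break, which fits both rules.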

Parse the whole file in one go. You need to track opening and closing quotes (and escaped ones) anyway, so there is no need to distinguish between commas (semicolons, tabs) and newlines.
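A rough sketch of what "one go" could look like: a single loop with one quote flag, where the delimiter and the line break are just two branches of the same state check. It assumes RFC 4180 style doubled quotes. `parse_csv` and its `delimiter` parameter are hypothetical names, not taken from the linked repo.

```python
def parse_csv(text, delimiter=","):
    """Parse CSV text into a list of records (lists of strings) in one pass."""
    records, record, field = [], [], []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')  # doubled quote -> one literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(ch)  # delimiters and line breaks are plain data here
        elif ch == '"':
            in_quotes = True  # opening quote
        elif ch == delimiter:
            record.append("".join(field))  # end of field
            field = []
        elif ch == "\n" or ch == "\r":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1  # treat CRLF as a single line break
            record.append("".join(field))  # end of field and end of record
            records.append(record)
            record, field = [], []
        else:
            field.append(ch)
        i += 1
    if field or record:  # final record without a trailing line break
        record.append("".join(field))
        records.append(record)
    return records
```

Switching to semicolons or tabs only changes the `delimiter` argument, e.g. `parse_csv("1;2\n3;4", delimiter=";")`.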

Btw. you are not handling escaped double quotes in strings at all, and if you did, you'd also need to count the number of backslashes. Oh, and no need to escape double quotes in single quotes ('\"' could just be '"').
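For CSV variants that escape quotes with a backslash, counting the preceding backslashes could look roughly like this. The helper name `is_escaped` is invented for the example, and the rule "an odd number of backslashes means escaped" is the usual convention, assumed here rather than taken from any spec.

```python
def is_escaped(text, pos):
    """Return True if text[pos] is preceded by an odd number of backslashes."""
    count = 0
    i = pos - 1
    # Walk left over the run of consecutive backslashes before pos.
    while i >= 0 and text[i] == "\\":
        count += 1
        i -= 1
    # An even run is made of escaped backslashes only; an odd run leaves
    # one backslash over to escape the character at pos.
    return count % 2 == 1


one = r'"a\"b"'   # one backslash before the inner quote: escaped
two = r'"a\\"'    # two backslashes: an escaped backslash, then a real closing quote
print(is_escaped(one, 3), is_escaped(two, 4))
```

Checking only the single character before the quote would wrongly treat the closing quote in the second string as escaped.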