by xamuel 805 days ago
Very nice. The submission's main file (SmallestCSVParser.cs) is 3851 characters (of which 657 are commentary).

Mine, in C, is only 2807 characters (of which 198 are commentary):

https://github.com/semitrivial/csv_parser/blob/master/csv.c

Ahh, but the submission's main file is for parsing an entire .csv, whereas mine is only for parsing a single "line" (possibly including quote-escaped newlines). So the submission wins :)

Do you need more than 1k of additional source to wrap the line parsing in a loop? I doubt it.
Linebreaks can be escaped in CSV, so splitting a file into rows is actually ~1/3 the complexity of parsing a whole row.

See: https://github.com/semitrivial/csv_parser/blob/master/split....

Though I suppose that's the naive approach. You could combine the two into a single file by, like you say, wrapping the row-parser in a (clever, non-trivial) outer loop, and it probably wouldn't take anywhere near 1000 characters to do that...
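For illustration, here is a minimal sketch of such an outer loop in Python (not the linked C code). It assumes RFC 4180-style quoting, where a literal quote inside a quoted field is written as a doubled quote (""), and no backslash escapes. The name split_records is hypothetical. Because a doubled quote flips the state twice, tracking a single in/out-of-quotes flag is enough to find the record boundaries.

```python
def split_records(text):
    """Split CSV text into raw records, ignoring line breaks inside quotes."""
    records = []
    start = 0
    in_quotes = False
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == '"':
            # A doubled quote ("") toggles twice, leaving the state unchanged.
            in_quotes = not in_quotes
        elif not in_quotes and ch in "\r\n":
            # An unquoted line break ends the current record.
            records.append(text[start:i])
            # Treat CRLF as a single line break.
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1
            start = i + 1
        i += 1
    # Keep a final record that has no trailing line break.
    if start < n:
        records.append(text[start:])
    return records


rows = split_records('a,b\r\n"x\ny",z\n')
```

Each element of rows is still an unparsed record, so a single-line parser like the one discussed above could then be applied to each in turn.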

> Linebreaks can be escaped in CSV

In some variants of CSV. There isn’t agreement on the format. For example, https://www.ietf.org/rfc/rfc4180.txt says

“While there are various specifications and implementations for the CSV format (for ex. [4], [5], [6] and [7]), there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files.”

That RFC doesn’t even agree with itself, saying

“1. Each record is located on a separate line, delimited by a line break (CRLF).”

but then following that up with:

“6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes”

There is no contradiction, as (1) does not say that "each record is located on a (exactly one) single separate line". But it could be better phrased, like "two consecutive records are separated by...".

This is the "mathematical a", which does not mean "exactly one" but "at least one, and we don't care how many, we already did the interesting work". Like in "this problem has a solution".
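To make that reading concrete, here is a small sketch using Python's standard csv module, which follows this interpretation by default: a record may span several physical lines as long as the line break sits inside a quoted field. The sample data is made up for illustration.

```python
import csv
import io

# Four physical CRLF-terminated lines, but only three records:
# the second record contains a quoted line break.
data = 'id,note\r\n1,"first line\r\nsecond line"\r\n2,plain\r\n'

# newline="" keeps the line endings untranslated, as the csv docs recommend.
records = list(csv.reader(io.StringIO(data, newline="")))

physical_lines = data.count("\r\n")
```

Here physical_lines is 4 while len(records) is 3, so rules (1) and (6) can both hold at once.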

Parse the whole file in one go. You need to track opening and closing quotes (and escaped ones) anyway, so there is no need to distinguish between commas (semicolons, tabs) and newlines.
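A rough sketch of that single-pass idea in Python, again assuming doubled-quote escaping as in RFC 4180 rather than backslash escapes. The name parse_csv and its delimiter parameter are hypothetical. Field separators and record separators are handled by the same state machine; the only difference is which list gets flushed.

```python
def parse_csv(text, delimiter=","):
    """Parse CSV text into a list of rows in one pass over the characters."""
    rows, row, field = [], [], []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')  # doubled quote -> one literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(ch)  # delimiters and newlines are plain data here
        elif ch == '"':
            in_quotes = True
        elif ch == delimiter:
            row.append("".join(field))
            field = []
        elif ch in "\r\n":
            # Treat CRLF as a single line break.
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1
            row.append("".join(field))
            field = []
            rows.append(row)
            row = []
        else:
            field.append(ch)
        i += 1
    # Flush a final record that has no trailing line break.
    if field or row:
        row.append("".join(field))
        rows.append(row)
    return rows


table = parse_csv('a,"b,1"\r\n"say ""hi""","x\ny"\n')
```

This is only a sketch: it does not reject malformed input such as a stray quote in the middle of an unquoted field, and it treats a blank line as a record with one empty field.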

Btw, you are not handling escaped double quotes in strings at all, and if you did, you'd also need to count the number of backslashes. Oh, and there's no need to escape double quotes inside single quotes ('\" could just be '"').

Back in the day (2008) I created an AutoHotkey v1 function for parsing a delimiter-separated line, called ReturnDSVArray.

It can be found here: https://www.autohotkey.com/board/topic/30102-how-can-i-parse...

It consists of some 30ish lines of code, or 67 lines with comments and a usage example.