| HN Mirror



	by lignuist 4403 days ago
	You can replace all commas with a placeholder (e.g. "#COMMA#"), replace the delimiter with a comma, parse the document and then replace all placeholders in the data with ",".
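A minimal sketch of the steps described in this comment, assuming a semicolon delimiter, no quoted fields, and a placeholder that never occurs in the data. The function name `parse_with_placeholder` and the sample input are made up for illustration.

```python
import csv
import io

PLACEHOLDER = "#COMMA#"  # assumption: this token never occurs in the data


def parse_with_placeholder(text, delimiter=";"):
    # 1. Hide literal commas behind the placeholder.
    masked = text.replace(",", PLACEHOLDER)
    # 2. Turn the real delimiter into a comma.
    masked = masked.replace(delimiter, ",")
    # 3. Parse the result as ordinary comma-separated data.
    rows = csv.reader(io.StringIO(masked))
    # 4. Put the literal commas back into every field.
    return [[field.replace(PLACEHOLDER, ",") for field in row] for row in rows]


rows = parse_with_placeholder("name;note\nAda;likes tea, coffee\n")
```

With this input, the second row keeps its comma inside the `note` field. A delimiter that appears inside a quoted field would still be split, since step 2 is a blind text replacement.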

1 comments

Someone 4403 days ago

That does not work, unless that first replacement magically ignores the commas that are part of field separators. If you know how to write the code that does that, your problem is solved.


lignuist 4403 days ago

I was referring to the question "What if the character separating fields is not a comma?".

And there it clearly works. I have used this technique successfully a few times. If you find a CSV file that mixes field separator types, then you have probably found a broken CSV file.


zAy0LfpBZLC8mAC 4403 days ago

No, it doesn't. What if there is #COMMA# in one of the fields?


lignuist 4403 days ago

You just choose a placeholder that does not appear in the data. You could even implement it so that a placeholder that does not appear in the data is selected automatically up front.
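One way the automatic selection could look, as a sketch only. The helper name `choose_placeholder` and the `#COMMA<n>#` naming scheme are assumptions, and the function checks nothing beyond what the comment states: that the token is absent from the input.

```python
import itertools


def choose_placeholder(text, base="COMMA"):
    """Return the first token of the form #COMMA<n># that is absent from text."""
    for n in itertools.count():
        candidate = f"#{base}{n}#"
        if candidate not in text:
            return candidate


first = choose_placeholder("a;b, c")
second = choose_placeholder("a;b #COMMA0# c")
```

Here `first` is `#COMMA0#`, and `second` skips to `#COMMA1#` because `#COMMA0#` already occurs in the input.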

When it comes to parsing, the thing is that you usually have to make some assumptions about the document structure.


zAy0LfpBZLC8mAC 4403 days ago

What if there is #COMMA, in one of the fields (but no #COMMA#)?
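The case in this question can be checked directly with plain string replacement. The snippet below is only an illustration using the placeholder from the thread; the variable names are made up.

```python
PLACEHOLDER = "#COMMA#"

field = "#COMMA,"  # contains "#COMMA," but not "#COMMA#"

# Masking the comma creates an overlapping placeholder pattern.
masked = field.replace(",", PLACEHOLDER)

# Restoring replaces the leftmost match, which is not the one that was inserted.
restored = masked.replace(PLACEHOLDER, ",")

round_trip_ok = restored == field
```

Here `masked` becomes `#COMMA#COMMA#` and `restored` becomes `,COMMA#`, so `round_trip_ok` is `False` even though the placeholder never appeared in the original field.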

Yes, the assumption you have to make is called the grammar, and you had better have a parser that always does what the grammar says. Global text replacement is a technique that is easy to get wrong, difficult to prove correct, and completely unnecessary at that.
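For comparison, a sketch of parsing without any text replacement, assuming Python's standard `csv` module and a semicolon-delimited file that follows the usual quoting rules. The function name `parse_delimited` and the sample data are illustrative.

```python
import csv
import io


def parse_delimited(text, delimiter=";"):
    # The reader is told which delimiter to use, so commas in the data
    # are ordinary characters and need no placeholder.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


rows = parse_delimited('id;note\n1;"#COMMA, and a ; inside quotes"\n')
```

The quoted field comes back unchanged, including both the `#COMMA,` text and the embedded semicolon. This only helps for files that actually follow the quoting rules the reader expects.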


lignuist 4403 days ago

> What if there is #COMMA, in one of the fields (but no #COMMA#)?

What should happen? Since #COMMA is not #COMMA#, it does not match and so is not replaced.

Please keep in mind that I replied to suni's very specific question and was not trying to start a discussion about general parser theory. In practice, we find a lot of files that do not respect the grammar, but we still need to find a way to make the data accessible.
