	by zAy0LfpBZLC8mAC 4447 days ago
	What would happen is that you would first replace #COMMA, with #COMMA#COMMA# and then later replace that with ,COMMA#, thus garbling the data. The way to make the data accessible is to ask for the producer to be fixed; it's that simple. If that is completely impossible, you'll have to figure out the grammar of the data that you actually have and build a parser for that. Your suggested strategy does not work.
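
To make the collision concrete, here is a minimal sketch in Python. It assumes the placeholder is the literal string #COMMA# and that "protecting" a field is a plain string replacement; the function names are hypothetical and only serve the illustration.

```python
PLACEHOLDER = "#COMMA#"  # assumed placeholder, taken from the example above


def protect(field: str) -> str:
    """Hide literal commas behind the placeholder."""
    return field.replace(",", PLACEHOLDER)


def unprotect(field: str) -> str:
    """Turn placeholders back into commas."""
    return field.replace(PLACEHOLDER, ",")


original = "#COMMA,"
protected = protect(original)      # "#COMMA#COMMA#"
restored = unprotect(protected)    # ",COMMA#", not the original

print(repr(original), "->", repr(protected), "->", repr(restored))
```

Note that the placeholder itself never occurs in the original field; a prefix of it followed by a comma is enough to break the round trip.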

2 comments

dbro 4447 days ago

Usually the person parsing the CSV data doesn't have control over the way the data gets written. If he did, he would probably prefer something like protocol buffers. CSV is the lowest common denominator, so it's a useful format for exchanging data between different organizations that are producing and consuming the data.

https://github.com/dbro/csvquote is a small and fast script that can replace ambiguous separators (commas and newlines, for example) inside quoted fields, so that other text tools can work with a simple grammar. After that work is done, the ambiguous commas inside quoted fields get restored. I wrote it to use unix shell tools like cut, awk, ... with CSV files containing millions of records.
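
The idea can be sketched in a few lines of Python. This is not the csvquote source, only an illustration of the approach described above; the two substitute characters (0x1F and 0x1E) and the function names are assumptions made for the sketch.

```python
FIELD_SUB = "\x1f"   # assumed stand-in for a comma inside a quoted field
RECORD_SUB = "\x1e"  # assumed stand-in for a newline inside a quoted field


def sanitize(text: str) -> str:
    """Replace commas and newlines that sit inside double-quoted fields."""
    if FIELD_SUB in text or RECORD_SUB in text:
        raise ValueError("input already contains a substitute character")
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # An escaped quote ("") toggles the state twice, so it nets out.
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes and ch == ",":
            out.append(FIELD_SUB)
        elif in_quotes and ch == "\n":
            out.append(RECORD_SUB)
        else:
            out.append(ch)
    if in_quotes:
        raise ValueError("unterminated quoted field")
    return "".join(out)


def restore_separators(text: str) -> str:
    """Put the original commas and newlines back."""
    return text.replace(FIELD_SUB, ",").replace(RECORD_SUB, "\n")


sample = 'id,comment\n1,"hello, world"\n2,"two\nlines"\n'
safe = sanitize(sample)

# With the quoted separators out of the way, naive splitting is enough.
# (str.splitlines() is avoided because it also splits on 0x1E.)
rows = safe.rstrip("\n").split("\n")
second_column = [row.split(",")[1] for row in rows]
print([restore_separators(value) for value in second_column])
```

The sketch refuses input that already contains a substitute character, because otherwise the restore step could not tell original bytes from substituted ones.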


zAy0LfpBZLC8mAC 4445 days ago

You tend to have more control over the way the data is produced than you think, and you should make use of it. It's idiotic to work around broken producers over and over and over again, each time with a high risk of introducing some bugs, instead of pushing back and getting the producer fixed once and for all. Often the problem is simply in the perception that somehow broken output is just "not quite right", and therefore nothing to make a fuss about. That's not how reliable data processing works. You have a formal grammar, and either your data conforms to it or it does not, and if it doesn't, good software should simply reject it.

Your csvquote is something completely different, though it seems like you yourself might be confused about what it actually is when you use the word "ambiguous". There is nothing ambiguous about commas and newlines in CSV fields. If there were, that would be a bug in the grammar. It just so happens that many unix shell tools cannot handle CSV files in any meaningful way, because that is not their input grammar. Now, what your csvquote actually does is translate between CSV and a format that is compatible with that input grammar on some level, in a reversible manner. The thing to recognize is that that format is _not_ CSV and that you are actually parsing the input according to CSV grammar, so that the translation is actually reversible. Such a conversion between formats is obviously perfectly fine - as long as you can prove that the conversion is reversible, that the round-trip is the identity function, that the processing you do on the converted data is actually isomorphic to what you conceptually want to do, and so on.
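
One way to get a conversion whose round trip is the identity by construction is to escape the escape character itself. The following Python sketch assumes a backslash as escape character and a made-up sequence `\c` for a comma; it illustrates the principle and is not what csvquote does. The randomized check at the end is evidence, not a proof.

```python
import random

ESC = "\\"  # assumed escape character for this sketch


def encode(field: str) -> str:
    """Produce a comma-free encoding of the field."""
    # Escape the escape character first, so that ESC + "c" in the output
    # can only ever stand for a comma.
    return field.replace(ESC, ESC + ESC).replace(",", ESC + "c")


def decode(text: str) -> str:
    """Invert encode(); reject anything encode() could not have produced."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != ESC:
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling escape character")
        nxt = text[i + 1]
        if nxt == ESC:
            out.append(ESC)
        elif nxt == "c":
            out.append(",")
        else:
            raise ValueError("unknown escape sequence")
        i += 2
    return "".join(out)


def check_round_trip(trials: int = 1000, seed: int = 0) -> None:
    """Randomized check that decode(encode(x)) == x on adversarial-looking input."""
    rng = random.Random(seed)
    alphabet = [",", ESC, "c", "#", "a"]
    for _ in range(trials):
        field = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 12)))
        encoded = encode(field)
        assert "," not in encoded
        assert decode(encoded) == field


check_round_trip()
```

Because the decoder reads the encoded text left to right and every escape sequence is exactly two characters long, there is no input for which two different fields share an encoding.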

BTW, I suspect that the code would be quite a bit faster if you didn't use a function pointer in that way and/or made the functions static. I haven't checked what compilers do with it, but chances are they keep that pointer call in the inner loop, which would be terribly slow. Also, you might want to review your error checking; there are quite a few opportunities for errors to go undetected, thus silently corrupting data.


lignuist 4447 days ago

I used that strategy for parsing gigabytes of CSVs containing arbitrary natural language from the web - try to get these files fixed, or figure out a grammar for gigabytes of fuzzy data...

My approach never failed for me, so telling me that my strategy does not work is a strong claim, given that it reliably did the job for me.

Your examples are all valid, but what you are describing are theoretical attacks on the method, while the method works in almost all cases in practice. We are talking about two different viewpoints: dealing with large amounts of messy data on one hand and parser theory in an ideal cosmos on the other hand.


zAy0LfpBZLC8mAC 4447 days ago

How do you know that the strategy worked reliably if you never compared the results to the results obtained using a reliable method (which you presumably didn't, because then you could just have used the reliable method)? The larger the data you have to deal with, the more likely it is that corner cases will occur in it, and the less likely that you will notice anomalies, thus the more important that you are very strict in your logic if you want to derive any meaningful results.

As such, the two viewpoints really are: not really caring about the soundness of your results, versus solving the actual problem.

Now, maybe you really can show that the bugs in the methods you use only cause negligible noise in your results, in which case it might be perfectly fine to use those methods. But just ignoring errors in your deduction process because you don't feel like doing the work of actually solving the problem at hand is not pragmatism. You'll have to at least demonstrate that your approach does not invalidate the result.


lignuist 4447 days ago

Nitpicking much?

As I wrote above, by making sure that I use a placeholder that does not appear in the data, I make sure that it does not cause the issues you describe. And if I was wrong about that assumption, I can at least minimize the effect by choosing a very unlikely sequence as placeholder.
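
A check along these lines could look like the Python sketch below. The helper name and the candidate list are hypothetical. Since the #COMMA, example from the top of the thread shows that a multi-character placeholder can collide even when it is absent from the data, the sketch also verifies the round trip for every field before accepting a candidate.

```python
def pick_placeholder(fields, candidates):
    """Return the first candidate that is absent from every field
    and survives a replace-and-restore round trip on every field."""
    for placeholder in candidates:
        if not placeholder or "," in placeholder:
            continue  # an empty or comma-bearing placeholder can never work
        if any(placeholder in field for field in fields):
            continue  # placeholder already occurs in the data
        round_trips = all(
            field.replace(",", placeholder).replace(placeholder, ",") == field
            for field in fields
        )
        if round_trips:
            return placeholder
    raise ValueError("no safe placeholder among the candidates")


fields = ["plain text", "a, b", "#COMMA,"]
chosen = pick_placeholder(fields, ["#COMMA#", "\x1f"])
print(repr(chosen))
```

Here #COMMA# is rejected even though it does not occur in any field, because the third field fails the round trip; the single control character passes.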

I really see no issue here. How do you find valid grammars for fuzzy data in practice?
