Hacker News new | ask | show | jobs
by coolj 4322 days ago
The regexp must have taken some time to write and debug:

s/^."n":"(.+)","i":"(.+)","p":\[([\d\.\-]+),([\d\.\-]+)],"s":"(\w+)","c":"(.+)".$/\1,\2,\3,\4,\5/;

4 comments

It's actually pretty easy and quick to write when the data is highly regular (no pun intended!). You can write this kind of expression by taking one of the lines of input:

   {"n":"Homewood","i":"inns_suits","p":[33.455237,-86.81964],"s":"AL","c":"1"},
and then making a regular expression that matches that line literally [1]:

   m/^{"n":"Homewood","i":"inns_suits","p":\[33.455237,-86.81964\],"s":"AL","c":"1"},/
     _                                     _                    _
Then replace the parts that will vary with regular expressions to capture them. We want to capture the "n" field:

   m/^{"n":"(.*?)","i":"inns_suits","p":\[33.455237,-86.81964\],"s":"AL","c":"1"},/
            _____
and the "i" field:

   m/^{"n":"(.*?)","i":"(.*?)","p":\[33.455237,-86.81964\],"s":"AL","c":"1"},/
                        _____
and the longitude and latitudes from the "p" field:

   m/^{"n":"(.*?)","i":"(.*?)","p":\[(.*?),(.*?)\],"s":"AL","c":"1"},/
                                     _____ _____
and the "s" field:

   m/^{"n":"(.*?)","i":"(.*?)","p":\[(.*?),(.*?)\],"s":"(.*?)","c":"1"},/
                                                        _____
We don't care about the "c" field, so I'm going to drop it:

   m/^{"n":"(.*?)","i":"(.*?)","p":\[(.*?),(.*?)\],"s":"(.*?)"/
If we want to be fancy, we can make sure that the latitude and longitude consist only of digits, decimal points, and minus signs:

   m/^{"n":"(.*?)","i":"(.*?)","p":\[([\d.-]*?),([\d.-]*?)\],"s":"(.*?)"/
                                       ____       ____
For a one time thing like this, I'd probably deal with this data with a pipe in the shell, rather than use regular expressions:

   tr : , < in | tr -d '[]' | cut -d , -f 2,4,6,7,9 > out.csv
 
[1] I shall use Perl regular expression
I've been a Perl programmer and regular expression user for over a decade. It didn't take long and if you use them regularly, it's second nature. I agree, I could have used a JSON parser, but the source was well-structured and sed seemed easier. PostgreSQL can import JSON, but I haven't had a good experience with it so far.
You put together a really awesome project! Kudos on that and for sharing it first and foremost!! I speak regex pretty fluently, and I may very well have reached for sed/perl -ne there as well, depending on mood. I guess it just struck me as odd to use a parser for XML but not JSON. Anyhow, off on a tangent, I recently learned about a program called xgrep that lets you pull elements out of XML using xpath or pcre (althought pcre support was kind of spotty for me)--it's neat for one-offs like this and pipeline building while playing with rest apis and such!
He could also have had that sitting around from a recent project.. it doesn't really seem worth second-guessing.
Maybe, if you're doing it all at once. I often use my text editor as a regex platform, using the Find/Replace menu to dice off chunks of text in a series of quick and easy operations, using UNDO if I get one wrong. It's really very quick if you don't force yourself to do it in one step.