| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by coolj 4322 days ago
	The regexp must have taken some time to write and debug: s/^."n":"(.+)","i":"(.+)","p":\[([\d\.\-]+),([\d\.\-]+)],"s":"(\w+)","c":"(.+)".$/\1,\2,\3,\4,\5/;

4 comments

tzs 4322 days ago

It's actually pretty easy and quick to write when the data is highly regular (no pun intended!). You can write this kind of expression by taking one of the lines of input:

   {"n":"Homewood","i":"inns_suits","p":[33.455237,-86.81964],"s":"AL","c":"1"},

and then making a regular expression that matches that line literally [1]:

   m/^{"n":"Homewood","i":"inns_suits","p":\[33.455237,-86.81964\],"s":"AL","c":"1"},/
     _                                     _                    _

Then replace the parts that will vary with regular expressions to capture them. We want to capture the "n" field:

   m/^{"n":"(.*?)","i":"inns_suits","p":\[33.455237,-86.81964\],"s":"AL","c":"1"},/
            _____

and the "i" field:

   m/^{"n":"(.*?)","i":"(.*?)","p":\[33.455237,-86.81964\],"s":"AL","c":"1"},/
                        _____

and the longitude and latitudes from the "p" field:

   m/^{"n":"(.*?)","i":"(.*?)","p":\[(.*?),(.*?)\],"s":"AL","c":"1"},/
                                     _____ _____

and the "s" field:

   m/^{"n":"(.*?)","i":"(.*?)","p":\[(.*?),(.*?)\],"s":"(.*?)","c":"1"},/
                                                        _____

We don't care about the "c" field, so I'm going to drop it:

   m/^{"n":"(.*?)","i":"(.*?)","p":\[(.*?),(.*?)\],"s":"(.*?)"/

If we want to be fancy, we can make sure that the latitude and longitude consist only of digits, decimal points, and minus signs:

   m/^{"n":"(.*?)","i":"(.*?)","p":\[([\d.-]*?),([\d.-]*?)\],"s":"(.*?)"/
                                       ____       ____

For a one time thing like this, I'd probably deal with this data with a pipe in the shell, rather than use regular expressions:

   tr : , < in | tr -d '[]' | cut -d , -f 2,4,6,7,9 > out.csv

[1] I shall use Perl regular expression

link

johnjreiser 4321 days ago

I've been a Perl programmer and regular expression user for over a decade. It didn't take long and if you use them regularly, it's second nature. I agree, I could have used a JSON parser, but the source was well-structured and sed seemed easier. PostgreSQL can import JSON, but I haven't had a good experience with it so far.

link

coolj 4321 days ago

You put together a really awesome project! Kudos on that and for sharing it first and foremost!! I speak regex pretty fluently, and I may very well have reached for sed/perl -ne there as well, depending on mood. I guess it just struck me as odd to use a parser for XML but not JSON. Anyhow, off on a tangent, I recently learned about a program called xgrep that lets you pull elements out of XML using xpath or pcre (althought pcre support was kind of spotty for me)--it's neat for one-offs like this and pipeline building while playing with rest apis and such!

link

fenomas 4322 days ago

He could also have had that sitting around from a recent project.. it doesn't really seem worth second-guessing.

link

knodi123 4321 days ago

Maybe, if you're doing it all at once. I often use my text editor as a regex platform, using the Find/Replace menu to dice off chunks of text in a series of quick and easy operations, using UNDO if I get one wrong. It's really very quick if you don't force yourself to do it in one step.

link