Hacker News
by Natsu 1902 days ago
I'm not surprised, the data dump was an ugly mess of inconsistently encoded data in inconsistent formats with "delimiters" that often appear in the data itself.

Cleaning that up is a serious effort and requires operations on huge files that are very difficult for most software to deal with.


I don't understand why a serious effort would be required. Even if the chosen delimiter showing up within the data is an issue, the phone number is the first field.

I can get all the phone numbers myself with a simple `cat * | cut -d ":" -f 1`

That's the ID number you just grabbed. The phone number is the second field :)

If that's literally all you want, yes, it's not that hard. But a non-trivial number of people decided to put commas or colons in their names and other nonsense like that, and there are lots of commas in the hometown or location fields, which makes parsing those a pain, etc.
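
For the "just the phone numbers" case, here is a rough Python sketch. It assumes colon-delimited lines with the ID first and the phone number second; the function name and the sample line are made up for illustration, not taken from the dump.

```python
def phone_from_line(line, delimiter=":"):
    """Return the second field of a delimited line, or None if it doesn't look like a phone number."""
    # maxsplit=2: only the first two delimiters are used, so stray
    # delimiters later in the line (e.g. inside names) never matter.
    parts = line.rstrip("\r\n").split(delimiter, 2)
    if len(parts) < 2:
        return None
    phone = parts[1].strip()
    # Assumption: a phone number is digits with an optional leading "+".
    return phone if phone.lstrip("+").isdigit() else None


# Hypothetical sample line, not real data:
print(phone_from_line("100:15550001111:Some: Guy:None"))
```

Only the first two fields are trusted here, which is why the messy name and location fields don't get in the way.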

Aha, we must be looking at different data then. Possibly someone's already done many of the corrections on the version I'm looking at.
Possible, but there are also different files with different schemas, so it's hard to even say that.

The only ones that actually define the data are the 9 or so CSV files that have a header like:

id,phone,first_name,last_name,email,birthday,gender,locale,hometown,location,link

Those are what I looked at, and those are super annoying because several have commas in both the first & last name. I don't know why, but a handful of people listed their names as `some, guy, some, guy`, which I assume should be split into firstname: `some, guy` and lastname: `some, guy`. Then a lot of people have `None` for a birthday, some have something like `May 8`, and others have something like `May 8, 1990`. Both locale & hometown can be either `None` or have several commas in them.
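
To make the birthday cases concrete, here is a minimal Python sketch that handles the three shapes mentioned above (`None`, `May 8`, `May 8, 1990`). It assumes full English month names, which is a guess; `parse_birthday` is a hypothetical helper, not something from the dump or any tool.

```python
import re

# Assumption: months are spelled out in English.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# "May 8" or "May 8, 1990": month name, day, optional four-digit year.
BIRTHDAY_RE = re.compile(r"^([A-Za-z]+) (\d{1,2})(?:, (\d{4}))?$")


def parse_birthday(text):
    """Return (month, day, year) with year possibly None, or None if missing/unrecognized."""
    text = text.strip()
    if text in ("", "None"):
        return None
    match = BIRTHDAY_RE.match(text)
    if match is None or match.group(1) not in MONTHS:
        return None
    year = int(match.group(3)) if match.group(3) else None
    return (MONTHS[match.group(1)], int(match.group(2)), year)
```

Note that the year form contains a comma of its own, so the birthday has to be recognized as a unit before any naive comma split.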

I had to reformat all that data and validate that each field made sense to parse it. There are helpful "Location" and "link" markers in the CSV but it's still super annoying to parse this stuff.
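
One way to sketch that kind of validation in Python is to anchor on the fields that are easy to recognize and work inward. This is only an illustration under several assumptions that may not hold for every file: the link is the last field and contains no commas, the email is the first token containing `@`, and a name with embedded commas splits evenly between first and last name. `split_row` and the sample row are made up.

```python
def split_row(line):
    """Split one CSV row by anchoring on id/phone (left), link (right), and email (pivot)."""
    tokens = line.rstrip("\r\n").split(",")
    if len(tokens) < 11:  # the header defines 11 columns
        return None
    row = {"id": tokens[0], "phone": tokens[1], "link": tokens[-1]}
    middle = tokens[2:-1]
    # Assumption: the first token containing "@" is the email.
    email_idx = next((i for i, tok in enumerate(middle) if "@" in tok), None)
    if email_idx is None:
        row["unparsed"] = middle  # no pivot found, leave for manual review
        return row
    name_tokens = middle[:email_idx]
    if name_tokens and len(name_tokens) % 2 == 0:
        # Assumption: "some, guy, some, guy" splits evenly into first/last.
        half = len(name_tokens) // 2
        row["first_name"] = ",".join(name_tokens[:half]).strip()
        row["last_name"] = ",".join(name_tokens[half:]).strip()
    else:
        row["name_tokens"] = name_tokens  # ambiguous, keep raw
    row["email"] = middle[email_idx]
    # birthday, gender, locale, hometown, location still need field-by-field checks
    row["rest"] = middle[email_idx + 1:]
    return row
```

Rows with no email (or an odd number of name tokens) are kept raw rather than guessed at, since a wrong split is harder to spot later than an unparsed one.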

Also be careful, some of these docs have BOMs that screw up parsing tools (even iconv crapped out on one of the files, Qatar I think it was) and the encoding is all over the place. At least the phone number is ASCII, but the names may be UTF-8 (with or without BOM), UTF-16-le or...
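
For the BOM problem specifically, a minimal Python sketch using the standard `codecs` BOM constants might look like the following. The fallback order (UTF-8, then Latin-1) is my own assumption, and BOM-less UTF-16 is not detected here; that would need a separate heuristic.

```python
import codecs

# BOMs checked in order; each maps to the codec used for the remaining bytes.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_bytes(raw):
    """Return (text, encoding_used), stripping a leading BOM if one is present."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding), encoding
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: Latin-1 as a last resort. It never fails, but may be wrong.
        return raw.decode("latin-1"), "latin-1"
```

Reading each file as bytes and passing it through something like this before any field splitting keeps the BOM from ending up glued to the first ID.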