by Natsu 1900 days ago
Possible, but there are also different files with different schemas, so it's hard to even say that.

The only ones that actually define the data are the 9 or so CSV files that have a header like:

id,phone,first_name,last_name,email,birthday,gender,locale,hometown,location,link

Those are the ones I looked at, and they're super annoying because several have unquoted commas in both the first & last name. I don't know why, but a handful of people listed their names as "some, guy, some, guy", which I assume should be split into firstname: "some, guy" and lastname: "some, guy". Then a lot of people have None for a birthday, some have something like "May 8", and others have something like "May 8, 1990". Both locale & hometown can be either None or contain several commas.

I had to reformat all that data and validate that each field made sense to parse it. There are helpful "Location" and "link" markers in the CSV but it's still super annoying to parse this stuff.
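For anyone attempting the same cleanup, here is a rough sketch of one way to recover the first few columns. It is not the exact code I used, and it rests on assumptions that may not hold for every row: the email column is always present and is the first token containing "@", id and phone never contain commas, name tokens split evenly between first and last name, and commas inside a field are followed by a space. The names `parse_row` and `take_birthday` are made up for illustration. Everything after gender (locale, hometown, location, link) is left as raw tokens, since splitting it depends on where the marker strings actually appear in a given file.

```python
import re

MONTH_DAY = re.compile(r"^[A-Z][a-z]+ \d{1,2}$")  # e.g. "May 8"
YEAR = re.compile(r"^\d{4}$")                      # e.g. "1990"


def take_birthday(tokens, i):
    """Consume a birthday starting at tokens[i]; return (value, next_index)."""
    if i >= len(tokens):
        raise ValueError("row ends before the birthday column")
    tok = tokens[i]
    if tok == "None":
        return None, i + 1
    if MONTH_DAY.match(tok):
        # "May 8, 1990" was split into "May 8" and "1990"; glue it back.
        if i + 1 < len(tokens) and YEAR.match(tokens[i + 1]):
            return f"{tok}, {tokens[i + 1]}", i + 2
        return tok, i + 1
    raise ValueError(f"unrecognised birthday token: {tok!r}")


def parse_row(line):
    """Best-effort split of one unquoted row; raises ValueError when unsure."""
    tokens = [t.strip() for t in line.split(",")]
    # Assumption: the email is the first token containing "@".
    email_at = next((k for k, t in enumerate(tokens) if "@" in t), None)
    if email_at is None or email_at < 4:
        raise ValueError("no usable email anchor")
    # Everything between phone and email belongs to the two name columns.
    name = tokens[2:email_at]
    if len(name) % 2:
        raise ValueError("odd number of name tokens; cannot split evenly")
    half = len(name) // 2
    birthday, i = take_birthday(tokens, email_at + 1)
    return {
        "id": tokens[0],
        "phone": tokens[1],
        "first_name": ", ".join(name[:half]),
        "last_name": ", ".join(name[half:]),
        "email": tokens[email_at],
        "birthday": birthday,
        "gender": tokens[i] if i < len(tokens) else None,
        "rest": tokens[i + 1:],  # locale, hometown, location, link (unsplit)
    }
```

Rows that don't fit these assumptions raise ValueError instead of being guessed at, so they can be set aside and checked by hand.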