by thehoff 4071 days ago
This is really neat! Any chance this will be a cli tool or module/library?

It doesn't seem to play well with something like this as an input:

    3, Roberto/Carlos, soccer, Brazil
    35, Roberto/Carlos Michael Jordan, baseball, USA
    6, Roberto/Carlos James Lebron, basketball, USA
    10, Roberto/Carlos Shinji Kagawa, soccer, Japan
Format:

    3, ROBERTO/CARLOS, soccer, Brazil
Gives me:

    3, ROBERTO/CARLOS, soccer, Brazil
    35, ROBERTO/CARLOS, Michael, Jordan
    6, ROBERTO/CARLOS, James, Lebron
    10, ROBERTO/CARLOS, Shinji, Kagawa
I can't seem to find a way to get it to parse that out properly (playing with the ROBERTO/CARLOS part).

I even tried this as an input:

    3, Roberto Carlos, soccer, Brazil
    35, Roberto Carlos Michael Jordan, baseball, USA
    6, Roberto Carlos James Lebron, basketball, USA
    10, Roberto Carlos Shinji Kagawa, soccer, Japan
Format:

    3, ROBERTO CARLOS, soccer, Brazil
Gives me:

    3, ROBERTO CARLOS, soccer, Brazil
    35, ROBERTO CARLOS, Michael, Jordan
    6, ROBERTO CARLOS, James, Lebron
    10, ROBERTO CARLOS, Shinji, Kagawa
Edit: format
6 comments

My brain doesn't understand what you're trying to do either. Why is Roberto/Carlos on every line?
I reformatted their examples to resemble some real data I have. The real data isn't names, obviously, but descriptions of some projects. I was curious how this would handle it.

In any case, get rid of the "/" and it's closer to real. Some people have more than two names in their full name. And on a slightly larger set you could very well have something close to my second example.

Currently it matches word by word, so if, for example, someone has a two-part family name like "Van Buyten", it won't work. I think it's the same problem in your example: that the first "column" contains multiple words in some cases? We'll be fixing this in a future release!
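
To make the word-by-word behaviour concrete, here is a minimal Python sketch of a purely positional, word-level matcher. It is only a guess at that kind of model, not the site's actual code; `learn_word_template` and `apply_word_template` are made-up names, and the only transformation it knows is upper-casing.

```python
import re

WORD = r"[^\s,]+"  # a "word" is any run of characters without spaces or commas


def learn_word_template(sample_input, sample_output):
    """Learn, per output word, which input word position it came from."""
    src = re.findall(WORD, sample_input)
    dst = re.findall(WORD, sample_output)
    lowered = [w.lower() for w in src]
    picks = []
    for word in dst:
        i = lowered.index(word.lower())  # first case-insensitive match
        was_upcased = word != src[i] and word == src[i].upper()
        picks.append((i, was_upcased))
    # Separators around the output words, e.g. ["", ", ", " ", ", ", ", ", ""]
    seps = re.split(WORD, sample_output)
    return picks, seps


def apply_word_template(line, picks, seps):
    """Rebuild a line by word position; words beyond the template are dropped."""
    words = re.findall(WORD, line)
    out = [seps[0]]
    for (i, upcase), sep in zip(picks, seps[1:]):
        out.append(words[i].upper() if upcase else words[i])
        out.append(sep)
    return "".join(out)


picks, seps = learn_word_template(
    "3, Roberto Carlos, soccer, Brazil",
    "3, ROBERTO CARLOS, soccer, Brazil",
)
result = apply_word_template(
    "35, Roberto Carlos Michael Jordan, baseball, USA", picks, seps
)
print(result)
```

Because the template only knows about five word positions, the two extra words in the second field push "baseball" and "USA" off the end, which is the same shape of output reported above.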
I thought that a CLI version might be useful too. The closest thing I have right now is sed/awk. Sed can do this kind of thing, but you have to specify a regular expression instead of a simple example. Because you have to be more specific about what you want, sed will definitely handle those examples, with the caveat that you have to tell it what you want to substitute, and where, for each line.

http://linux.die.net/abs-guide/x19673.html

It took me about a year of use before I could figure out how to munge lines in it, so it's definitely not for the faint of heart. I use it for things like transforming Excel spreadsheets into C struct arrays.
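
For anyone who would rather not learn sed, the same "be explicit about what and where" idea can be sketched with Python's `re` module. This assumes the goal was to upper-case the whole second comma-separated field, which is my reading of the examples rather than something stated; `upcase_second_field` is a made-up name.

```python
import re

# Explicit rule: the second comma-separated field, however many words it holds.
# Group 1 is the first field plus its comma; group 2 is the second field.
SECOND_FIELD = re.compile(r"^([^,]*,\s*)([^,]*)")


def upcase_second_field(line):
    """Upper-case the second comma-separated field and leave the rest alone."""
    return SECOND_FIELD.sub(lambda m: m.group(1) + m.group(2).upper(), line)


rows = [
    "3, Roberto Carlos, soccer, Brazil",
    "35, Roberto Carlos Michael Jordan, baseball, USA",
]
converted = [upcase_second_field(row) for row in rows]
for row in converted:
    print(row)
```

The trade-off is the one described above: the pattern has to say where the field boundaries are, but in return a field with four words is handled the same way as a field with two.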

I was curious and sketched up something similar to this website in about 100 lines of Python code. It has a CLI; have a look if you're interested:

https://gist.github.com/martinthenext/fc989ffa6ec84ee09962

Why would it work? Your data isn't even isomorphic.

On the first line, you have "Roberto Carlos" followed immediately by a comma. On subsequent lines you have Roberto Carlos followed by two other names.

Also your example works fine if you use a different delimiter for your format vs for your input, e.g.

    3 | ROBERTO CARLOS | SOCCER | Brazil
Given this "35, Roberto/Carlos Michael Jordan, baseball, USA" tuple what are you expecting as output?
I was expecting that as the output but got: 35, ROBERTO/CARLOS, Michael, Jordan

See the other response: it works word by word. Here is (hopefully) a better example:

Input:

    35, Billy Jean, soccer, USA
    29, Billy Jo Jean, football, USA
Format:

    35, soccer, Billy Jean, USA
You'll get:

    35, soccer, Billy Jean, USA
    29, Jean, Billy Jo, football
But I was expecting:

    35, soccer, Billy Jean, USA
    29, football, Billy Jo Jean, USA
Right, I understand now: it's using words as atoms rather than breaking the line into fields at the commas and using those.
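
For comparison, a field-level version is only a few lines of Python. This is a rough sketch that assumes the separator is a comma and that the example only reorders whole fields (no case changes or other edits); `infer_field_order` and `apply_field_order` are made-up names.

```python
def infer_field_order(sample_input, sample_output, sep=","):
    """For each field of the example output, find its index in the example input."""
    src = [field.strip().lower() for field in sample_input.split(sep)]
    dst = [field.strip().lower() for field in sample_output.split(sep)]
    # list.index raises ValueError if the example contains a field
    # that is not present in the input line.
    return [src.index(field) for field in dst]


def apply_field_order(line, order, sep=","):
    """Reorder the comma-separated fields of a line; each field stays whole."""
    fields = [field.strip() for field in line.split(sep)]
    return (sep + " ").join(fields[i] for i in order)


lines = [
    "35, Billy Jean, soccer, USA",
    "29, Billy Jo Jean, football, USA",
]
order = infer_field_order(lines[0], "35, soccer, Billy Jean, USA")
result = [apply_field_order(line, order) for line in lines]
for row in result:
    print(row)
```

With fields as the atoms, "Billy Jo Jean" moves as one unit, so the second line comes out as `29, football, Billy Jo Jean, USA`.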