by createmyaccount 4078 days ago
Among other things, this can be used for stripping special characters from tables and lists, changing date formats, and creating XML or JSON.

Feedback and suggestions are very much welcome! It is fairly basic right now and we plan on adding a few more features soon, but we would like to hear some opinions and see whether there are people out there who have a use for this.

2 comments

This is really neat! Any chance this will become a CLI tool or a module/library?

It doesn't seem to play well with something like this as an input:

    3, Roberto/Carlos, soccer, Brazil
    35, Roberto/Carlos Michael Jordan, baseball, USA
    6, Roberto/Carlos James Lebron, basketball, USA
    10, Roberto/Carlos Shinji Kagawa, soccer, Japan
Format:

    3, ROBERTO/CARLOS, soccer, Brazil
Gives me:

    3, ROBERTO/CARLOS, soccer, Brazil
    35, ROBERTO/CARLOS, Michael, Jordan
    6, ROBERTO/CARLOS, James, Lebron
    10, ROBERTO/CARLOS, Shinji, Kagawa
I can't seem to find a way to get it to parse that out properly (playing with the ROBERTO/CARLOS part).

I even tried this as an input:

    3, Roberto Carlos, soccer, Brazil
    35, Roberto Carlos Michael Jordan, baseball, USA
    6, Roberto Carlos James Lebron, basketball, USA
    10, Roberto Carlos Shinji Kagawa, soccer, Japan
Format:

    3, ROBERTO CARLOS, soccer, Brazil
Gives me:

    3, ROBERTO CARLOS, soccer, Brazil
    35, ROBERTO CARLOS, Michael, Jordan
    6, ROBERTO CARLOS, James, Lebron
    10, ROBERTO CARLOS, Shinji, Kagawa
Edit: format
My brain doesn't understand what you're trying to do either. Why is Roberto/Carlos on every line?
I reformatted their examples to resemble some real data I have, which contains descriptions of projects rather than names. I was curious how this would handle it.

In any case, get rid of the "/" and it's closer to real data. Some people have more than two names in their full name, and in a slightly larger data set you could very well have something close to my second example.

Currently it matches word by word, so if, for example, someone has a two-part family name like "Van Buyten", it won't work. I think it's the same problem in your example: the first "column" contains multiple words in some cases. We'll be fixing this in a future release!
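For anyone curious, the outputs reported above are consistent with a purely positional, word-level template. The Python sketch below is a guess at that behaviour, not the tool's actual code: `learn_word_template` and `apply_word_template` are hypothetical names, and splitting words on whitespace and commas is an assumption.

```python
import re

# Assumption: a "word" is any run of characters that is not whitespace or a comma.
WORD = re.compile(r"[^\s,]+")

def learn_word_template(example_in, example_out):
    """For each word of the formatted example, record the separator before it,
    the position of the matching input word, and whether it was upper-cased."""
    in_words = WORD.findall(example_in)
    lowered = [w.lower() for w in in_words]
    template, used, pos = [], set(), 0
    for m in WORD.finditer(example_out):
        sep = example_out[pos:m.start()]
        word = m.group()
        # First unused input word that matches, ignoring case.
        # Raises StopIteration if the output contains a word not in the input.
        idx = next(i for i, w in enumerate(lowered)
                   if w == word.lower() and i not in used)
        used.add(idx)
        upper = word.isupper() and not in_words[idx].isupper()
        template.append((sep, idx, upper))
        pos = m.end()
    return template

def apply_word_template(template, line):
    """Rebuild a line by picking words at the learned positions."""
    words = WORD.findall(line)
    parts = []
    for sep, idx, upper in template:
        word = words[idx]
        parts.append(sep + (word.upper() if upper else word))
    return "".join(parts)

template = learn_word_template("3, Roberto Carlos, soccer, Brazil",
                               "3, ROBERTO CARLOS, soccer, Brazil")
print(apply_word_template(template, "35, Roberto Carlos Michael Jordan, baseball, USA"))
# prints: 35, ROBERTO CARLOS, Michael, Jordan
```

The extra words on the longer lines shift every later position, which is why the sport and country fall off the end of the output.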
I thought that a CLI version might be useful too. The closest thing I have right now is sed/awk. sed can do this kind of thing, but you have to specify a regular expression instead of a simple example. Because you have to be more specific about what you want, sed will definitely handle those examples, with the caveat that you have to tell it what to substitute and where on each line.

http://linux.die.net/abs-guide/x19673.html

It took me about a year of use before I could figure out how to munge lines in it, so it's definitely not for the faint of heart. I use it for things like transforming Excel spreadsheets into C struct arrays.
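To make the "regular expression instead of a simple example" point concrete, here is a minimal sketch of the same kind of substitution using Python's `re` module rather than sed itself. It assumes exactly four comma-separated fields per line and uses the "Billy Jean" sample lines from elsewhere in this thread; `swap_middle_fields` is a made-up name.

```python
import re

# One capture group per field. Assumes exactly four fields separated by ", ".
FIELDS = re.compile(r"^([^,]*), ([^,]*), ([^,]*), ([^,]*)$")

def swap_middle_fields(line):
    # Roughly what this sed command would do:
    #   sed -E 's/^([^,]*), ([^,]*), ([^,]*), ([^,]*)$/\1, \3, \2, \4/'
    # Lines that do not match the pattern are returned unchanged.
    return FIELDS.sub(r"\1, \3, \2, \4", line)

for line in ["35, Billy Jean, soccer, USA",
             "29, Billy Jo Jean, football, USA"]:
    print(swap_middle_fields(line))
# prints:
# 35, soccer, Billy Jean, USA
# 29, football, Billy Jo Jean, USA
```

Because the pattern names the comma as the field boundary, multi-word fields are moved as a unit. The cost is that you have to spell out the structure yourself.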

I was curious and sketched up something similar to this website in about 100 lines of Python code. It has a CLI; have a look if you're interested:

https://gist.github.com/martinthenext/fc989ffa6ec84ee09962

Why would it work? Your data isn't even isomorphic.

On the first line, you have "Roberto Carlos" followed immediately by a comma. On subsequent lines you have Roberto Carlos followed by two other names.

Also your example works fine if you use a different delimiter for your format vs for your input, e.g.

    3 | ROBERTO CARLOS | SOCCER | Brazil
Given this "35, Roberto/Carlos Michael Jordan, baseball, USA" tuple, what are you expecting as output?
I was expecting that as the output but got: 35, ROBERTO/CARLOS, Michael, Jordan

See the other response: it works word by word. Here is (hopefully) a better example:

Input:

    35, Billy Jean, soccer, USA
    29, Billy Jo Jean, football, USA
Transform-at:

    35, soccer, Billy Jean, USA
You'll get:

    35, soccer, Billy Jean, USA
    29, Jean, Billy Jo, football
But I was expecting:

    35, soccer, Billy Jean, USA
    29, football, Billy Jo Jean, USA
Right, I understand now: it's using words as atoms rather than splitting each line into fields at the commas and working with those.
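For comparison, a field-level version of "transform by example" could look like the Python sketch below. This is only an illustration of the idea, not how the site works: the function names are hypothetical, it assumes comma-separated fields, and it assumes the example line has no duplicate field values. It only learns reordering, not case changes.

```python
def split_fields(line, sep=","):
    """Split a line at the separator and trim whitespace around each field."""
    return [field.strip() for field in line.split(sep)]

def learn_field_order(example_in, example_out, sep=","):
    """For each field of the formatted example, find which input field it came from."""
    in_fields = split_fields(example_in, sep)
    # list.index raises ValueError if an output field is not present in the input.
    return [in_fields.index(field) for field in split_fields(example_out, sep)]

def apply_field_order(order, line, sep=","):
    """Reorder the fields of a line according to the learned positions."""
    fields = split_fields(line, sep)
    return (sep + " ").join(fields[i] for i in order)

order = learn_field_order("35, Billy Jean, soccer, USA",
                          "35, soccer, Billy Jean, USA")
print(apply_field_order(order, "29, Billy Jo Jean, football, USA"))
# prints: 29, football, Billy Jo Jean, USA
```

Since the atom is the whole field, "Billy Jo Jean" moves as one unit no matter how many words it contains.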
Does it work via a general library for learning patterns, or is it a collection of heuristics to match common cases?

That difference would change the kind of examples I would try it with, knowing in advance when they're too complex to work.

Currently there's no real machine learning or advanced pattern matching involved. But we're certainly working towards that!