| HN Mirror



	by createmyaccount 4078 days ago
	Amongst other things, this can be used for cleaning special characters out of tables/lists, changing date formats, and creating XML or JSON. Feedback and suggestions are very much welcome! It is fairly basic right now and we plan on adding a few more features soon, but we would like to hear some opinions and see whether there are people out there who have a use for this.

2 comments

thehoff 4078 days ago

This is really neat! Any chance this will be a cli tool or module/library?

It doesn't seem to play well with something like this as an input:

    3, Roberto/Carlos, soccer, Brazil
    35, Roberto/Carlos Michael Jordan, baseball, USA
    6, Roberto/Carlos James Lebron, basketball, USA
    10, Roberto/Carlos Shinji Kagawa, soccer, Japan

Format:

    3, ROBERTO/CARLOS, soccer, Brazil

Gives me:

    3, ROBERTO/CARLOS, soccer, Brazil
    35, ROBERTO/CARLOS, Michael, Jordan
    6, ROBERTO/CARLOS, James, Lebron
    10, ROBERTO/CARLOS, Shinji, Kagawa

I can't seem to find a way to get it to parse that out properly (playing with the ROBERTO/CARLOS part).

I even tried this as an input:

    3, Roberto Carlos, soccer, Brazil
    35, Roberto Carlos Michael Jordan, baseball, USA
    6, Roberto Carlos James Lebron, basketball, USA
    10, Roberto Carlos Shinji Kagawa, soccer, Japan

Format:

    3, ROBERTO CARLOS, soccer, Brazil

Gives me:

    3, ROBERTO CARLOS, soccer, Brazil
    35, ROBERTO CARLOS, Michael, Jordan
    6, ROBERTO CARLOS, James, Lebron
    10, ROBERTO CARLOS, Shinji, Kagawa

Edit: format


sp332 4078 days ago

My brain doesn't understand what you're trying to do either. Why is Roberto/Carlos on every line?


thehoff 4078 days ago

I formatted their examples to resemble some real data I have; obviously not names, but descriptions of some projects. I was curious how this would handle it.

In any case, get rid of the "/" and it's closer to real. Some people have more than two names in their full name. And on a slightly larger set you could very well have something close to my second example.


createmyaccount 4078 days ago

Currently it matches word by word, so for example if someone has a family name in two parts like "Van Buyten", it won't work. I think it's the same problem in your example: that the first "column" contains multiple words in some cases? We'll be fixing this in a future release!
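
To make the word-by-word behaviour concrete, here is a minimal Python sketch. It is not the site's actual code, which is not shown in this thread; the function names `word_template` and `apply_template` are made up for illustration, and the assumption is simply that every run of non-space, non-comma characters is one token. It uses thehoff's "Billy Jean" rows from elsewhere in this thread and reproduces the shifted output reported there.

```python
import re

# Assumption: a "word" is any run of characters that is not whitespace or a comma.
WORD = re.compile(r"[^\s,]+")

def word_template(example, fmt):
    """Turn the format line into separators plus word positions in the example."""
    words = WORD.findall(example)
    # re.split with a capture group alternates: separator, word, separator, ...
    parts = re.split(r"([^\s,]+)", fmt)
    template = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # A word: remember where it sat in the example line.
            # Raises ValueError if the format uses a word the example lacks.
            template.append(words.index(part))
        else:
            # A separator: keep it verbatim.
            template.append(part)
    return template

def apply_template(template, line):
    """Rebuild a line by picking words by position, ignoring field boundaries."""
    words = WORD.findall(line)
    return "".join(words[p] if isinstance(p, int) else p for p in template)

template = word_template("35, Billy Jean, soccer, USA",
                         "35, soccer, Billy Jean, USA")
print(apply_template(template, "35, Billy Jean, soccer, USA"))
# 35, soccer, Billy Jean, USA
print(apply_template(template, "29, Billy Jo Jean, football, USA"))
# 29, Jean, Billy Jo, football
```

The second row has one more word than the example, so every position after "Billy" is off by one. That is the same failure a two-part family name would cause.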


jschwartzi 4077 days ago

I thought that a CLI version might be useful too. The closest thing I have right now is sed/awk. sed can do this kind of stuff, but you have to specify a regular expression instead of a simple example. Because you have to be more specific about what you want, sed will definitely handle those examples, with the caveat that you have to tell sed what it is that you want to substitute and where for each line.

http://linux.die.net/abs-guide/x19673.html

It took me about a year of use before I could figure out how to munge lines in it, so it's definitely not for the faint of heart. I use it for things like transforming excel spreadsheets into C struct arrays.
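
For comparison, the explicit regular-expression route might look like the sketch below, written in Python rather than sed so it can sit next to the other snippets in this thread. The pattern, the four-field layout, and the `reorder` helper are assumptions chosen to fit thehoff's sample rows, not something taken from the linked guide.

```python
import re

# Four comma-separated fields; each field is "anything except a comma".
# Roughly the same idea as:
#   sed -E 's/^([^,]*), ([^,]*), ([^,]*), ([^,]*)$/\1, \3, \2, \4/'
FIELDS = re.compile(r"^([^,]*), ([^,]*), ([^,]*), ([^,]*)$")

def reorder(line):
    """Swap the second and third fields; lines that don't match pass through."""
    return FIELDS.sub(r"\1, \3, \2, \4", line)

for row in ["35, Billy Jean, soccer, USA",
            "29, Billy Jo Jean, football, USA"]:
    print(reorder(row))
# 35, soccer, Billy Jean, USA
# 29, football, Billy Jo Jean, USA
```

Because the pattern spells out where each field ends, a name with three words is handled the same way as one with two. The cost is that the rule has to be written by hand instead of being inferred from an example.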


martinthenext 4076 days ago

I was curious and sketched up something similar to this website in about 100 lines of Python code. It has a CLI; have a look if you're interested:

https://gist.github.com/martinthenext/fc989ffa6ec84ee09962


fastball 4077 days ago

Why would it work? Your data isn't even isomorphic.

On the first line, you have "Roberto Carlos" followed immediately by a comma. On subsequent lines you have Roberto Carlos followed by two other names.

Also your example works fine if you use a different delimiter for your format vs for your input, e.g.

    3 | ROBERTO CARLOS | SOCCER | Brazil


cjgk 4078 days ago

AWK is the best!

http://en.wikipedia.org/wiki/AWK


pbhjpbhj 4077 days ago

Given this "35, Roberto/Carlos Michael Jordan, baseball, USA" tuple, what are you expecting as output?


thehoff 4077 days ago

I was expecting that as the output but got: 35, ROBERTO/CARLOS, Michael, Jordan

See the other response: it works word by word. Here is (hopefully) a better example:

Input:

    35, Billy Jean, soccer, USA
    29, Billy Jo Jean, football, USA

Transform-at:

    35, soccer, Billy Jean, USA

You'll get:

    35, soccer, Billy Jean, USA
    29, Jean, Billy Jo, football

But I was expecting:

    35, soccer, Billy Jean, USA
    29, football, Billy Jo Jean, USA


pbhjpbhj 4077 days ago

Right, I understand now: it's using words as atoms rather than breaking fields at the commas and using those.
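
A field-based alternative could be sketched as follows in Python. This is only an illustration of "break at the commas first", not how the site works; `split_fields`, `learn_field_order`, and `transform` are hypothetical names, and the sketch assumes the comma is the only delimiter and that the format line is a pure reordering of the example's fields.

```python
def split_fields(line):
    """Split on commas and trim surrounding whitespace from each field."""
    return [field.strip() for field in line.split(",")]

def learn_field_order(example, fmt):
    """For each field of the format line, find its index in the example line.

    Comparison is case-insensitive so a format like "BILLY JEAN" still
    matches "Billy Jean"; this sketch does not apply the case change itself.
    """
    source = [field.lower() for field in split_fields(example)]
    return [source.index(field.lower()) for field in split_fields(fmt)]

def transform(order, line):
    """Reorder whole fields, however many words each one contains."""
    fields = split_fields(line)
    return ", ".join(fields[i] for i in order)

order = learn_field_order("35, Billy Jean, soccer, USA",
                          "35, soccer, Billy Jean, USA")
print(order)
# [0, 2, 1, 3]
print(transform(order, "29, Billy Jo Jean, football, USA"))
# 29, football, Billy Jo Jean, USA
```

With fields as the atoms, the three-word name stays together and the second row comes out the way thehoff expected.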


TuringTest 4078 days ago

Does it work by a general library for learning patterns, or as a collection of heuristics to match common cases?

That difference would change the kind of examples I would try it with, knowing in advance when they're too complex to work.


kogrem 4077 days ago

Currently there's no real machine learning or advanced pattern matching involved. But we're certainly working towards that!
