by h3ro 4495 days ago
So far I've done this kind of thing with wget, sed and awk, but it's nice to see some more thought-out alternatives.

What I like most about your competition, though, is the JS interface, which gets used for one last good thing (before the page is properly scraped, de-ad-ified and de-java-fied): you click on the content you want and deselect the content you don't. With your mouse you subtly guide a pattern-matching algorithm that does the annoying work.

Honestly, the simplicity of this interface is even more breathtaking to me than gargl :P But it's even more limited: after two clicks it assumes it has already understood the pattern, which might not be the case.

I'd suggest integrating the idea, but making the learning process cleverer: let the user select more things even when the engine thinks there can't be any more similar ones. Give that AI more to learn from. We want more identifiers than just counts and HTML elements: "2nd subelement of <h1>".
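
To make the "more clicks, more learning" point concrete, here's a rough sketch in Python. It is not how gargl or the other tool works; the (tag, position) path representation and the names generalize and matches are made up for illustration. Each click is a path of steps, and every extra example turns a position that differs into a wildcard:

```python
from typing import List, Optional, Tuple

# Hypothetical representation: one step = (tag, 1-based position among siblings).
# A position of None means "any position".
Step = Tuple[str, Optional[int]]


def generalize(paths: List[List[Step]]) -> Optional[List[Step]]:
    """Merge the clicked example paths into one pattern.

    Positions that differ between examples become wildcards (None).
    Returns None if the examples cannot share a pattern in this simple model
    (different depth or different tags).
    """
    if not paths:
        return None
    depth = len(paths[0])
    if any(len(p) != depth for p in paths):
        return None
    pattern: List[Step] = []
    for steps in zip(*paths):
        tags = {tag for tag, _ in steps}
        if len(tags) != 1:
            return None
        positions = {pos for _, pos in steps}
        # Keep the position only if every example agrees on it.
        pattern.append((steps[0][0], steps[0][1] if len(positions) == 1 else None))
    return pattern


def matches(pattern: List[Step], path: List[Step]) -> bool:
    """True if a concrete path fits the learned pattern."""
    return len(pattern) == len(path) and all(
        ptag == tag and (ppos is None or ppos == pos)
        for (ptag, ppos), (tag, pos) in zip(pattern, path)
    )


# Two clicks inside the first <ul>: only the <li> position gets generalized.
click_a = [("ul", 1), ("li", 1), ("a", 1)]
click_b = [("ul", 1), ("li", 3), ("a", 1)]
two_clicks = generalize([click_a, click_b])

# A third click in another <ul> tells the engine the list itself may vary too.
click_c = [("ul", 2), ("li", 2), ("a", 1)]
three_clicks = generalize([click_a, click_b, click_c])
```

With only two clicks the pattern is stuck on the first list; the third click is what widens it, which is why cutting the user off after two is a problem.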

There's good stuff you can do with statistics, too. Some data exists only once, some exists only 3 times, some always exists over 10 times. That's valuable info. Some data has many words of whitespace-separated text: oh, a paragraph!
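
A minimal sketch of that kind of statistics, again in Python. The 1 / 3 / over-10 buckets come from the numbers above; the label names, the "several" bucket for everything in between, and the 20-word paragraph threshold are my own assumptions:

```python
from typing import Dict, List


def cardinality_label(count: int) -> str:
    """Bucket a pattern by how often it occurs on a page."""
    if count == 1:
        return "singleton"   # e.g. a page title
    if count <= 3:
        return "few"
    if count > 10:
        return "list"        # e.g. rows of a result listing
    return "several"         # assumed bucket for 4..10 occurrences


def looks_like_paragraph(text: str, min_words: int = 20) -> bool:
    """Many whitespace-separated words suggest running prose.

    min_words is an arbitrary assumption, not a tuned value.
    """
    return len(text.split()) >= min_words


def profile(hits: Dict[str, List[str]]) -> Dict[str, Dict[str, object]]:
    """Summarize each pattern: how often it occurs, and whether it holds prose."""
    summary: Dict[str, Dict[str, object]] = {}
    for pattern, texts in hits.items():
        summary[pattern] = {
            "cardinality": cardinality_label(len(texts)),
            "paragraph": any(looks_like_paragraph(t) for t in texts),
        }
    return summary


# Hypothetical extraction result: pattern name -> text of every match.
page = {
    "h1": ["Some title"],
    "li": ["item"] * 12,
    "p": [" ".join(["word"] * 25)],
}
stats = profile(page)
```

Labels like these could be shown next to each candidate pattern, so the user picks "the list of 12 things" instead of guessing at selectors.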

tl;dr: We need something that automatically generates good semantics out of normal web sites, so that users can pick the right pattern through a simple web UI embedded in the target web site.