| HN Mirror


by sephware 2819 days ago

Why is this a language instead of a library on top of an existing language?

Here's what it would look like as a JavaScript (Node.js or browser) library:

    let g = getDocument("https://www.google.com/", true);
    
    g.input('input[name="q"]', "ferret");
    g.click('input[name="btnK"]');
    
    g.waitNavigation();
    
    // Map each search hit to a plain object; the arrow parameter names the hit.
    let results = g.elements('.g').map(result => ({
      title: result.element('h3 > a'),
      description: result.element('.st'),
      url: result.element('cite')
    }));
    
    return results.filter(i => i.title !== null);

5 comments

brixon 2819 days ago

John Hammond: I don't think you're giving us our due credit. Our scientists have done things which nobody's ever done before...

Ian Malcolm: Yeah, yeah, but your scientists were so preoccupied with whether or not they could, that they didn't stop to think if they should.


RussianCow 2819 days ago

I'm not sure why you're being downvoted. As far as I can tell from the examples, there is nothing that this language brings to the table that couldn't be implemented instead as an API on top of an existing language.


existencebox 2819 days ago

The "funny" part is, while reading this, I kept thinking of a similar tool I built myself as a wrapper around python+beautifulsoup. Definitely parts of what the OP needed were compelling to me: I found regular subpatterns in my scraping work that could be encapsulated really well in certain bits of syntactic sugar, but concluded that a minimal JSON-like structure for defining a scrape was both sufficient and let me have a graphical "scrape builder" in a UX far more readily than if I actually wrote the scrape as code.

There's the usual amount of HN cynicism in the thread which I'm not sure is 200% off mark, but I think there are some good concepts in "scraping primitives" worth contemplating, and the OP took an interesting angle on them. (Or rather, not the angle I took, so interesting to me.)
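For illustration, here is a minimal sketch of what a JSON-like scrape definition and a tiny interpreter for it might look like. It is not the wrapper described above. The spec keys (`each`, `fields`, `require`), the `runScrape` function, and the `queryAll`/`text` node interface are all assumptions made up for this example, and `fakeNode` only stands in for a parsed page so the sketch runs without a browser.

```javascript
// Hypothetical JSON-like scrape definition: plain data, no code.
const spec = {
  each: ".g",
  fields: {
    title: "h3 > a",
    description: ".st",
    url: "cite"
  },
  require: ["title"]
};

// Interprets a spec against any node exposing queryAll(selector) -> nodes
// and a `text` property. Both names are assumptions for this sketch.
function runScrape(spec, root) {
  return root.queryAll(spec.each)
    .map(node => {
      const row = {};
      for (const [name, selector] of Object.entries(spec.fields)) {
        const hit = node.queryAll(selector)[0];
        row[name] = hit ? hit.text : null;
      }
      return row;
    })
    // Drop rows that are missing any field listed in `require`.
    .filter(row => (spec.require || []).every(name => row[name] !== null));
}

// Stand-in for a parsed page: children are looked up by exact selector string.
function fakeNode(text, children = {}) {
  return { text, queryAll: selector => children[selector] || [] };
}

const page = fakeNode("", {
  ".g": [
    fakeNode("", {
      "h3 > a": [fakeNode("Ferret")],
      ".st": [fakeNode("A web scraping system")],
      "cite": [fakeNode("https://example.com/ferret")]
    }),
    fakeNode("", { ".st": [fakeNode("No title here")] })
  ]
});

const rows = runScrape(spec, page);
console.log(JSON.stringify(rows, null, 2));
```

Because the definition is plain data, a graphical builder would only need to emit and edit an object like `spec`, which is presumably part of the appeal over writing the scrape as code.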


theboat 2819 days ago

Is your beautifulsoup wrapper open source?


existencebox 2819 days ago

If other people would find it beneficial it can be; I had admittedly seen it as "just a mess of sugar to help me scrape easier" (with all the typical nervousness of showing others "imperfect" code). I'm aptly in the process of cleaning it up, writing tests and finishing an MVP UI. I can do a show HN in a few weeks once I've found the time to get it ship-shape.


ziflex 2819 days ago

You definitely need to share. Web scraping is tedious. The more ideas we have, the more options we have for coming up with a better solution.


ziflex 2819 days ago

That's true. The difference is how much effort is needed to do that using an API.

What it brings is just a higher-level abstraction over that API, which lets you get work done easily.


RussianCow 2819 days ago

Do you have a more involved example where Ferret really shines, as opposed to a library with a similar API in JS or another common language? I really don't mean to be negative, but I just don't see how Ferret is any easier to use than something like Nightmare[0]. That said, I'm wondering if it's an issue of communication more than anything, so maybe a different example than the one in the readme would help.

[0]: https://github.com/segmentio/nightmare


ziflex 2819 days ago

You are fine, I totally understand your scepticism. And you are right, there are definitely issues in communication.

First of all, I built it for myself. I needed a high-level representation of scraping logic that would run in an isolated and safe environment. Second, I needed to be able to easily scrape dynamic pages.

So, what I got is:

- A high-level, declarative-ish language that hides all infrastructural details, which helps you focus on the logic itself and describe what you want without worrying about the underlying technology. Today I'm using headless Chrome, tomorrow I will use something else, but the change should not affect your code.
- Full support for dynamic pages. You can get data from a dynamically rendered page, emulate user actions, etc. Heck, you can even write bots with it.
- Embeddable. For now I have only a CLI, but there are plans to write a web server where you can save your scripts, schedule them and set up output streams.
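The point about swapping the underlying technology without touching scraping code can be sketched in a few lines. This is only an illustration under stated assumptions: `StaticDriver`, `LoggingDriver`, `scrapeTitle`, and the `open`/`text` methods are hypothetical names, not Ferret's API, and everything is kept synchronous for brevity even though a real browser backend would be asynchronous.

```javascript
// Hypothetical backend: serves canned pages keyed by URL, then by selector.
class StaticDriver {
  constructor(pages) {
    this.pages = pages;
  }
  open(url) {
    const data = this.pages[url] || {};
    // text(selector) returns the canned string, or null if nothing matches.
    return { text: selector => (selector in data ? data[selector] : null) };
  }
}

// A second backend with the same shape. It records every URL it opens and
// delegates to another driver; a real one might wrap headless Chrome instead.
class LoggingDriver {
  constructor(inner) {
    this.inner = inner;
    this.log = [];
  }
  open(url) {
    this.log.push(url);
    return this.inner.open(url);
  }
}

// Scraping logic depends only on open() and text(), not on the backend.
function scrapeTitle(driver, url) {
  const page = driver.open(url);
  return { url, title: page.text("h1") };
}

const url = "https://example.com/";
const pages = { [url]: { h1: "Example Domain" } };

const direct = scrapeTitle(new StaticDriver(pages), url);
const wrapped = new LoggingDriver(new StaticDriver(pages));
const viaWrapper = scrapeTitle(wrapped, url);

console.log(direct, viaWrapper, wrapped.log);
```

Here `scrapeTitle` is unchanged across both backends, which is the property the list above is describing. A dedicated language gives the same isolation at the syntax level rather than through an interface.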

But the main idea is to provide a high-level, declarative way of scraping the web. I'm not saying you can't do that with other tools. I'm just trying to come up with something easier to work with.

Regarding examples, the project is still WIP, so the more complex features I add, the more complex examples I'll have. Here is a more or less complex one, getting data from Google Search. It's not that difficult, but it showcases the core feature of working with dynamic pages.

https://github.com/MontFerret/ferret/blob/master/docs/exampl...


sephware 2819 days ago

"Much more effort"? Right now it implements a library and a language on top of that. Making it just be a library would cut the work in half.


ziflex 2819 days ago

The idea is to create a high-level abstraction that represents your web scraping logic. The project is still WIP. I will create a web server which will help you store your queries, schedule them and set up output streams to other systems like Spark and Flink.


anonytrary 2818 days ago

Javascript isn't in uppercase, therefore it is an inferior language to write queries in. The best, most concise solution here is to reinvent javascript in uppercase, then pass it off as a new QL.


xrd 2818 days ago

No one is responding to your comment BECAUSE IT ISN'T IN ALL CAPS TO MAKE THEM NOTICE.


ziflex 2819 days ago

The main purpose is to use scripts the way you use SQL: you can write and modify your data scripts without compilation. Plus, the project aims to simplify the process and hide the technical complexity behind it. And don't forget that the system can work with dynamic pages, which brings more complexity underneath.

And finally, you can use it as a library. It's totally embeddable.


lukeholder 2819 days ago

It is also a library for the Go language.
