by hacker_9 2818 days ago
I'm sorry but there is nothing new here? This seems like a backwards step if anything.

Usually when web scraping, I can just load in the HtmlAgilityPack (C#), point it at a URL, then write some functional code to extract the necessary data.
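To make the static-HTML case concrete, here is a rough Python sketch of the same idea. It uses the standard library's `html.parser` as a stand-in for HtmlAgilityPack (not the same API), and the sample markup and the `LinkExtractor` / `extract_links` names are made up for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Parse an HTML string and return all link targets in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = (
    '<ul>'
    '<li><a href="/books/1">One</a></li>'
    '<li><a href="/books/2">Two</a></li>'
    '<li><a>no href</a></li>'
    '</ul>'
)

links = extract_links(SAMPLE_HTML)
```

This only works when the data is already in the HTML the server sends, which is exactly the limitation the replies below argue about.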

Even better, I'll examine the website in Fiddler and hope they have a data-view separation going on, so I can just intercept the JSON file they load instead.

Worst-case scenario, I need to dynamically click on buttons etc., but this can usually be handled by Selenium, or if they detect that, I just roll a custom implementation of CefSharp (again not hard: just download the NuGet package, and it lets you run your own custom JavaScript).

A new, more limited language (with no IDE tools) is not the way to go. If anything, a better web scraper would just make the processes I mentioned above more seamless, for example by combining finding/selecting of elements in Chrome with codegen.

The main advantage of this over your approach with HtmlAgilityPack is that Ferret can handle dynamic web pages - those that are rendered with JS. And also, it can emulate user interactions. But anyway, thanks for your feedback :)
I think AngleSharp can handle JS and it's not that different from HtmlAgilityPack. https://github.com/AngleSharp/AngleSharp
14th of July this year: https://github.com/AngleSharp/AngleSharp/issues/693. I'm not sure there's much JS support.
The code for doing this isn't too difficult with https://github.com/chromedp/chromedp. Is this just some helpers around that? I haven't used it or Puppeteer on the Node side that heavily, but what have you found difficult that deserves this kind of wrapper/abstraction instead of direct library use?
Right, but that implies a view-model separation, so you can usually just access the data file directly, which is usually JSON.
I'm sorry, I do not fully understand what you mean. Imagine that you need to grab some data from SoundCloud, and also imagine they do not have a public API :) How would you do it without launching a browser?
You just need to look at the packets in Fiddler as the page loads, find the request that gets the data you want, and then clone that request in your application.

I just took a look at SoundCloud and was able to get the data within a minute, as it's a basic setup. They use a JSON structure (most websites do) [1], and the data is fetched with a simple GET request.

If the website requires authentication, you just need to clone those requests too, and then you'll get some sort of cookie/session id back which you attach to any future requests.
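As a rough sketch of what "cloning the request" can look like, here is a Python version using only the standard library. The URL and header values are placeholders, not anything taken from SoundCloud; in practice you would copy them from whatever Fiddler or the browser's Network tab shows. The cookie jar is what carries a session cookie from a login response over to later requests.

```python
import json
import urllib.request
from http.cookiejar import CookieJar


def build_cloned_request(url, headers):
    """Build a GET request that mirrors what the browser sent."""
    return urllib.request.Request(url, headers=headers, method="GET")


def make_session():
    """Return an opener that stores cookies and resends them automatically."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_json(opener, request, timeout=10):
    """Send the request and decode the JSON body (performs network I/O)."""
    with opener.open(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical values; replace with the ones observed in the traffic capture.
DATA_URL = "https://example.com/api/tracks?limit=10"
CLONED_HEADERS = {
    "Accept": "application/json",
    "User-Agent": "Mozilla/5.0 (copied from the browser)",
}

request = build_cloned_request(DATA_URL, CLONED_HEADERS)
opener, jar = make_session()
# data = fetch_json(opener, request)  # left commented out: needs a real endpoint
```

For an authenticated site, the login request would go through the same `opener` first, so any cookie it sets is attached to the later data request.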

As a bonus: you can then throw the JSON into an online code converter, which will turn your structure into a type as well (useful if you are using a statically typed language).

[1] https://imgur.com/a/Einoxdg
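The "convert the JSON to a type" step might end up looking something like this in Python. This is a hand-written sketch, not the output of any particular converter, and the field names (`id`, `title`, `duration_ms`) and the `collection` wrapper are assumptions made up for the example rather than SoundCloud's real schema.

```python
import json
from dataclasses import dataclass


@dataclass
class Track:
    id: int
    title: str
    duration_ms: int

    @classmethod
    def from_dict(cls, raw):
        # Pick out only the fields we care about; unknown keys are ignored.
        return cls(
            id=int(raw["id"]),
            title=str(raw["title"]),
            duration_ms=int(raw["duration_ms"]),
        )


def parse_tracks(payload):
    """Turn a JSON response body into a list of typed Track objects."""
    document = json.loads(payload)
    return [Track.from_dict(item) for item in document["collection"]]


# Hypothetical response body standing in for the intercepted data.
SAMPLE_JSON = """
{"collection": [
  {"id": 1, "title": "Intro", "duration_ms": 61000, "extra": "ignored"},
  {"id": 2, "title": "Outro", "duration_ms": 45000}
]}
"""

tracks = parse_tracks(SAMPLE_JSON)
```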

This is so 2010... websites now use a mix of server-side and client-side rendering. You can't just watch HTTP requests and figure out the "API" specifications. Just try to scrape AdWords/Gmail/Amazon etc. Consider that they may also use anti-scraping code on the client.
Even without a documented API, there is often one in play behind a website. Example: the website is available at "http://example.com/books/1234". When loading it, you see that it fires a request to "http://example.com/api/booksdata/1234" to load the data that populates the page. So now you don't have to use a slow browser that loads everything, but can just use your normal HTTP client (for all the IDs you know).
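A small sketch of that mapping in Python, using the example.com URLs from the comment. The helper names are made up, and it assumes the ID is numeric and is the last path segment, which is only true for this illustrative pattern.

```python
from urllib.parse import urlsplit, urlunsplit

PAGE_PREFIX = "/books/"
API_PREFIX = "/api/booksdata/"


def page_url_to_api_url(page_url):
    """Map a book page URL to the background data URL the page would request."""
    parts = urlsplit(page_url)
    if not parts.path.startswith(PAGE_PREFIX):
        raise ValueError(f"not a book page: {page_url}")
    book_id = parts.path[len(PAGE_PREFIX):].strip("/")
    if not book_id.isdigit():
        raise ValueError(f"unexpected book id: {book_id!r}")
    return urlunsplit((parts.scheme, parts.netloc, API_PREFIX + book_id, "", ""))


def api_urls_for_ids(base, ids):
    """Build the data URL for every ID you already know."""
    return [f"{base}{API_PREFIX}{book_id}" for book_id in ids]


api_url = page_url_to_api_url("http://example.com/books/1234")
batch = api_urls_for_ids("http://example.com", [1234, 1235])
```

Each resulting URL can then be fetched with a plain HTTP client instead of a full browser.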
That's true. You can do it, of course. I'm not saying that this is the only way of doing it.
I think what's being noted is that when the data comes as a data structure on the page, or as data passed back from an XHR request, you can just use that data directly and there's less page scraping to be done. This is generally how dynamic pages are created, out of a shipped data structure and rules to create the page out of it. If you have the data structure, it's generally much easier to parse than the page generated from it.
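For the case where the data structure is shipped inside the page itself, a sketch like the following can pull it out without rendering anything. It assumes the page embeds its state in a `<script>` tag with a known `id`; the `initial-state` id and the sample page are invented for the example, and real sites vary in how (or whether) they do this.

```python
import json
from html.parser import HTMLParser


class EmbeddedJsonExtractor(HTMLParser):
    """Collects the text of the <script> element with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def result(self):
        text = "".join(self._chunks).strip()
        return json.loads(text) if text else None


def extract_embedded_json(html, script_id):
    """Return the parsed JSON from the matching script tag, or None if absent."""
    parser = EmbeddedJsonExtractor(script_id)
    parser.feed(html)
    parser.close()
    return parser.result()


# Hypothetical page; the script id is an assumption.
SAMPLE_PAGE = """
<html><body>
<div id="app"></div>
<script id="initial-state" type="application/json">
{"books": [{"id": 1234, "title": "Example"}]}
</script>
<script>console.log("unrelated");</script>
</body></html>
"""

state = extract_embedded_json(SAMPLE_PAGE, "initial-state")
```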

That said, for pages that use a background request to fetch the data, this can be useful, as the data used to build the page isn't always kept around as a data structure (at least not one that is easily accessible) afterwards. That is, if accessing the endpoint of the background request isn't feasible for some reason.

He means that you look at the private API and use that.
> I'll examine the website in Fiddler

Isn't that overkill compared to just using the built-in browser devtools Network tab?

I find the Fiddler UI easy to use, plus you can add plugins such as converting the request straight to code. Up to you what tool you use of course.
Both Firefox and Chrome have a "Copy as cURL" feature, which is really useful. I agree with you though, the UI sucks.