by ziflex 2818 days ago
The main advantage of this over your approach with HtmlAgilityPack is that Ferret can handle dynamic web pages, that is, those rendered with JS. It can also emulate user interactions. But anyway, thanks for your feedback :)
3 comments

I think AngleSharp can handle JS and it's not that different from HtmlAgilityPack. https://github.com/AngleSharp/AngleSharp
14th of July this year: https://github.com/AngleSharp/AngleSharp/issues/693. I'm not sure there's much JS support.
The code for doing this isn't too difficult with https://github.com/chromedp/chromedp. Is this just a set of helpers around that? I haven't used it, or puppeteer on the Node side, that heavily, but what have you found difficult enough to deserve this kind of wrapper/abstraction instead of direct library use?
Right, but that case implies a view-model separation, so you can usually just access the data file directly, which is usually JSON.
I'm sorry, I do not fully understand what you mean. Imagine that you need to grab some data from SoundCloud, and also imagine that they do not have a public API :) How would you do it without launching a browser?
You just need to look at the packets in Fiddler as the page loads, find the request that gets the data you want, and then clone that request in your application.

I just took a look at SoundCloud and was able to get the data within a minute, as it's a basic setup. They use a JSON structure (most websites do) [1], and the data is fetched with a simple GET request.
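
To make the "clone that request" step concrete, here is a minimal sketch in Python. The URL and headers are hypothetical placeholders standing in for whatever you copy out of Fiddler; they are not SoundCloud's real endpoint, and the sketch assumes the response body is JSON.

```python
import json
import urllib.request

# Hypothetical values, as they might be copied from a captured request.
CAPTURED_URL = "https://example.com/api/tracks?user=1234"
CAPTURED_HEADERS = {
    "Accept": "application/json",
    "User-Agent": "Mozilla/5.0",
}


def build_request(url, headers):
    """Recreate the captured GET request with the same URL and headers."""
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_json(request, opener=urllib.request.urlopen):
    """Send the request and decode the JSON body.

    `opener` is injectable so the function can be exercised offline.
    """
    with opener(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json(build_request(CAPTURED_URL, CAPTURED_HEADERS))` would then return the decoded data, provided the real endpoint accepts the same headers the browser sent.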

If the website requires authentication, you just need to clone those requests too, and then you'll get some sort of cookie/session id back which you attach to any future requests.
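
A rough sketch of the cookie part, again in Python with only the standard library. The login URL and the form field names (`username`, `password`) are assumptions for illustration; a real site will use whatever the captured login request shows.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores cookies and resends them automatically."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Recreate a captured form login. The field names are hypothetical."""
    body = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=body.encode("utf-8"), method="POST")
```

After `opener.open(build_login_request(...))`, any session cookie the server sets lands in `jar`, and later `opener.open(...)` calls to the same site attach it without extra code.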

As a bonus, you can then paste the JSON into an online code converter, which will turn the structure into a type definition (useful if you are using a statically typed language).
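
For illustration, this is roughly what such a generated type could look like, written by hand in Python. The `Track` fields are made up for the example and are not taken from any real response.

```python
import json
from dataclasses import dataclass


@dataclass
class Track:
    # Hypothetical fields; a real response defines its own.
    id: int
    title: str
    duration_ms: int


def parse_tracks(payload):
    """Convert a JSON array of track objects into typed Track instances."""
    return [
        Track(
            id=int(item["id"]),
            title=str(item["title"]),
            duration_ms=int(item["duration_ms"]),
        )
        for item in json.loads(payload)
    ]
```

A missing key raises `KeyError` here, which is usually what you want when the site changes its structure without notice.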

[1] https://imgur.com/a/Einoxdg

This is so 2010... websites now use a mix of server side and client side. You can't just watch HTTP requests and figure out the "API" specifications. Just try to scrape adwords/gmail/amazon etc. Consider that they may also use anti-scraping code on the client.
Even without a documented API, there is often one in play behind a website. Example: Website is available at "http://example.com/books/1234". When loading it, you see that it fires a request to "http://example.com/api/booksdata/1234" to load the data that populates the page. So now you don't have to use a slow browser that loads everything, but can just use your normal HTTP client (for all the ids you know).
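
The example above can be sketched like this in Python. The URL template comes from the comment; the assumption that the endpoint returns JSON, and the function names, are mine.

```python
import json
import urllib.request

# Data endpoint observed while loading http://example.com/books/<id>.
API_TEMPLATE = "http://example.com/api/booksdata/{book_id}"


def api_url(book_id):
    """Build the data endpoint URL for one book id."""
    return API_TEMPLATE.format(book_id=book_id)


def fetch_books(book_ids, opener=urllib.request.urlopen):
    """Fetch every known id from the data endpoint, keyed by id.

    Assumes each response body is JSON. `opener` is injectable for testing.
    """
    results = {}
    for book_id in book_ids:
        with opener(api_url(book_id)) as response:
            results[book_id] = json.loads(response.read().decode("utf-8"))
    return results
```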
That's true. You can do it, of course. I'm not saying that this is the only way of doing it.
I think what's being noted is that when the data comes as a data structure on the page, or as data passed back from an XHR request, you can just use that data directly and there's less page scraping to be done. This is generally how dynamic pages are created, out of a shipped data structure and rules to create the page out of it. If you have the data structure, it's generally much easier to parse than the page generated from it.
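
When the data ships inside the page itself, one common pattern is a `<script type="application/json">` element. A minimal sketch of pulling that out with Python's standard HTML parser follows; it assumes that exact `type` attribute, and real pages may instead assign the data to a JavaScript variable, which this does not handle.

```python
import json
from html.parser import HTMLParser


class EmbeddedJsonParser(HTMLParser):
    """Collect the contents of <script type="application/json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in several pieces, so buffer it.
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_json_script = False


def extract_embedded_json(html):
    """Return every JSON document embedded in the given HTML string."""
    parser = EmbeddedJsonParser()
    parser.feed(html)
    parser.close()
    return parser.documents
```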

That said, for pages that use a background request to fetch the data, this can be useful, as the data used to build the page isn't always kept around afterwards as a data structure (at least not an easily accessible one). That is, if accessing the endpoint of the background request isn't feasible for some reason.

He means that you look at the private API and use that.