I'm still testing it and improving it (there are so many different websites with different responses...), so If you have any comments I'm looking forward to what you think about it.
Suppose I wanted to extract an image that gets loaded async via Javascript (For example, a Pinterest page). How would that work? Looking at your documentation, it looks like I could parse the XHR array you supply. Could you suggest any other ways? I'm calling out Pinterest as an example here because they try to block their images from being easily downloaded, but if you have any other examples I'd like to hear them.
It would be great if the page analyzer could supply a list of all the assets loaded with the web page; for example, any asset with a media type of image/* is listed in an images array, and so forth.
Actually the list of assets shouldn't be that hard. Looking at pinterest the xhr requests for images are loaded immediately when page is open, so potentialy it then it's catched in onRequest function (only now I'm aborting the requests to save network trafic). I will try it our tomorrow and let you know in comment.
Also, looking at pinterest, it's server rendered through ReactJS, so there is #initial-state script tag with first few images preloaded as urls, so if you cared only about the images on top without scrolling then this is the safest bet.
Good idea and I would implement that if I used an API from server to get the response. But currently I'm at the same time testing stability of Apify "Actor" solution and proxies, so for my case it's good that there are real requests with real responses, even if it's just from demo.
Btw the fact that it's running for 5 minutes is a bug, that I will look at, since there is a timeout of 2 minutes and there are no hanging runs or runs that ended with timeout.
It's why I'm using proxies, every request is routed through different proxy address and the application as whole is rate limited. So hopefully I'm not making too much traffic on yelp. They are just a perfect example because they are using all types of data I'm looking for.
When I find more good examples I will add them and rotate them for every page load.
Btw when it comes to ToS and scraping, this is not much different from accessing their website through normal browser only instead of rendered content we should you analyzed data. The page is only loaded once same as in browser.
Not that i can see from a surface view, i think documentation can be improved :). Personally like the idea of APIFY, saw it a few months ago. Are you guys hiring ? :D
:D yep the documentation needs a lot of work. It started as a test of an idea, then slowly became a usable tool and the code was getting incrementaly more complex without me event noticing. I only added the readme on github yesterday and there are basicaly no tests... :(
It would be great if the page analyzer could supply a list of all the assets loaded with the web page; for example, any asset with a media type of image/* is listed in an images array, and so forth.