| HN Mirror



	by jardah 3048 days ago
	I'm still testing and improving it (there are so many different websites with different responses...), so if you have any comments, I'm looking forward to hearing what you think about it.

4 comments

JustARandomGuy 3048 days ago

Suppose I wanted to extract an image that gets loaded asynchronously via JavaScript (for example, on a Pinterest page). How would that work? Looking at your documentation, it looks like I could parse the XHR array you supply. Could you suggest any other ways? I'm calling out Pinterest as an example here because they try to block their images from being easily downloaded, but if you have any other examples I'd like to hear them.

It would be great if the page analyzer could supply a list of all the assets loaded with the web page; for example, any asset with a media type of image/* is listed in an images array, and so forth.
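To make the request concrete, here is a minimal sketch of that grouping in Python. It assumes the analyzer can hand over (URL, Content-Type) pairs for every loaded response; the function name `group_assets_by_type` and the input shape are hypothetical and not part of the tool's documented output.

```python
from collections import defaultdict


def group_assets_by_type(responses):
    """Group (url, content_type) pairs by top-level media type.

    `responses` is an assumed input shape, not the analyzer's real output.
    """
    groups = defaultdict(list)
    for url, content_type in responses:
        # Drop parameters such as "; charset=utf-8" and normalise case.
        mime = content_type.split(";", 1)[0].strip().lower()
        # "image/png" -> "image"; anything without a slash is "unknown".
        top_level = mime.split("/", 1)[0] if "/" in mime else "unknown"
        groups[top_level].append(url)
    return dict(groups)
```

With this shape, every asset served as image/* ends up under the "image" key, fonts under "font", and so on.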

jardah 3048 days ago

Actually, the list of assets shouldn't be that hard. Looking at Pinterest, the XHR requests for images are fired immediately when the page is opened, so they can potentially be caught in the onRequest function (only right now I'm aborting those requests to save network traffic). I will try it out tomorrow and let you know in a comment.
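Roughly, the idea is to remember the URL before the request is aborted, so the traffic saving is kept. The sketch below is language-neutral logic written in Python, not the tool's actual code; the class `RequestLog`, the `resource_type` argument, and the "abort"/"continue" return values are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RequestLog:
    """Collects image URLs seen by a (hypothetical) request hook."""

    image_urls: list = field(default_factory=list)

    def on_request(self, url, resource_type):
        """Decide what to do with a request and record image URLs."""
        if resource_type == "image":
            # Record the URL once, then abort so the body is never downloaded.
            if url not in self.image_urls:
                self.image_urls.append(url)
            return "abort"
        # Everything else is allowed through unchanged.
        return "continue"
```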

Also, looking at Pinterest, it's server-rendered through ReactJS, so there is an #initial-state script tag with the first few images preloaded as URLs. If you only care about the images at the top, without scrolling, then this is the safest bet.
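A minimal sketch of that approach, using only the Python standard library. The `initial-state` id comes from the page as described above; the assumption that the tag holds JSON, and the heuristic of treating any string ending in a common image extension as an image URL, are mine and may not match the real payload.

```python
import json
from html.parser import HTMLParser

IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".webp")


class InitialStateParser(HTMLParser):
    """Collects the text inside <script id="initial-state">."""

    def __init__(self, script_id="initial-state"):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def find_image_urls(node):
    """Recursively yield strings that look like image URLs (heuristic)."""
    if isinstance(node, dict):
        for value in node.values():
            yield from find_image_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_image_urls(item)
    elif isinstance(node, str) and node.startswith("http"):
        # Ignore the query string when checking the file extension.
        if node.split("?", 1)[0].lower().endswith(IMAGE_SUFFIXES):
            yield node


def extract_initial_state_images(html):
    """Return image URLs found in the initial-state JSON, or [] if absent."""
    parser = InitialStateParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    if not raw:
        return []
    return list(find_image_urls(json.loads(raw)))
```

This only sees what the server preloaded, so anything fetched later while scrolling would still have to come from the request log.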

_Chief 3048 days ago

How about caching the default entry (a static URL instead) plus its attributes, to ease demoing? At the moment it's been analyzing for more than 5 minutes.

jardah 3048 days ago

Good idea, and I would implement that if I used an API from the server to get the response. But currently I'm also testing the stability of the Apify "Actor" solution and proxies, so in my case it's good that there are real requests with real responses, even if it's just from the demo.

Btw, the fact that it's been running for 5 minutes is a bug that I will look at, since there is a timeout of 2 minutes and there are no hanging runs or runs that ended with a timeout.

ComputerGuru 3048 days ago

You also don’t want to get your server blocked by yelp if they do rate limiting.

jardah 3048 days ago

That's why I'm using proxies: every request is routed through a different proxy address, and the application as a whole is rate limited. So hopefully I'm not generating too much traffic on Yelp. They are just a perfect example because they use all the types of data I'm looking for. When I find more good examples, I will add them and rotate them on every page load.
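As a rough illustration of that setup (not the actual implementation), the sketch below hands out proxies round-robin and enforces one global minimum interval between requests. The class name `RotatingProxyLimiter`, the round-robin order, and the fixed-interval limiter are all assumptions; the real application may rotate and throttle differently.

```python
import itertools
import time


class RotatingProxyLimiter:
    """Round-robin proxy picker with one application-wide rate limit."""

    def __init__(self, proxies, min_interval, clock=time.monotonic, sleep=time.sleep):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = itertools.cycle(proxies)
        self._min_interval = min_interval  # seconds between any two requests
        self._clock = clock                # injectable for testing
        self._sleep = sleep
        self._last = None                  # time of the previous request

    def next_proxy(self):
        """Block until the rate limit allows a request, then return a proxy."""
        now = self._clock()
        if self._last is not None:
            wait = self._min_interval - (now - self._last)
            if wait > 0:
                self._sleep(wait)
                now = self._clock()
        self._last = now
        return next(self._proxies)
```

Each outgoing request would call `next_proxy()` first, so consecutive requests use different addresses while the total request rate stays capped.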

Btw, when it comes to ToS and scraping, this is not much different from accessing their website through a normal browser, only instead of rendered content we show you analyzed data. The page is only loaded once, same as in a browser.

bpicolo 3048 days ago

They have fairly aggressive scraper detection (and this is also against their ToS)

GSGSGS 3048 days ago

Are you Jaroslov ? :)

jardah 3048 days ago

Jaroslav, yes, I'm the author. Did you notice any problems or ways I could improve it?

GSGSGS 3048 days ago

Not that I can see from a surface view. I think the documentation can be improved :). Personally, I like the idea of APIFY; I saw it a few months ago. Are you guys hiring? :D

jardah 3048 days ago

:D Yep, the documentation needs a lot of work. It started as a test of an idea, then slowly became a usable tool, and the code was getting incrementally more complex without me even noticing. I only added the readme on GitHub yesterday and there are basically no tests... :(

jancurn 3048 days ago

Yes we are! Please see https://www.apify.com/jobs

johnnyfived 3048 days ago

It's great that you're communicating openly on HN.

I just sent an application for the Junior Web Developer position.

Looking forward to hearing back!

rmateus 3048 days ago

Is it able to deal with digital certificates?

jancurn 3048 days ago

You can use https://www.htbridge.com/ssl/ for that