by pixelmonkey 4076 days ago

One of the supported projects is splash, which is basically WebKit-as-a-service. It takes an interesting approach to crawling where it renders the page using WebKit, and then exposes the "rendered DOM" -- so that your crawling code doesn't need to actually use JavaScript for information extraction. See:

https://github.com/scrapinghub/splash

People often use Scrapy + Splash together in the Python community for crawling more dynamic websites.
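For anyone who wants to see what "exposes the rendered DOM" can look like in practice, here is a minimal sketch. It assumes a Splash instance is reachable over HTTP (the base URL is whatever your own deployment uses, commonly port 8050) and uses its `render.html` endpoint with the `url` and `wait` parameters. The helper names `build_render_url` and `fetch_rendered_html` are made up for illustration and are not part of Splash or Scrapy.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_render_url(splash_base: str, target_url: str, wait: float = 0.5) -> str:
    """Build the request URL for Splash's render.html endpoint.

    splash_base is an assumption: the address of your own Splash instance.
    wait is how long Splash should let the page's JavaScript run, in seconds.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


def fetch_rendered_html(splash_base: str, target_url: str, wait: float = 0.5) -> str:
    """Ask Splash to render the page and return the resulting HTML as text."""
    request_url = build_render_url(splash_base, target_url, wait)
    with urlopen(request_url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

The returned HTML can then go to whatever parser you already use, so the extraction code itself never has to execute JavaScript.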

A team I collaborated with is also working on a project to make Scrapy usable in a "cluster context"; it's called scrapy-cluster. The idea is to run Scrapy workers across machines, with a single crawl queue (powered by Redis in the current prototype) shared between them all.

https://github.com/istresearch/scrapy-cluster
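To illustrate the shared-queue idea only, here is a rough sketch. It is not scrapy-cluster's actual API: the `SharedFrontier` class and `run_workers` function are hypothetical, and an in-process `deque` stands in for the Redis-backed queue so the example runs without any services.

```python
from collections import deque
from typing import Dict, List, Optional


class SharedFrontier:
    """Stand-in for a crawl queue shared by all workers.

    In a real cluster this state would live in something like Redis;
    here it is kept in memory purely for illustration.
    """

    def __init__(self) -> None:
        self._queue: deque = deque()
        self._seen: set = set()

    def push(self, url: str) -> bool:
        """Enqueue a URL unless it was already scheduled. Returns True if added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self) -> Optional[str]:
        """Return the next URL to crawl, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


def run_workers(frontier: SharedFrontier, worker_names: List[str]) -> Dict[str, List[str]]:
    """Simulate workers taking turns pulling from the one shared queue."""
    handled: Dict[str, List[str]] = {name: [] for name in worker_names}
    turn = 0
    while True:
        url = frontier.pop()
        if url is None:
            break
        handled[worker_names[turn % len(worker_names)]].append(url)
        turn += 1
    return handled
```

Because every worker pulls from the same queue, no URL is crawled twice and adding a machine just adds another consumer.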

2 comments

WalterGR 4076 days ago

    It takes an interesting approach to crawling where
    it renders the page using WebKit, and then exposes
    the "rendered DOM" -- so that your crawling code
    doesn't need to actually use JavaScript for
    information extraction.

It is an interesting approach. There's evidence that Google crawls the web that way, though I don't know if it's been confirmed by the company.

Googlebot indexes content rendered by JavaScript, even content delivered by an AJAX request. They've announced they are going to start penalizing sites that don't work well on mobile. I don't know the specifics of that (and they probably haven't shared them), but I do know that I've received automated emails from Google Webmaster Tools and/or AdSense about one of my sites not working great on mobile: small UI elements grouped too closely together, content that's too wide, etc.

hayksaakian 4076 days ago

This is the tool recommended to me by a person on the AdWords team:

https://www.google.com/webmasters/tools/mobile-friendly/

According to them, starting april 21st it will be a ranking factor.

WalterGR 4076 days ago

    april 21st

Great. April 2011 was when Google launched Panda 1.0, from which I don't think my slang dictionary site has ever recovered.

Thanks for the link. I guess I better hop to it.

alexmobile 4076 days ago

Thanks! I've briefly looked at Splash and related projects like ScrapingHub, etc. Looks like this niche is alive and kicking...

The distributed scrapy-cluster is the way to go if you need to crawl anything of decent size (maybe even Amazon, 300+ MM webpages, j/k :)

I've seen a lot of Python-based projects recently, even in the Bitcoin niche, and we even have a local Toronto-based Python meetup. Looks like the Python dev community is active.

I have a domain name, PYFORUM.com. Would it be a good idea to launch a forum site? With Bitcoin tipping built in? So instead of saying "Thanks", people would be able to send $0.25 in Bitcoin to those who helped them in the forums or made them laugh? What are the most established Python forums out there?

Thanks!

pki 4076 days ago

Launch a forum actually using Python...

Even the largest 'Python forum' is on phpBB...
