|
|
|
|
|
by pixelmonkey
4076 days ago
|
|
One of the supported projects is splash, which is basically WebKit-as-a-service. It takes an interesting approach to crawling where it renders the page using WebKit, and then exposes the "rendered DOM" -- so that your crawling code doesn't need to actually use JavaScript for information extraction. See: https://github.com/scrapinghub/splash People often use Scrapy + Splash together in the Python community for crawling more dynamic websites. A team I collaborated with is also working on a project to make Scrapy usable in a "cluster context", it's called scrapy-cluster. The idea is scrapy workers running across machines and a single crawling queue (in the current prototype, powered by Redis) in between them all. https://github.com/istresearch/scrapy-cluster |
|
GoogleBot indexes content rendered by Javascript - even content delivered by an AJAX request. They've announced they are going to start penalizing sites that don't work well on mobile. I don't know the specifics of that (and they probably haven't shared them) but I do know that I've received automated email from Google Webmaster Tools and/or AdSense about one of my sites not working great on mobile: small UI elements grouped too closely together, content that's too wide, etc.