by alexmobile 4076 days ago
Thanks for posting - this is amazing... I was actually working on a far from finished article "How a Search Engine Startup company could compete with Google" http://bitexperts.com/Question/Detail/42/how-a-search-engine...

and then this announcement came along. I'll be watching to see what kind of crawlers they release. Hopefully some modern ones based on a WebKit / Chromium core that expose the DOM and are suitable for navigating all these modern, AJAX-fueled web interfaces.

Also very interested to see what kind of machine learning / classifiers they are using. When working on our search engine, we used purely statistical classifiers: all the variations of Bayes, SVMs (Support Vector Machines) and C4.5 decision trees, glued together with some custom algorithms. We did not use neural nets at all. Nowadays, neural nets have a new name - "deep learning" - and seem to be everywhere.

Really really interesting in terms of what people would build with an Open Source Search Engine. Watch out Google :)


One of the supported projects is splash, which is basically WebKit-as-a-service. It takes an interesting approach to crawling where it renders the page using WebKit, and then exposes the "rendered DOM" -- so that your crawling code doesn't need to actually use JavaScript for information extraction. See:

https://github.com/scrapinghub/splash

People often use Scrapy + Splash together in the Python community for crawling more dynamic websites.
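For anyone who wants to try the idea without Scrapy, here is a minimal sketch using only the Python standard library. It assumes a Splash instance is already running locally on port 8050 (my assumption, adjust `SPLASH_BASE` for your setup) and uses Splash's `render.html` HTTP endpoint to get the HTML after JavaScript has run. The helper names (`splash_render_url`, `LinkCollector`, `extract_links`, `fetch_rendered_links`) are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Splash instance is reachable here; change to match your setup.
SPLASH_BASE = "http://localhost:8050"


def splash_render_url(page_url, wait=0.5, base=SPLASH_BASE):
    """Build a request URL for Splash's render.html endpoint.

    `wait` is the number of seconds Splash waits after page load so
    that scripts get a chance to modify the DOM.
    """
    return f"{base}/render.html?" + urlencode({"url": page_url, "wait": wait})


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Plain HTML parsing; no JavaScript engine needed on this side."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def fetch_rendered_links(page_url):
    """Ask Splash to render the page, then extract links from the result.

    Needs a running Splash instance, so it is only defined here, not called.
    """
    with urlopen(splash_render_url(page_url)) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return extract_links(html)
```

The point is the split of responsibilities: Splash does the rendering, and the extraction code only ever sees ordinary HTML.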

A team I collaborated with is also working on a project to make Scrapy usable in a "cluster context"; it's called scrapy-cluster. The idea is to have Scrapy workers running across machines, with a single crawling queue (in the current prototype, powered by Redis) shared between them all.

https://github.com/istresearch/scrapy-cluster
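To make the "workers plus one shared queue" idea concrete, here is a toy sketch. It is not scrapy-cluster's API: `InMemoryFrontier` and `run_worker` are hypothetical names, and the queue is an in-process stand-in so the example runs without Redis. In a real cluster, the queue and the seen-set would live in a shared store such as Redis so that workers on different machines see the same state.

```python
from collections import deque


class InMemoryFrontier:
    """Hypothetical stand-in for the shared crawl queue.

    Holds a FIFO queue of URLs to crawl plus a set of URLs already
    enqueued, so that no URL is handed out twice.
    """

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def push(self, url):
        """Enqueue a URL unless it was enqueued before. Returns True if added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        """Return the next URL to crawl, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None


def run_worker(frontier, fetch_links, max_pages):
    """Crawl up to max_pages URLs taken from the shared frontier.

    fetch_links(url) must return the links found on that page; newly
    discovered links go back into the same frontier for any worker to take.
    """
    crawled = []
    while len(crawled) < max_pages:
        url = frontier.pop()
        if url is None:
            break
        crawled.append(url)
        for link in fetch_links(url):
            frontier.push(link)
    return crawled


# A fake four-page site standing in for real HTTP fetches.
SITE = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}

frontier = InMemoryFrontier()
frontier.push("a")
first = run_worker(frontier, lambda u: SITE.get(u, []), max_pages=2)
second = run_worker(frontier, lambda u: SITE.get(u, []), max_pages=10)
```

Here the two workers run one after the other for simplicity; the second one picks up exactly the pages the first one discovered but did not crawl, which is the property a shared queue is meant to give you.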

    It takes an interesting approach to crawling where
    it renders the page using WebKit, and then exposes
    the "rendered DOM" -- so that your crawling code
    doesn't need to actually use JavaScript for
    information extraction.
It is an interesting approach. There's evidence that Google crawls the web that way, though I don't know if it's been confirmed by the company.

Googlebot indexes content rendered by JavaScript - even content delivered by an AJAX request. They've announced they are going to start penalizing sites that don't work well on mobile. I don't know the specifics of that (and they probably haven't shared them), but I do know that I've received automated emails from Google Webmaster Tools and/or AdSense about one of my sites not working great on mobile: small UI elements grouped too closely together, content that's too wide, etc.

This is the tool recommended to me by a person on the AdWords team:

https://www.google.com/webmasters/tools/mobile-friendly/

According to them, starting April 21st it will be a ranking factor.

    April 21st
Great. April 2011 was when Google launched Panda 1.0, from which I don't think my slang dictionary site has ever recovered.

Thanks for the link. I guess I better hop to it.

Thanks! I've briefly looked at Splash and related projects like ScrapingHub, etc. - looks like this niche is alive and kicking...

The distributed scrapy-cluster is the way to go if you need to crawl anything of decent size (maybe even Amazon - 300+ MM webpages, j/k :)

I've seen a lot of Python-based projects recently, even in the Bitcoin niche, and we even have a local Toronto-based Python meetup. Looks like the Python dev community is active.

I have the domain name PYFORUM.com - would it be a good idea to launch a forum site? With Bitcoin tipping built in? So instead of saying "Thanks", people would be able to send $0.25 in Bitcoin to those who helped them in the forums or made them laugh? What are the most established Python forums out there?

Thanks!

Launch a forum actually using Python...

Even the largest 'Python forum' is on phpBB...