by alexmobile
4076 days ago
Thanks for posting - this is amazing... I was actually working on a far-from-finished article, "How a Search Engine Startup company could compete with Google" http://bitexperts.com/Question/Detail/42/how-a-search-engine... when this announcement came across. I will be looking at what kind of crawlers they release. Hopefully some modern ones based on a WebKit / Chromium core that expose the DOM and are suitable for navigating all these AJAX-driven modern web interfaces. I am also very interested to see what kind of machine learning / classifiers they are using. When working on our search engine, we used purely statistical classifiers: all variations of Bayes, SVM (Support Vector Machines) and C4.5 decision trees, glued together with some custom algorithms. We did not use neural nets at all. Nowadays neural nets have a new name, "deep learning", and seem to be everywhere. It will be really interesting to see what people build with an Open Source Search Engine. Watch out Google :)
|
https://github.com/scrapinghub/splash
People often use Scrapy + Splash together in the Python community for crawling more dynamic websites.
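For anyone who has not tried Splash: it runs as an HTTP service, and you ask it to render a page and hand back the HTML after the JavaScript has run. A minimal sketch using only the standard library, assuming a Splash instance is listening on its default port 8050 on localhost (the helper names `splash_render_url` and `fetch_rendered_html` are made up for illustration):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash is running locally on its default port.
SPLASH_BASE = "http://localhost:8050"


def splash_render_url(page_url, wait=0.5, base=SPLASH_BASE):
    """Build a render.html request URL for Splash.

    `url` is the page to render; `wait` is how many seconds Splash
    should wait after load so that scripts get a chance to run.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{base}/render.html?{query}"


def fetch_rendered_html(page_url, wait=0.5, base=SPLASH_BASE):
    """Fetch the JavaScript-rendered HTML (needs a live Splash instance)."""
    with urlopen(splash_render_url(page_url, wait, base)) as resp:
        return resp.read().decode("utf-8", errors="replace")


print(splash_render_url("http://example.com/?a=1"))
```

Inside a Scrapy project you would more likely route requests through Splash from the spider itself, but the plain HTTP call shows what is happening underneath.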
A team I collaborated with is also working on a project to make Scrapy usable in a "cluster context"; it's called scrapy-cluster. The idea is to run Scrapy workers across machines, all sharing a single crawl queue (powered by Redis in the current prototype).
https://github.com/istresearch/scrapy-cluster
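To make the "many workers, one queue" idea concrete, here is a toy sketch. It is not scrapy-cluster's code: the shared queue is an in-memory stand-in for Redis (where a list plus a set would play the same roles), the workers are simulated round-robin in one process, and `SharedFrontier`, `run_workers` and the `LINKS` graph are all invented for illustration.

```python
from collections import deque


class SharedFrontier:
    """In-memory stand-in for the single crawl queue shared by all workers."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def push(self, url):
        """Enqueue a URL unless it was already scheduled; return True if added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


def run_workers(frontier, worker_names, extract_links):
    """Simulate several workers taking turns pulling from one frontier."""
    crawled = {name: [] for name in worker_names}
    while True:
        progressed = False
        for name in worker_names:
            url = frontier.pop()
            if url is None:
                continue
            progressed = True
            crawled[name].append(url)
            # Newly discovered links go back into the shared queue.
            for link in extract_links(url):
                frontier.push(link)
        if not progressed:
            return crawled


# Hypothetical link graph standing in for real pages.
LINKS = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}

frontier = SharedFrontier()
frontier.push("a")
result = run_workers(frontier, ["worker-1", "worker-2"], lambda u: LINKS.get(u, []))
print(result)
```

Because deduplication lives in the shared frontier rather than in each worker, no page is crawled twice no matter which worker discovers it.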