Hacker News
Who needs to scrape millions of pages, or monitor them?
5 points by calufa 4665 days ago
Hi,

For the last year I have been working on an easy-to-use web scraper called Tales. Tales is written in Java and uses HTTP APIs to start scraping. It has an HTML dashboard where you can see, in real time, things like memory, CPU, pages scraped per second, errors, server health, and other dev-friendly goodies.

Tales gives you an out-of-the-box way of scraping HTML pages and putting them into S3 (e.g. ...:8080/start?process=tales.scrapers.LoopScraper -template tales.templates.DynamicDataDownloader -threads 2 -namespace com_twitter -baseURL twitter.com), but you can also extend it with custom scraping logic. An example of custom scraping would be to scrape titles, ratings, images, and blobs, and store them in MySQL using the simple Tales Java APIs.
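
As a rough illustration of the start call above, here is a minimal Java sketch that assembles the URL from its parts. The class name TalesStartUrl and the build method are made up for this example and are not part of Tales. The host and port are passed in by the caller, and percent-encoding the spaces is an assumption about what the server accepts.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tales: builds the /start URL shown in the post.
class TalesStartUrl {

    static String build(String hostAndPort, String process, String template,
                        int threads, String namespace, String baseUrl) {
        // The post passes everything after "process=" as one space-separated string.
        String value = process
                + " -template " + template
                + " -threads " + threads
                + " -namespace " + namespace
                + " -baseURL " + baseUrl;
        // URLEncoder encodes spaces as "+"; swap them for "%20".
        // Assumption: the server decodes percent-encoded query strings.
        String encoded = URLEncoder.encode(value, StandardCharsets.UTF_8)
                .replace("+", "%20");
        return "http://" + hostAndPort + "/start?process=" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(build("localhost:8080",
                "tales.scrapers.LoopScraper",
                "tales.templates.DynamicDataDownloader",
                2, "com_twitter", "twitter.com"));
    }
}
```

The resulting string can be opened in a browser or passed to any HTTP client.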

Tales is made of interesting services. Among them we can find:

- GitSync: keeps the code on the server up to date; all you need to do is push from your local computer.
- DirListener: among other important things, it compiles the services every time it sees a change.
- ServerMonitor: keeps track of the server health.
- S3DBBackup, S3DBRestore: back up and restore databases -- you may run out of space, or want to move.

Tales can run as many threads as you like, uses little memory and CPU, and can run for days. It can run on many servers at the same time, with all the databases located in one place or distributed across the servers, all manageable via the Java APIs or the config file.

Tales can also fail over to another server when blocked. The failover logic is defined by a Java interface, which you can implement to write custom IP pooling logic.
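
To give an idea of what custom IP pooling logic behind an interface might look like, here is a small sketch. The interface FailoverPolicy and the class RoundRobinPool are hypothetical names invented for this example; the real Tales interface may have a different shape. It simply rotates to the next server in a fixed list when the current one is reported as blocked.

```java
import java.util.List;

// Hypothetical interface for illustration; not the actual Tales failover interface.
interface FailoverPolicy {
    // Returns the server to switch to after blockedServer has been blocked.
    String nextServer(String blockedServer);
}

// Round-robin pool: moves to the next address in the list, wrapping around at the end.
class RoundRobinPool implements FailoverPolicy {
    private final List<String> servers;

    RoundRobinPool(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("pool must not be empty");
        }
        this.servers = List.copyOf(servers);
    }

    @Override
    public String nextServer(String blockedServer) {
        // indexOf returns -1 for an unknown server, which maps to the first entry.
        int i = servers.indexOf(blockedServer);
        return servers.get((i + 1) % servers.size());
    }

    public static void main(String[] args) {
        FailoverPolicy pool = new RoundRobinPool(List.of("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        System.out.println(pool.nextServer("10.0.0.1"));
    }
}
```

A smarter policy could skip servers that were blocked recently, but the idea of swapping implementations behind one interface stays the same.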

Tales has scraped tens of millions of pages across many domains.

* Source: https://github.com/calufa/tales-core
* Documentation is old; I will update it soon.

I am currently working with big data -- Solr, OpenNLP, all that sugar -- and I needed data from custom sources, and I didn't want to run 10 shells to get that done.

calufa@gmail.com

linkedin.com/in/calufa

2 comments

Cool !! Is it robots.txt compliant? If not, it might be a good idea to make this available as an option/parameter.

For 'quick and dirty' tasks, wget -r can come in handy too.

Currently it doesn't look at the robots.txt file. I have taken note of this, and it will be added in a future release. Thanks for the suggestion.

Very cool. Is it doing a depth-first blind crawl of any domain you throw at it?

It will basically go through all the links it finds, which can be millions of links. You can also tell it to ignore certain links using regular expressions or via the Tales Java APIs.

Here is sample code for a depth-1 scrape:

https://github.com/calufa/tales-templates/blob/master/core/s...

This is the API call to start the scraper on twitter:

http://localhost:8080/start?process=tales.scrapers.LoopScrap... -template tales.templates.FirstDepthTemplate -threads 2 -namespace com_twitter -baseURL twitter.com
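
The idea of ignoring certain links with regular expressions, mentioned earlier in this reply, can be sketched in plain Java as follows. LinkFilter is a made-up class for illustration and does not use the Tales APIs; the patterns in main are only examples.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Tales: drops links that match any ignore pattern.
class LinkFilter {
    private final List<Pattern> ignore;

    LinkFilter(List<String> ignoreRegexes) {
        // Compile once so the patterns can be reused for every link.
        this.ignore = ignoreRegexes.stream()
                .map(Pattern::compile)
                .collect(Collectors.toList());
    }

    // A link is followed only if no ignore pattern is found anywhere in it.
    boolean shouldFollow(String url) {
        return ignore.stream().noneMatch(p -> p.matcher(url).find());
    }

    List<String> filter(List<String> urls) {
        return urls.stream().filter(this::shouldFollow).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        LinkFilter f = new LinkFilter(List.of("/login", "\\.(png|jpg)$"));
        System.out.println(f.filter(List.of(
                "http://twitter.com/about",
                "http://twitter.com/login",
                "http://twitter.com/logo.png")));
    }
}
```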