Hacker News
by bruno2223 3334 days ago
Yes, indeed, scraping is the easiest part.

Saving everything in a way that lets you use it later is much harder (and more expensive), IMHO.

2 comments

I'd argue that this is highly dependent on the type of data you scrape and what you want to do with the data.

If you have a good data model, then categorizing, storing and searching the final result isn't a big problem, and the scraping is the complicated part. If you don't have a specific kind of resource you are scraping and just dump everything into some storage solution with no structure, that's going to be the hard part, while scraping is the easy part.
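
To illustrate the "good data model" case, here's a minimal sketch using Python's built-in sqlite3. The `Listing` record, its fields, and the `listing` table are all hypothetical, picked only for illustration; the point is just that once every scraped item fits a known shape, storing and searching come down to an insert and an indexed query.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class Listing:
    """Hypothetical record type: one scraped item with a fixed shape."""
    url: str
    title: str
    price_cents: int
    category: str


def make_store() -> sqlite3.Connection:
    """Create an in-memory database with one table and one index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE listing ("
        "url TEXT PRIMARY KEY, title TEXT, price_cents INTEGER, category TEXT)"
    )
    # The index is what keeps category lookups cheap as the table grows.
    conn.execute("CREATE INDEX idx_listing_category ON listing (category)")
    return conn


def save(conn: sqlite3.Connection, item: Listing) -> None:
    """Insert an item; re-scraping the same URL overwrites the old row."""
    conn.execute("INSERT OR REPLACE INTO listing VALUES (?, ?, ?, ?)", astuple(item))


def cheapest_in(conn: sqlite3.Connection, category: str, limit: int = 5) -> list[Listing]:
    """Return the cheapest items in a category, lowest price first."""
    rows = conn.execute(
        "SELECT url, title, price_cents, category FROM listing "
        "WHERE category = ? ORDER BY price_cents LIMIT ?",
        (category, limit),
    )
    return [Listing(*row) for row in rows]


store = make_store()
save(store, Listing("https://example.com/lamp", "Lamp", 1999, "home"))
save(store, Listing("https://example.com/mug", "Mug", 499, "home"))
save(store, Listing("https://example.com/bike", "Bike", 25000, "sport"))
print([item.title for item in cheapest_in(store, "home")])
```

With no such shape (raw HTML dumped into a blob store), there is nothing to index, so every question you ask later means re-parsing everything.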

In theory, say you want to index one billion (10^9) web pages. Using modern hardware, you should be able to crawl 10,000 pages per second, which would take ca. 28 hours (10^9 / 10^4 = 10^5 seconds), and if you save 1 kB of text from each page, that would be ca. 1 TB of data. Doing a plain text search over 1 TB of text would take some time though, maybe minutes. You could partition the data between servers to cut that down.
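
The estimate is easy to check with a few lines of Python. This is only a rough sketch: the page count, crawl rate and 1 kB per page come from the comment, while the 2 GB/s sequential scan speed and the `estimate` helper are assumptions added here for illustration.

```python
def estimate(pages: int, pages_per_sec: float, bytes_per_page: int,
             scan_bytes_per_sec: float, servers: int = 1) -> dict:
    """Back-of-the-envelope crawl time, storage size and full-scan time."""
    crawl_hours = pages / pages_per_sec / 3600
    total_bytes = pages * bytes_per_page
    # A brute-force text search reads every byte once; splitting the data
    # evenly across servers divides that time by the number of servers.
    scan_minutes = total_bytes / scan_bytes_per_sec / servers / 60
    return {
        "crawl_hours": crawl_hours,
        "total_tb": total_bytes / 10**12,  # decimal terabytes
        "scan_minutes": scan_minutes,
    }


# Figures from the comment, plus an ASSUMED 2 GB/s sequential read speed.
single = estimate(10**9, 10_000, 1_000, scan_bytes_per_sec=2 * 10**9)
sharded = estimate(10**9, 10_000, 1_000, scan_bytes_per_sec=2 * 10**9, servers=10)

print(f"crawl: {single['crawl_hours']:.1f} h, data: {single['total_tb']:.1f} TB")
print(f"scan on 1 server: {single['scan_minutes']:.1f} min")
print(f"scan on 10 servers: {sharded['scan_minutes']:.1f} min")
```

Under those assumptions the crawl comes out just under 28 hours and a single-machine scan takes a bit over 8 minutes, which fits the "maybe minutes" guess. Ten partitions bring the scan to under a minute.
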
I use CouchDB with replication and PostgreSQL as a data warehouse.

Anyway, I'm a noob, but after reading here and there, this is what I decided to use.

For scraping I'm using Scrapy + Selenium and a modified JS script that uses Chrome (webscraper.io).

I would just make a naive implementation first, instead of searching for the optimal tools and solutions.