| HN Mirror



	by bruno2223 3381 days ago
	Yes, indeed, scraping is the easiest part. Saving everything in a way that lets you use it later is much harder (and more expensive), IMHO.

2 comments

dewey 3380 days ago

I'd argue that this is highly dependent on the type of data you scrape and what you want to do with the data.

If you have a good data model, categorizing, storing, and searching the final result isn't a big problem, and the scraping is the complicated part. If you aren't scraping a specific kind of resource and just dump everything into some storage solution with no structure, that's going to be the hard part, while scraping is the easy part.


z3t4 3380 days ago

In theory, say you want to index one billion (10^9) web pages. Using modern hardware, you should be able to crawl 10,000 web pages per second, which would take about 28 hours (10^9 / 10^4 = 10^5 seconds), and if you save 1 KB of text from each page, that would be about 1 TB of data. Doing a text search over 1 TB of text would take some time though, maybe minutes. You could partition the data between servers, though.
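The numbers above can be checked with a quick back-of-envelope script. This is only a sketch: the page count, crawl rate, and 1 KB per page come from the comment, while `SCAN_BYTES_PER_SEC` (1 GB/s of sequential scanning per server) is a made-up figure chosen purely for illustration, and `scan_minutes` assumes the data splits evenly across servers.

```python
# Figures taken from the comment above.
PAGES = 10**9            # one billion pages
CRAWL_RATE = 10_000      # pages crawled per second
BYTES_PER_PAGE = 1_000   # 1 KB of text kept per page

# ASSUMPTION (not from the thread): each server scans 1 GB/s sequentially.
SCAN_BYTES_PER_SEC = 10**9

total_bytes = PAGES * BYTES_PER_PAGE
crawl_hours = PAGES / CRAWL_RATE / 3600
total_tb = total_bytes / 10**12


def scan_minutes(servers: int) -> float:
    """Minutes for a full linear scan, with data split evenly across servers."""
    return total_bytes / (SCAN_BYTES_PER_SEC * servers) / 60


print(f"crawl time: {crawl_hours:.1f} hours")
print(f"stored text: {total_tb:.1f} TB")
for n in (1, 10, 100):
    print(f"full scan on {n} server(s): {scan_minutes(n):.2f} minutes")
```

Under that assumed scan rate, a single machine needs on the order of a quarter of an hour for a brute-force pass, and partitioning cuts that roughly in proportion to the number of servers.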


iagovar 3379 days ago

I use CouchDB with replication and PostgreSQL as a data warehouse.

Anyway, I'm a noob, but from reading here and there, this is what I decided to use.

For scraping I'm using Scrapy + Selenium and a modified JS script that uses Chrome (webscraper.io).


z3t4 3378 days ago

I would just make a naive implementation instead of searching for the optimal tools and solutions.
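As one possible reading of "naive", the storage side could start as a single SQLite table with a brute-force substring search. This is a sketch under assumptions, not anything proposed in the thread: the `pages` table and the helper names `open_store`, `save_page`, and `search` are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a one-table store for scraped text."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def save_page(conn: sqlite3.Connection, url: str, body: str) -> None:
    """Insert a page, overwriting any earlier copy of the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, body)
    )
    conn.commit()


def search(conn: sqlite3.Connection, term: str) -> list[str]:
    """Case-sensitive substring match; scans every row, no index."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE instr(body, ?) > 0 ORDER BY url", (term,)
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_store()
    save_page(conn, "https://example.com/a", "scraping is the easy part")
    save_page(conn, "https://example.com/b", "storage is the hard part")
    print(search(conn, "hard"))
```

A full scan like this is fine for small datasets and makes it obvious when it stops being fast enough, which is the point at which picking better tools starts to pay off.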
