by friism
6645 days ago
Scraping EU public procurement contracts from the "Tenders Electronic Daily" database (http://ted.europa.eu/). There are more than a million documents, each requiring up to two requests. Been at it for several weeks with a multithreaded scraper and we're almost through. Using Solvent (simile.mit.edu/solvent/) to generate XPath expressions and HtmlAgilityPack (www.codeplex.com/htmlagilitypack) to run the XPath on the downloaded HTML, with regexps as the topping. They're a match made in heaven (http://www.itu.dk/~friism/blog/?p=40). The login procedure is gothic and took a lot of wiresharking to figure out. .NET has pretty good scraping support in the WebClient and HttpWebRequest classes found in the System.Net namespace. Will publish results soon... :-)
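For anyone wondering what "XPath with regexps as the topping" looks like in practice, here is a rough sketch of the idea. It is not the actual scraper: it uses Python's standard-library ElementTree (which only supports a subset of XPath and needs well-formed markup) in place of HtmlAgilityPack, and the page layout, the `notice` table id, the `value` class, and the `extract_contract_value` helper are all made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a downloaded notice page.
# The real pages will look different; this layout is an assumption.
SAMPLE_PAGE = """
<html>
  <body>
    <table id="notice">
      <tr><td class="label">Contract value</td><td class="value">Total: 1 250 000 EUR (excl. VAT)</td></tr>
      <tr><td class="label">Awarded to</td><td class="value">Example Contractor A/S</td></tr>
    </table>
  </body>
</html>
"""

# The regex "topping": digits optionally grouped by spaces, then a
# three-letter currency code.
AMOUNT_RE = re.compile(r"(\d[\d ]*\d|\d)\s*([A-Z]{3})")


def extract_contract_value(page):
    """Return (amount, currency) from the first value cell, or None."""
    root = ET.fromstring(page.strip())
    # XPath step: pick the first cell with class="value" in the notice table.
    cell = root.find(".//table[@id='notice']/tr/td[@class='value']")
    if cell is None or cell.text is None:
        return None
    # Regex step: pull the structured bits out of the cell's free text.
    match = AMOUNT_RE.search(cell.text)
    if match is None:
        return None
    amount = int(match.group(1).replace(" ", ""))
    return amount, match.group(2)


if __name__ == "__main__":
    print(extract_contract_value(SAMPLE_PAGE))
```

The split is the point: the XPath gets you to the right node even when the surrounding layout is noisy, and the regex only has to deal with one short string instead of the whole page.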