by friism 6645 days ago
Scraping EU public procurement contracts from the "Tenders Electronic Daily" database (http://ted.europa.eu/). There are more than a million documents, each requiring up to two requests. Been at it for several weeks with a multithreaded scraper and we're almost through. Using Solvent (simile.mit.edu/solvent/) to generate XPath expressions and HtmlAgilityPack (www.codeplex.com/htmlagilitypack) to run the XPath on the downloaded HTML, with regexps as the topping. They're a match made in heaven (http://www.itu.dk/~friism/blog/?p=40).
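For anyone curious what the XPath-then-regexp combination looks like, here is a rough sketch. It is in Python with the standard library rather than the .NET/HtmlAgilityPack setup described above. The sample page, the table layout, and the names `extract_fields` and `parse_amount` are all made up for illustration; none of it reflects the real TED markup. Note also that ElementTree only handles well-formed markup and a limited XPath subset, whereas HtmlAgilityPack copes with messy real-world HTML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a downloaded notice page.
# The real TED pages are not shown in this thread.
SAMPLE_PAGE = """
<html><body>
  <table id="notice">
    <tr><td class="label">Title</td><td class="value">Road maintenance services</td></tr>
    <tr><td class="label">Value</td><td class="value">Total: 1 250 000 EUR excl. VAT</td></tr>
  </table>
</body></html>
"""

# The regexp "topping": pulls a space-grouped amount and a 3-letter currency
# code out of free text that the XPath step has already isolated.
AMOUNT_RE = re.compile(r"(\d[\d ]*)\s*([A-Z]{3})")


def extract_fields(page):
    """Use XPath-style paths to map each label cell to its value cell."""
    root = ET.fromstring(page)
    fields = {}
    for row in root.findall(".//table[@id='notice']/tr"):
        label = row.find("td[@class='label']").text.strip()
        value = row.find("td[@class='value']").text.strip()
        fields[label] = value
    return fields


def parse_amount(text):
    """Return (amount, currency) from free text, or None if nothing matches."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    digits = match.group(1).replace(" ", "")
    return int(digits), match.group(2)


if __name__ == "__main__":
    fields = extract_fields(SAMPLE_PAGE)
    print(fields["Title"])
    print(parse_amount(fields["Value"]))
```

The split is the point: XPath narrows the page down to the cell you care about, and the regexp only has to deal with a short string instead of the whole document.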

The login procedure is gothic and took a lot of wiresharking to figure out. .NET has pretty good scraping support in the WebClient and HttpWebRequest classes found in the System.Net namespace.
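The general shape of a scripted login is a form POST followed by requests that resend whatever cookies the server set. A minimal sketch of that, again in Python's standard library rather than WebClient/HttpWebRequest, is below. The URL, the form field names, and the helper names are placeholders. The actual TED login sequence isn't described here and is surely more involved than a single POST.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Placeholder only: the real login URL and form fields are not given in this thread.
LOGIN_URL = "https://example.invalid/login"


def build_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def make_login_request(url, fields):
    """Build a form-encoded POST request without sending it."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def login(opener, url, fields, timeout=30):
    """Send the login POST; cookies set by the server end up in the opener's jar."""
    with opener.open(make_login_request(url, fields), timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    opener, jar = build_session()
    request = make_login_request(LOGIN_URL, {"user": "alice", "password": "s3cret"})
    print(request.get_method(), request.data)
    # login(opener, LOGIN_URL, {...}) would perform the request against a real server.
```

Reusing the same opener for every later request is what keeps the session alive. On the .NET side the rough equivalent is sharing one CookieContainer across HttpWebRequest instances.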

Will publish results soon... :-)

1 comment

Be careful here. The content is actually copyrighted. Whilst you can scrape it, their T&Cs expressly forbid doing so. They sell licenses to access this information. The license is NOT expensive and gives direct access to all the data in XML.
http://ted.europa.eu/Exec?DataFlow=ShowPage.dfl&Template...

Quote: "Reproduction is authorised provided the source is acknowledged. However, to prevent disruptions in service to our normal users from bulk downloads of TED data, we reserve the right to check for, and block, attempts to download excessive quantities of documents, particularly using automated or robot-like tools."

... they apparently chose not to exercise that right in this case; the scrape completed last night (all 18 GB of it).