by nreece 6644 days ago
Our startup, Feedity (http://feedity.com), generates RSS feeds from virtually any webpage, for content tracking and for reusing data in mashups.

We scrape public webpages (with an option for content owners to restrict access), and we use the .NET Framework's built-in socket implementation (System.Net namespace) to fetch remote content.

Our biggest frustration was dealing with invalid charset/content encoding on the source webpages. We resolved it with a custom module, so everything we parse is now Unicode (UTF-8)!
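
To give a rough idea of what handling a bad charset can look like, here is a minimal Python sketch. It is not our actual module (that runs on .NET); the name decode_page, the meta-tag regex, and the fallback order (declared charset, then a meta tag, then UTF-8, then Windows-1252) are all assumptions made for illustration.

```python
import codecs
import re
from typing import Optional

# Hypothetical pattern: finds a charset label inside an HTML <meta> tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?\s*([\w.:-]+)', re.IGNORECASE)


def decode_page(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode page bytes to text, tolerating wrong or missing charset labels."""
    candidates = []
    if declared:
        # Charset taken from the HTTP Content-Type header, if any.
        candidates.append(declared)
    match = META_CHARSET.search(raw[:4096])
    if match:
        # Charset announced by the page itself.
        candidates.append(match.group(1).decode("ascii", "replace"))
    candidates.append("utf-8")

    for name in candidates:
        try:
            codecs.lookup(name)  # raises LookupError for unknown labels
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue
    # Assumed last resort: Windows-1252 with replacement characters never fails.
    return raw.decode("cp1252", errors="replace")
```

Once the text is decoded, calling .encode("utf-8") on the result gives a single consistent encoding for everything downstream.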

The coolest hack we've encountered while scraping is using the Conditional GET behavior via the HTTP If-Modified-Since header.
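
For anyone who hasn't used it: you send back the Last-Modified value from your previous fetch, and a server that supports it answers 304 Not Modified with no body if the page hasn't changed. A minimal Python sketch with the standard library follows; it is not our .NET code, and the function names and the 10-second timeout are assumptions for illustration.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def build_conditional_request(url: str, last_modified: Optional[str]) -> urllib.request.Request:
    """Build a GET request that carries If-Modified-Since when we have a prior value."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url: str, last_modified: Optional[str] = None) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (body, new Last-Modified), or (None, old value) if the page is unchanged."""
    request = build_conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not Modified: nothing to download or re-parse.
            return None, last_modified
        raise
```

Store the returned Last-Modified string next to each source URL and pass it back on the next poll. Servers that ignore the header simply return the full page with a 200, so the fallback is the same as a plain GET.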