Ask YC: What do you scrape? How do you scrape? | HN Mirror

Ask YC: What do you scrape? How do you scrape?

51 points by schaaf 6645 days ago

Theory-wise, there are regular languages, context-free grammars, and combinatory categorial grammars ( http://openccg.sf.net ). But regular + lists seems adequate for most tasks.

What sorts of scraping do you find yourself doing?

What are your biggest frustrations?

What's the coolest hack you've encountered while scraping?

My cofounder and I have been working on a domain-specific language to make scraping quick and easy, so that you can write, say, 100 different website scrapers in less time -- http://dartbanks.com/simplescrape . We'd love feedback on this approach.

36 comments

foodawg 6645 days ago

I do all my screen scraping with PHP, curl, and some regex. Previously I used plain PHP.

I use it to scrape television listing data (http://ktyp.com/rss/tv/ was my old site, and http://code.google.com/p/listocracy/) and more recently to scrape resume data from job posting websites for a (YC-rejected :P ) side project I'm working on.

The hardest part I've encountered with scraping is odd login and form setups. For example Monster.com uses an outside script to attempt to fool scraping. A couple other sites use bizarre redirecting across pages. Also AJAX certainly has changed the way a lot of screen scraping is done.

Finally, the most useful tool I've used is LiveHTTPHeaders (http://livehttpheaders.mozdev.org/) which is great for following how a site operates.

Edit: For PHP, another interesting tool for scraping is htmlSQL (http://www.jonasjohn.de/lab/htmlsql.htm) which allows HTML to be searched using SQL like syntax.

m0nty 6645 days ago

"Also AJAX certainly has changed the way a lot of screen scraping is done."

I'd be interested in how you tackle this one. I've always used something like Perl/Curl/wget etc for scraping, but (like you say) JavaScript messes that up. I've had moderate success using GreaseMonkey and regexps in JavaScript code, but it's a bit fragile. I'm thinking of using GreaseMonkey + jQuery, since that should allow me to select DOM elements very easily. But if you have a better way, please share :)

alex_c 6645 days ago

Even though it's actually a testing tool, you might have some luck with Canoo Webtest + Groovy (http://webtest.canoo.com). Webtest uses HtmlUnit which has pretty good Javascript support, and means you don't have to mess with regexps to get around the document structure, and Groovy lets you use an actual programming language rather than the awkward Ant-based syntax of Webtest. It takes some getting used to, and I haven't used it for web scraping, but it's a pretty powerful combination.

m0nty 6645 days ago

Thanks, I'll give it a try. I'm collaborating on a project which involves getting info from online financial markets, btw, but it's getting held up because of this scraping problem. So new ideas might help get it moving again.

henning 6645 days ago

If you really want to go bonkers on scraping, there are books on this.

http://nostarch.com/frameset.php?startat=webbots

http://www.oreilly.com/catalog/spiderhks/

That probably covers the topic of scraping pretty exhaustively.

inovica 6645 days ago

We are working on a lot of scraping and analysis and here are a few links that you might be interested in if you are using Python:

http://pyro.sourceforge.net/

http://pyprocessing.berlios.de/

http://www.sqlalchemy.org/

http://codespeak.net/lxml/

http://nltk.org/index.php/Main_Page

The biggest hurdle is understanding how to navigate through a complex site, such as a forum or a real estate listing site. We have created a visual tool for this; however, there are other methods. Look at dapper.net, as it is useful here.

I am wondering if there could be some collaborative effort from the minds on this site to create something unique and groundbreaking.

hooande 6645 days ago

I've done my share of screen scraping, gathering all different kinds of data. Movies, sports, finance, you name it. Here are three things I can tell you:

1. Take the time to get very familiar with regular expressions. If you think you know your regex pretty well, go to the docs or get a book and find three things you don't understand and understand them fully. Then find three more.

2. The data doesn't have to be perfect. In most cases you can clean it after you've stored it. It's generally better to get more than you think you might need (in terms of data or html/formatting around the data) and then go back and clean it later.

3. Generally, my most successful data mining algorithms involve a lot of hacks. There are very few clean formulas... usually I have to play with the data for a while and fix a lot of one-offs and special cases, and then it ends up coming out ok.

jraines 6645 days ago

I use Ruby, with its nice regex support and libraries (open-uri, REXML) and the hpricot and mechanize rubygems.

Yahoo Pipes is also fun to play with; and Firebug is the scraper's best friend.

Right now I'm working on scraping public LinkedIn data. In the past I've done Craigslist and Twitter. I haven't done anything really hard, though -- mostly things that can be read as XML.

Here's a few cool links if you're interested in scraping with Ruby: http://del.icio.us/jeremyraines/scraping

vikram 6645 days ago

I'm working on something similar. It turns out scraping is a small part of the problem. I don't use BeautifulSoup. It turns out you can transform the HTML of a page into a list, which can easily be scraped.

Now that I have used it to extract data from many different types of pages, I'm looking to turn it into a DSL so that the code looks natural. Currently it's just functions which search for tags in HTML. You can then easily filter some or others. Here is an example:

(extract-all page [(and (tagp _ :a) (classp _ "jdtd4"))])
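The Lisp-style call above could be approximated in Python roughly as follows. This is only a sketch and not the poster's code: `Flattener`, `tagp`, `classp`, `all_of`, and `extract_all` are hypothetical names modelled on the example, built on the standard library's `html.parser`.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not stay "open".
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Flattener(HTMLParser):
    """Turn an HTML page into a flat list of {tag, attrs, text} records."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # every element seen, in document order
        self._open = []   # stack of elements whose end tag has not been seen

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.nodes.append(node)
        if tag not in VOID:
            self._open.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i]["tag"] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Attribute text to the innermost open element only.
        if self._open:
            self._open[-1]["text"] += data


def tagp(name):
    return lambda node: node["tag"] == name


def classp(name):
    return lambda node: name in (node["attrs"].get("class") or "").split()


def all_of(*preds):
    return lambda node: all(p(node) for p in preds)


def extract_all(page, pred):
    parser = Flattener()
    parser.feed(page)
    parser.close()
    return [node for node in parser.nodes if pred(node)]
```

With these helpers, the example would read `extract_all(page, all_of(tagp("a"), classp("jdtd4")))`.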

petercooper 6645 days ago

The most powerful general-purpose scraping tool I've come across lately has been ScRUBYt: http://scrubyt.org/ ... although I admit I don't have much occasion to use it.

It lets you specify which items on an initial / prototype page you want to scrape, and then it builds up a set of rules that then work on future similar instances of that page. Good for scraping eBay, Google, stuff like that.

thorax 6645 days ago

I use BeautifulSoup when needed for simple scraping.

My biggest frustrations, right now, are really around getting data from lots of different websites in subtly varied forms. This is a tough problem to automate. I certainly haven't found any tools that make it simple.

I'd be happy with a 50% correctness rate, looking for very loose patterns. I just haven't found a tool and, while I have some ideas for how to do it, it's a major project in itself to produce something that can do this.

For example, imagine writing a scraper that would parse out every food recipe online. Whether it be in forums, blogs, etc, etc. That's the sort of scraping I'm looking for and the best I'd have is putting together a neural network or other system that I can train against human-provided data. Unfortunately getting such a system to partition the text to just the recipe would be difficult.

fallentimes 6645 days ago

Getting just the recipe would be the hardest part, but it's still doable. Once you figure out that you're currently parsing a recipe (via keywords, close matching, whatever) you could fan out and look for common start/end tags like <p>, <div>, etc. If you use something like Beautiful Soup you could do this pre-parse instead of post-parse and eliminate a lot of extra stuff (no recipes in the <head> tag, etc.)

After that it just becomes an issue of removing the cruft around the recipe. I would start with common stuff: splitting things up by <br> or inner <p> since if someone is gonna have something before / after their recipe (say, on a forum) it'll be split up with blank lines somehow (well, usually). This will be another time to use things like close matching and teaching the algorithm what it gets right/wrong so it can weigh things as recipe/not better in the future.

If you do all this and add more specific edge cases as time goes on, I think you'd be able to maintain a 50% correctness rate pretty easily.

Edit: And it'd be much cheaper than a neural network ;)
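A rough Python sketch of the first two steps described above (split the page on `<br>`, `<p>`, and `<div>` boundaries, then score each block by close keyword matches). The keyword list, the 0.8 match cutoff, and the 0.2 threshold are made-up assumptions, and the "teach the algorithm what it gets right/wrong" step is left out entirely.

```python
import difflib
import re

# Hypothetical seed vocabulary; a real system would curate or learn this.
RECIPE_WORDS = ["cup", "tablespoon", "teaspoon", "flour", "sugar", "butter",
                "bake", "oven", "stir", "ingredients", "preheat", "minutes"]

BLOCK_BREAK = re.compile(r"<\s*(?:br|/?p|/?div)\b[^>]*>", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def split_blocks(html):
    """Split on <br>, <p>, and <div> boundaries, then strip leftover tags."""
    pieces = (TAG.sub(" ", piece).strip() for piece in BLOCK_BREAK.split(html))
    return [piece for piece in pieces if piece]


def recipe_score(block):
    """Fraction of words in the block that closely match a recipe keyword."""
    words = re.findall(r"[a-z]+", block.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words
               if difflib.get_close_matches(word, RECIPE_WORDS, n=1, cutoff=0.8))
    return hits / len(words)


def recipe_blocks(html, threshold=0.2):
    """Keep only the blocks that look recipe-like."""
    return [block for block in split_blocks(html)
            if recipe_score(block) >= threshold]
```

Close matching via `difflib` lets plurals such as "cups" or "tablespoons" count without listing every form.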

danohuiginn 6645 days ago

*nod* I've thought about this a fair amount, too. You can do a lot to, say, figure out which pages contain recipes, even identify the structured information like ingredient lists (they're just lists full of foodstuffs and quantities). But IME it all falls apart when you need to find a block of text - like the descriptive part of the recipe. That's rarely marked up very clearly, and tends to blend into the rest of the text. So you either miss parts of the recipe, or pick up chunks of junk from the rest of the page.

That said, it's likely do-able, as long as you don't need perfect results. There are plenty of sites around that seem to be doing things along these lines - but AFAIK none of them have open-sourced their code.

Meanwhile, I've been a coward and stuck to beautiful soup for my scraping projects. In the short term, it works out faster than trying to be too clever.

imrobotmaker 6645 days ago

I use curl, wget and links to retrieve data from sites and then I filter it with old sed and grep.

I created a mashup of AIM + Flickr.

If you use AIM 6 or AIM lite send a message to MyPictureBuddy

then send a message and enjoy.

Basically, you type a keyword and it goes to Flickr and retrieves image information to display pictures right inside your AIM chat session.

I also have another bot that parses the Hacker News XML and then displays it in the chat session. The bot name is

HackerNewsYC

jharrison 6645 days ago

I used Mechanize and Hpricot on a project recently to create a sort of poor-man's API. My client is a performing arts organization that wanted a new website but they already had a (dreadful) internally-hosted site for selling tickets.

In order to keep website users from having 2 accounts I created an interface that scrapes the sign in, sign up, lost password, change password, and couple other screens of the internal system. So when users come to the website and "login" they're actually logging in to the internal system and I just record their session from the internal system so I can masquerade as them as they go about their business.

It's not going to support 100s of connections per second but it gets the job done for their traffic levels (36,000 views the first day of launch).

nreece 6644 days ago

Our startup, Feedity - http://feedity.com , generates/creates RSS web feeds from virtually any webpage, for the purpose of content tracking and mashup data reuse.

We scrape public webpages (with an option for content owners to restrict access), and we use the .NET Framework's built-in socket implementation (System.Net namespace) for fetching remote content.

Our biggest frustration was to deal with invalid charset/content encoding of the source webpages. But we resolved it using a custom module. Now everything we parse is unicode (utf-8)!
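The custom module itself isn't described, so the following is only a guess at the general shape of such a step, written in Python rather than .NET. The function name `to_unicode`, the order of fallbacks, and the final cp1252 fallback are all assumptions.

```python
import re

# Looks for charset=... inside a <meta> tag near the top of the page.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w.:-]+)', re.IGNORECASE)


def to_unicode(raw, header_charset=None):
    """Decode page bytes, trying the declared encodings before guessing."""
    candidates = []
    if header_charset:
        candidates.append(header_charset)       # from the Content-Type header
    match = META_CHARSET.search(raw[:4096])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name, or bytes that do not fit it
    # Last resort: cp1252 with replacement characters never raises.
    return raw.decode("cp1252", errors="replace")
```

Once the text is a Python `str`, writing it back out with `.encode("utf-8")` gives the uniform UTF-8 output mentioned above.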

The coolest hack we've encountered while scraping is utilizing the Conditional GET behavior using the HTTP If-Modified-Since header.
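For readers who haven't used it, a conditional GET can be sketched with Python's `urllib` as below. This is an illustration in a different stack from the .NET one described; `conditional_request` and `fetch_if_changed` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def conditional_request(url, last_modified=None):
    """Build a GET that lets the server skip the body if nothing changed."""
    req = urllib.request.Request(url)
    if last_modified:
        # Value is the Last-Modified header saved from the previous fetch.
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_if_changed(url, last_modified=None, timeout=30):
    """Return (body, last_modified); body is None on 304 Not Modified."""
    req = conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read(), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:      # unchanged since last visit; nothing to parse
            return None, last_modified
        raise
```

The saving comes from the 304 case: the server sends headers only, so an unchanged page costs almost no bandwidth and no parsing.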

mosburger 6645 days ago

Either Beautiful Soup, or Yahoo Pipes... I have a website that parses RSS feeds, and some sites don't have feeds yet! Or if they do, they aren't usable. So I use Pipes to scrape a page and turn it into a feed using their regexp operator, then my site uses that feed.
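Outside of Pipes, the same page-to-feed trick can be sketched in a few lines of Python. The function name `page_to_rss` and the requirement that the pattern define `link` and `title` groups are assumptions made for this example, not anything Pipes itself exposes.

```python
import html
import re
import xml.etree.ElementTree as ET


def page_to_rss(page, item_pattern, feed_title, feed_link):
    """Build an RSS 2.0 document from regex matches on a page.

    item_pattern must define named groups 'link' and 'title'.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Scraped from " + feed_link
    for match in re.finditer(item_pattern, page):
        item = ET.SubElement(channel, "item")
        # Undo HTML entities; ElementTree re-escapes for XML on output.
        ET.SubElement(item, "title").text = html.unescape(match.group("title")).strip()
        ET.SubElement(item, "link").text = html.unescape(match.group("link"))
    return ET.tostring(rss, encoding="unicode")
```

A pattern such as `r'<h2><a href="(?P<link>[^"]+)">(?P<title>[^<]+)</a></h2>'` would cover a page that lists its posts as linked `<h2>` headings.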

jauco 6645 days ago

I'm surprised nobody has mentioned dapper yet (www.dapper.net). It's a really nice approach to turning websites into structured content.

herdrick 6645 days ago

HtmlPrag turns any HTML into nice s-expressions. It's a Scheme library. http://www.neilvandyke.org/htmlprag/

I've used it a lot - it's really great.

aquateen 6645 days ago

I used Hpricot to scrape web.archive and reddit to make http://reredd.com.

Plan on scraping past billboard charts to let people listen to the radio back in time.

dangoldin 6645 days ago

I come from a Perl background so I've been using HTML::TreeBuilder and XML::TreeBuilder to do my parsing. It will basically load an HTML/XML file into its own tree structure and give you an easy way to go through it. By knowing how each site names their divs/classes I am able to scrape.

I took a quick glimpse at beautiful soup and it seems to be doing something similar - someone let me know if this is correct.

mrtron 6645 days ago

Yes. You can even regex search through the tree. Weeeeee!

BeautifulSoup is nothing unique, but it can handle malformed data, which saves you a ton of hassle.

friism 6645 days ago

Scraping EU public procurement contracts from the "Tenders Europa Daily" database (http://ted.europa.eu/). There's more than a million documents with each document requiring up to two requests. Been at it for several weeks with a multithreaded scraper and we're almost through. Using Solvent (simile.mit.edu/solvent/) to generate xpath expressions and HtmlAgilityPack (www.codeplex.com/htmlagilitypack) to run the xpath on the downloaded html with regexps as the topping. They're a match made in heaven (http://www.itu.dk/~friism/blog/?p=40).

The login procedure is gothic and took a lot of wiresharking to figure out. .Net has pretty good scraping-support in the WebClient and HttpWebRequest classes found in the System.Net namespace.

Will publish results soon... :-)
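The scraper described above was written in .NET; as a language-neutral illustration of the multithreaded download step only, here is a small Python sketch. `fetch_all` and its `fetch_one` hook are hypothetical names, and the worker count is deliberately low on the assumption that the target site should not be hammered.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request


def fetch(url, timeout=30):
    """Download one URL and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_all(urls, fetch_one=fetch, max_workers=4):
    """Fetch URLs on a small thread pool.

    Returns (results, errors): two dicts keyed by URL, so that failed
    documents can be retried in a later pass instead of aborting the run.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[url] = exc
    return results, errors
```

Threads suit this job because the work is almost entirely waiting on the network; parsing with XPath can then run over the saved files separately.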

inovica 6645 days ago

Be careful here. The content is actually copyrighted. Whilst you can scrape it, their T&Cs expressly forbid it. They sell licenses to access this information - the license is NOT expensive and they provide direct access to all the data in XML.

friism 6644 days ago

http://ted.europa.eu/Exec?DataFlow=ShowPage.dfl&Template...

Quote: "Reproduction is authorised provided the source is acknowledged. However, to prevent disruptions in service to our normal users from bulk downloads of TED data, we reserve the right to check for, and block, attempts to download excessive quantities of documents, particularly using automated or robot-like tools."

... they apparently chose not to exercise that right in this case, the scrape completed last night (all 18 GB of it).

jdvolz 6644 days ago

I've written a lot of this sort of program over the last 18 months. This is something that people are in need of all the time. I would say that there isn't yet a tool which does this to the level that customers want.

I use Mechanize, both in its Ruby and Python forms (I prefer Ruby) and plain old regular expressions to get the information that I want. Often times I will use a divide and conquer strategy by removing part of the web page (for example, the <head>) and successively paring it down to what I really want.

Javascript can be a problem. What I normally do is actually read the Javascript on the page, and then recreate that behavior in my Ruby code. Often times this means simply setting some form values (usually hidden) and then submitting the form.
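The "set some hidden form values and submit" step might look like this in Python (the poster works in Ruby with Mechanize, so this is a translation of the idea, not their code). `FormFields` and `build_post_body` are hypothetical names, and the sketch assumes a page with a single form.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFields(HTMLParser):
    """Collect name/value pairs from every <input> on the page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_post_body(page, overrides):
    """Start from the form's own values (hidden ones included), then apply
    the values the page's Javascript would have set before submitting."""
    parser = FormFields()
    parser.feed(page)
    parser.close()
    fields = dict(parser.fields)
    fields.update(overrides)
    return urlencode(fields).encode("ascii")
```

The resulting bytes can be passed as the `data` argument of `urllib.request.urlopen` to submit the form as a POST.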

bkrausz 6645 days ago

Scraping HN for fnid's to auto-post xkcd comics :-P. It was a hack so I just did string searches.

3KWA 6645 days ago

Web scraping with Beautiful Soup (is this old school already?) - parsing the Sydney Futures Exchange for data back in 2003 (still running)

fallentimes 6645 days ago

We're using a general-use multi-threaded crawler to get the pages and then using Beautiful Soup and a bit of regex to parse them. Though we are scraping multiple sites, they are all in the same "category" so to speak, so there are a lot of generic parsing methods that are simply overridden when necessary. PyParsing was played with for a while, but since data comes in so many slightly varied forms I was ending up with rules that were miles and miles long just to find a simple price or date/time on a page that would work for the largest number of sites possible.
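The "generic parsing methods that are simply overridden when necessary" pattern could be laid out along these lines. Everything here is invented for illustration: the class names, the price formats, and the `eurosite.example` host are assumptions, not details of the poster's system.

```python
import re


class GenericParser:
    """Default extraction rules shared by every site in the category."""

    PRICE = re.compile(r"\$\s*(\d+(?:\.\d{2})?)")

    def parse_price(self, text):
        match = self.PRICE.search(text)
        return float(match.group(1)) if match else None

    def parse(self, text):
        return {"price": self.parse_price(text)}


class EuroSiteParser(GenericParser):
    """Hypothetical site that writes prices as '12,50 EUR'."""

    PRICE = re.compile(r"(\d+),(\d{2})\s*EUR")

    def parse_price(self, text):
        match = self.PRICE.search(text)
        if not match:
            return None
        return float(match.group(1) + "." + match.group(2))


# Only sites that deviate from the generic rules need an entry here.
PARSERS = {"eurosite.example": EuroSiteParser()}


def parser_for(host):
    return PARSERS.get(host, GenericParser())
```

Keeping each rule in its own small method means a new site usually needs one or two overrides rather than a whole new scraper.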

sheriff 6644 days ago

My startup, http://www.FuseCal.com (previously discussed at http://news.ycombinator.com/item?id=146134), scrapes calendar events out of web pages and into your personal calendar. In the general case, we don't know anything about the layout of the page before trying to extract the events, so there's something of a classification problem first.

johnb 6645 days ago

I'm a big fan of using Hpricot + Ruby. I'd say the sites I had been scraping but I doubt my old client wants it to come out :|

To get the most bang for my buck (developer time wise) I would visit each site with Firebug in inspect mode and hover over the data I want to extract. From there I figure out how I would style that element, and because Hpricot supports CSS selectors I've straight away got a method for pulling that data out of the page.

mk 6645 days ago

This sounds redundant already, but I scrape using beautiful soup. Right now I'm scraping a lot of news sites and feeds for a project I am working on.

ivrokv 6645 days ago

This can be very useful. I use pyparsing with custom python code for scraping.

mrtron 6645 days ago

I always do custom stuff in beautiful soup, but this looks somewhat cool.

Maybe have it so you can edit the sample text and language and see the results all on a web page?

blinks 6644 days ago

http://gatherer.wizards.com with BeautifulSoup, the only parser I've found that can deal with this @$%^! HTML.

andrew311 6644 days ago

Do any of you who scrape fear retaliation from the sites you scrape? Maybe you are violating a ToS or scraping copyrighted text, and they cut off your IP. Thoughts?

inovica 6644 days ago

I think you have to take into consideration the TOS, copyright and also robots.txt. If you ignore these then it's well within the site owner's rights to do something about it - blocking you or further. We always look at the robots.txt file first and use that as our benchmark in terms of what they (the site) wish robots/crawlers to look at.
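Checking robots.txt before fetching is straightforward with Python's standard `urllib.robotparser`. In this sketch the helper name `allowed` and the user agent string are placeholders; the rules are passed in as text so the check can be tried without touching a live site.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Against a live site, the parser can fetch the file itself instead:
#   parser = RobotFileParser("http://example.com/robots.txt")
#   parser.read()
#   parser.can_fetch("my-scraper", "http://example.com/some/page")
```

Note that robots.txt only covers crawler access; the terms of service and copyright questions raised above still need a human to read them.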

glasner 6644 days ago

I have a similar DSL built in Ruby that can be run by either Mechanize or Watir. I highly recommend Watir if you need to scrape ajax.

misterbwong 6645 days ago

I've been using C# with the HTMLAgilityPack. Probably not as fast as it could be, but C# is what I know best.

bct 6645 days ago

Yum, declarative!

I can't say I'm crazy about the syntax, but I'll give this a try when I get home.

bprater 6645 days ago

Firebug can be helpful for finding elements you want to regex on!

michaelneale 6644 days ago

anyone that complains about HTML scraping is a pussy. Seriously, it's trivial compared to what we had to do in the past. I like hpricot for ruby.

latone 6645 days ago

Longest Common Subsequences are quite useful as well.

schaaf 6645 days ago

Do you mean for adding a little resilience to your rigid model, or something funkier?
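The thread never says how the longest common subsequence is meant to be used, so the following is one possible reading: tokenise two pages built from the same template, take their LCS as the shared template, and treat whatever falls outside it as the varying data. The function below is the textbook dynamic-programming algorithm; the tokenisation in the usage note is an assumption.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the subsequence.
    out, i, j = [], 0, 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out
```

For example, splitting two listing pages on whitespace and calling `lcs(tokens_a, tokens_b)` leaves the tokens common to both; prices or dates that differ between the pages drop out. The table is quadratic in size, so long pages are better compared a section at a time.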

yters 6644 days ago

anyone have success with emacs and w3? I haven't given it a shot yet, but seems like its interactive nature might be useful.

ashu 6645 days ago

What: Banks. With: libwww-perl.