Ask HN: How to aggregate product info from other websites
14 points by jreilly 6429 days ago
Any input on the best way to aggregate product information from various websites would be much appreciated. Most of the websites I would like to aggregate lack any APIs that I could use to track prices and things of that nature.

I have zero experience with web scraping of any kind, so any direction would be helpful. Before I start digging, I figured HNers might have some invaluable advice.

7 comments

Python & Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) are your friends here.
Anyone know of a similar library in rails that works well?
I'm pretty sure this (HPricot) is the one most people use in Ruby:

http://code.whytheluckystiff.net/hpricot/

hpricot is great, but I hear lately that Nokogiri is faster and just as capable:

http://github.com/tenderlove/nokogiri/tree/master

Nokogiri even has an Hpricot emulation mode, if you need to port Hpricot code.
You might also want to: 1. Worry about copyright law. 2. Make sure you do not hit the site so often that you show up in their logs as a bandwidth hog and get blocked.
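On point 2, a minimal sketch of spacing requests out, assuming a fixed minimum delay is enough for the sites in question. The `Throttle` class and the 2-second figure are made up for illustration; the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforces a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous request, None before the first one

    def wait(self):
        """Blocks until at least min_interval has passed since the last call."""
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


# Usage: call wait() right before every fetch to the same site.
throttle = Throttle(min_interval=2.0)
throttle.wait()  # first call returns immediately
```

Calling `throttle.wait()` before each request keeps the crawl at one page per interval at most, however fast the rest of the script runs.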
I use python to write scripts of this nature (one script so far:)).

Python has an SGML SAX-style parser, and since HTML is an SGML application it can be used. Better than regexps any day.

Python's http client library also supports cookies so that you can pretend to have a "session" with your target website.

EDIT: the libraries are urllib2, sgmllib, cookielib
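For anyone reading this on Python 3: urllib2 and cookielib became `urllib.request` and `http.cookiejar`, and sgmllib was removed, with `html.parser` as the closest stdlib stand-in. A rough sketch of the same idea follows, assuming a made-up page where the price sits in `<span class="price">`. The markup, the class name, `PriceParser`, and `make_opener` are all hypothetical.

```python
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar


class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(data.strip())


def make_opener():
    """Builds an opener that keeps cookies between requests, i.e. one 'session'."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Offline demo on a literal string; a real run would feed the parser the
# decoded body of opener.open(url).read() instead.
sample = '<div><span class="price">$19.99</span><span class="name">Widget</span></div>'
parser = PriceParser()
parser.feed(sample)
print(parser.prices)
```

Every request made through the same opener shares the cookie jar, which is what lets the script behave like a logged-in or otherwise stateful visitor.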

Very few pages have well formed mark-up. The few large scraping projects I've seen have started out with a mark-up based approach and then switched to regular expressions.

What experiences has everyone else had?

I've had a lot of success with BeautifulSoup. It turns terrible HTML into a usable DOM tree.
I am using Beautiful Soup; it works with malformed markup.
Uhm. I thought any valid HTML is also valid SGML? Are you sure you're not confusing it with XML markup?
I think he's saying that the HTML-parsing approach only works when the HTML is well-formed and for most sites, it isn't.
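To make the trade-off concrete, here is a small sketch that runs both approaches on the same deliberately broken snippet. It uses the stdlib `html.parser` as a stand-in for a tolerant parser (Beautiful Soup would need installing), and the sample markup, the `CellText` class, and the price pattern are invented for the example.

```python
import re
from html.parser import HTMLParser

# Deliberately broken: unclosed <td> and <b>, unquoted attributes, no </table>.
BROKEN = "<table><tr><td class=price><b>$1,299.00<tr><td class=price>$5.49"


class CellText(HTMLParser):
    """Gathers the text that follows each <td class=price> start tag."""

    def __init__(self):
        super().__init__()
        self.capture = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # Each new cell resets the capture flag; no closing tag is required.
        if tag == "td":
            self.capture = dict(attrs).get("class") == "price"

    def handle_data(self, data):
        if self.capture and data.strip():
            self.cells.append(data.strip())


parser = CellText()
parser.feed(BROKEN)
parser.close()

# Regex route: ignore structure and match anything shaped like a dollar amount.
PRICE_RE = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")
by_regex = PRICE_RE.findall(BROKEN)

print(parser.cells)
print(by_regex)
```

Both routes cope with this snippet. The difference shows up later: the parser version knows which cell a value came from, while the regex version would also pick up any other dollar amount on the page.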
Thanks for the input. I am currently learning Rails, so I'm also wondering if there are any libraries that will make this significantly easier.
See if the sites in question are part of an affiliate network, like Commission Junction or LinkShare. They often provide plain-text feeds to affiliates through these programs, and many of their terms of service allow you to set up these kinds of services (although some have restrictions on mixing their data with data from their competitors).

However, I've found that even this data isn't all that clean (sometimes you can't trust the name of the product, the price, the link, or the SKU to even match the website) and isn't updated very often (product availability, for example). But it's a hell of a lot easier than writing a custom parser for each site's HTML, although when I was working on a project like this, I had to write a custom parser for each feed in order to put them in a more consistent format.
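A sketch of what "a custom parser for each feed" can boil down to when the feeds are delimited text: one column mapping per feed, one shared output shape. The feed names, column names, delimiters, and sample rows below are all invented; real affiliate feeds will differ.

```python
import csv
import io
from decimal import Decimal

# Hypothetical per-feed column names, mapped onto one shared set of keys.
FEED_COLUMNS = {
    "feed_a": {"sku": "SKU", "name": "ProductName", "price": "Price", "url": "BuyURL"},
    "feed_b": {"sku": "id", "name": "title", "price": "sale_price", "url": "link"},
}


def normalize(feed_name, text, delimiter):
    """Yields dicts with the same keys regardless of the feed's own layout."""
    columns = FEED_COLUMNS[feed_name]
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        # Strip currency symbols and thousands separators before parsing.
        price = row[columns["price"]].replace("$", "").replace(",", "").strip()
        yield {
            "source": feed_name,
            "sku": row[columns["sku"]].strip(),
            "name": row[columns["name"]].strip(),
            "price": Decimal(price),
            "url": row[columns["url"]].strip(),
        }


# Made-up sample data: one pipe-delimited feed, one tab-delimited feed.
feed_a = "SKU|ProductName|Price|BuyURL\nA-1|Widget|$1,299.00|http://example.com/a1\n"
feed_b = "id\ttitle\tsale_price\tlink\n77\tGadget\t5.49\thttp://example.com/77\n"

products = list(normalize("feed_a", feed_a, "|")) + list(normalize("feed_b", feed_b, "\t"))
for product in products:
    print(product["source"], product["sku"], product["price"])
```

Prices are kept as `Decimal` rather than float so that comparing two feeds' prices for the same product does not trip over rounding.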
For Ruby, consider Scrubyt: http://scrubyt.org/

If you're wondering why, well, consider this script that "learns" how to scrape Google results (from one supplied example of output data):

  google_data = Scrubyt::Extractor.define do
    fetch 'http://www.google.com/ncr'
    fill_textfield 'q', 'ruby'
    submit

    link "Ruby Programming Language" do
      url "href", :type => :attribute
    end

    next_page "Next", :limit => 2
  end

  puts google_data.to_xml
Reads almost like English in the scraping part!
Perl's WWW::Mechanize module is a good choice for scraping & automating website interactions.
Mechanize also has a Ruby port, which may suit you since you are working with Rails.
What are you trying to do exactly? It depends a lot on the type of data you're trying to gather.
I am basically trying to track prices of certain products easily so I do not have to worry about doing it by hand and checking myself every once in a while.
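Whatever ends up fetching the prices, the tracking part can stay small. A minimal sketch, assuming one row per observation in SQLite; the table layout, the `record_price` helper, the product name, and the dates are placeholders, not anything from a real site.

```python
import sqlite3


def record_price(db, product, price_cents, checked_at):
    """Stores one observation and returns the previous price, or None if first seen."""
    row = db.execute(
        "SELECT price_cents FROM prices WHERE product = ? "
        "ORDER BY checked_at DESC LIMIT 1",
        (product,),
    ).fetchone()
    db.execute(
        "INSERT INTO prices (product, price_cents, checked_at) VALUES (?, ?, ?)",
        (product, price_cents, checked_at),
    )
    return row[0] if row else None


# In-memory database for the demo; a real script would pass a file path.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE prices (product TEXT, price_cents INTEGER, checked_at TEXT)")

# Placeholder observations; ISO dates sort correctly as plain text.
first = record_price(db, "widget", 1999, "2024-01-01")
previous = record_price(db, "widget", 1799, "2024-01-02")
if previous is not None and previous != 1799:
    print(f"widget changed from {previous} to 1799 cents")
```

Run from cron once a day, something like this replaces the by-hand checking: the script only needs to speak up when `record_price` returns a value different from the new one.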
Have you considered looking at the price comparison sites? Some of them have an API. Unless you plan to compete with them directly, you will save a lot of time.

Example: http://developer.yahoo.com/shopping/