Ask HN: How to aggregate product info from other websites

	14 points by jreilly 6429 days ago
	Any input on the best way to aggregate product information from various websites would be much appreciated. Most of the websites I would like to aggregate lack any API that I could use to track prices and similar data. I have zero experience with web scraping of any kind, so any direction would be helpful. Before I start digging, I figured HNers may have some invaluable advice.

7 comments

astrec 6429 days ago

Python & Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) are your friends here.

jreilly 6429 days ago

Anyone know of a similar library in rails that works well?

rgrieselhuber 6429 days ago

I'm pretty sure this (HPricot) is the one most people use in Ruby:

http://code.whytheluckystiff.net/hpricot/

abijlani 6429 days ago

Here's a great hpricot tutorial

http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-s...

qhoxie 6429 days ago

hpricot is great, but I hear lately that Nokogiri is faster and just as capable:

http://github.com/tenderlove/nokogiri/tree/master

ivanstojic 6428 days ago

Nokogiri even has a HPricot emulation mode, if you need to port HPricot code.

shabda 6429 days ago

You might also want to: 1. Worry about copyright law. 2. Make sure you do not hit the site so often that you show up in their logs as a bandwidth hog and get blocked.
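
Point 2 can be sketched roughly as follows. This is a minimal illustration, not a complete crawler: the `Throttle` class, the `allowed_by_robots` helper, the `price-bot` user agent, and the `shop.example` URLs are all made up for the example, and the right delay depends on the site. The robots.txt check uses Python's standard `urllib.robotparser` on text you have already downloaded.

```python
import time
import urllib.robotparser


class Throttle:
    """Enforce a minimum gap between consecutive requests to one site."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect min_interval; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt text that has already been fetched."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling `throttle.wait()` before every request to the same host, and skipping any URL for which `allowed_by_robots` returns False, keeps the request rate bounded. It does not address the copyright question in point 1.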

DenisM 6429 days ago

I use python to write scripts of this nature (one script so far:)).

Python has an SGML SAX-style parser, and since HTML is an application of SGML, it can be used. Better than regexps any day.

Python's http client library also supports cookies so that you can pretend to have a "session" with your target website.

EDIT: the libraries are urllib2, sgmllib, cookielib
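
For anyone reading this on Python 3: `sgmllib` was removed, `urllib2` became `urllib.request`, and `cookielib` became `http.cookiejar`. The same event-driven approach can be sketched with the standard `html.parser` module. The `PriceParser` class, the `price` CSS class, and the sample markup below are assumptions made up for illustration; a real site will need its own selectors.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
import urllib.request


class PriceParser(HTMLParser):
    """Collect the text of every element whose class attribute includes 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True
            self.prices.append("")

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current price element.
        self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices[-1] += data


def make_session_opener():
    """Build an opener that stores and resends cookies, like a browser session."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Hypothetical page fragment; note the unclosed <li> tags.
SAMPLE = (
    '<ul><li>Widget <span class="price">$19.99</span>'
    '<li>Gadget <span class="price">$5.00</span></ul>'
)

parser = PriceParser()
parser.feed(SAMPLE)
parser.close()
print(parser.prices)  # ['$19.99', '$5.00']
```

To fetch a live page, `opener.open(url).read()` would return the bytes to decode and feed to the parser, with cookies carried across calls by the jar.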

olegp 6429 days ago

Very few pages have well formed mark-up. The few large scraping projects I've seen have started out with a mark-up based approach and then switched to regular expressions.

What experiences has everyone else had?
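
For comparison, the regular-expression approach can be sketched like this. The pattern and the sample markup are assumptions for illustration only: the pattern handles US-style dollar amounts and nothing else, and because it ignores document structure it picks up every dollar amount on the page, including ones unrelated to the product.

```python
import re
from decimal import Decimal

# A dollar sign, optional whitespace, digits with optional thousands commas,
# and an optional two-digit cents part.
PRICE_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{2})?)")


def extract_prices(html):
    """Return every dollar amount found in the raw text as a Decimal."""
    return [Decimal(m.replace(",", "")) for m in PRICE_RE.findall(html)]


# Hypothetical fragment with unquoted attributes, mixed case and an unclosed <b>.
SAMPLE = "<td class=price><b>$1,299.00</td><TD>was <strike>$1,499</strike>"

print(extract_prices(SAMPLE))
```

The regex does not care that the markup is broken, which is its appeal. The cost is that a small layout change can silently change which numbers it matches.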

Harkins 6429 days ago

I've had a lot of success with BeautifulSoup. It turns terrible HTML into a usable DOM tree.

joseakle 6429 days ago

I am using Beautiful Soup; it works with malformed markup.

DenisM 6429 days ago

Uhm. I thought any valid HTML is also valid SGML? Are you sure you're not confusing it with XML markup?

ryanwaggoner 6429 days ago

I think he's saying that the HTML-parsing approach only works when the HTML is well-formed, and for most sites it isn't.

jreilly 6429 days ago

Thanks for the input. I am currently learning Rails, so I am also wondering whether there are any libraries that will make this significantly easier.

thwarted 6429 days ago

See if the sites in question are part of an affiliate network, like Commission Junction or Link Share. They often provide plain-text feeds to affiliates through these programs, and many of their terms of service allow you to set up these kinds of services (although some have restrictions on mixing their data with data from their competitors). However, I've found that even this data isn't all that great, cleanliness-wise (sometimes you can't trust the name of the product, the price, the link, or the SKU to even match the website), and some of it, like product availability, isn't updated very often. But it's a hell of a lot easier than writing a custom parser for each site's HTML (although when I was working on a project like this, I had to write a custom parser for each feed in order to put them in a more consistent format).
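
The per-feed parsers mentioned above might look roughly like this. Both feed layouts are invented for the example and are not the actual formats used by any affiliate network. The point is only that each parser maps its own columns and units onto one shared record shape.

```python
import csv
import io
from decimal import Decimal


def parse_feed_a(text):
    """Hypothetical feed A: tab-separated with a header row, price in dollars."""
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        yield {
            "sku": row["SKU"].strip(),
            "name": row["ProductName"].strip(),
            "price": Decimal(row["Price"]),
            "url": row["BuyURL"].strip(),
        }


def parse_feed_b(text):
    """Hypothetical feed B: pipe-separated, no header row, price in cents."""
    for sku, name, cents, url in csv.reader(io.StringIO(text), delimiter="|"):
        yield {
            "sku": sku.strip(),
            "name": name.strip(),
            "price": Decimal(cents) / 100,
            "url": url.strip(),
        }


# Made-up sample data for the two layouts.
FEED_A = "SKU\tProductName\tPrice\tBuyURL\nA-1\tWidget\t19.99\thttp://shop-a.example/w\n"
FEED_B = "B-7|Gadget |500|http://shop-b.example/g\n"

# Every record now has the same keys and the price in the same unit.
products = list(parse_feed_a(FEED_A)) + list(parse_feed_b(FEED_B))
for product in products:
    print(product["sku"], product["name"], product["price"])
```

Using `Decimal` rather than floats avoids rounding surprises when prices from different feeds are compared later.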

petercooper 6429 days ago

For Ruby, consider Scrubyt: http://scrubyt.org/

If you're wondering why, well, consider this script that "learns" how to scrape Google results (from one supplied example of output data):

  google_data = Scrubyt::Extractor.define do
    fetch 'http://www.google.com/ncr'
    fill_textfield 'q', 'ruby'
    submit

    link "Ruby Programming Language" do
      url "href", :type => :attribute
    end

    next_page "Next", :limit => 2
  end

  puts google_data.to_xml

Reads almost like English in the scraping part!

aneesh 6429 days ago

Perl's WWW::Mechanize module is a good choice for scraping & automating website interactions.

qhoxie 6429 days ago

Mechanize also has a Ruby port, which is relevant since you are working with Rails.

tocomment 6429 days ago

What are you trying to do exactly? It depends a lot on the type of data you're trying to gather.

jreilly 6429 days ago

I am basically trying to track the prices of certain products automatically, so I do not have to check them by hand every once in a while.
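
Whatever ends up fetching the prices, the tracking part can stay small. Here is a rough sketch using Python's built-in `sqlite3`. The table layout and the `open_store` and `record_price` names are assumptions made up for this example, and prices are stored as integer cents to avoid float rounding.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the price history database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        " product TEXT NOT NULL,"
        " cents INTEGER NOT NULL,"
        " seen_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return db


def record_price(db, product, cents):
    """Store an observation if it differs from the last one.

    Returns the previous price in cents when the price changed, and None
    when it is unchanged or when the product is seen for the first time.
    """
    row = db.execute(
        "SELECT cents FROM prices WHERE product = ? ORDER BY rowid DESC LIMIT 1",
        (product,),
    ).fetchone()
    if row is not None and row[0] == cents:
        return None  # unchanged, nothing new to store
    db.execute("INSERT INTO prices (product, cents) VALUES (?, ?)", (product, cents))
    db.commit()
    return row[0] if row is not None else None
```

Run from a scheduled job, a non-None return value from `record_price` is the signal to send yourself a notification.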

ks 6428 days ago

Have you considered looking at the price comparison sites? Some of them have an API. Unless you plan to compete with them directly, you will save a lot of time.

Example: http://developer.yahoo.com/shopping/
