Hacker News
Ask HN: Is there an HTML table scraper generator, in Python or otherwise?
2 points by jeffjia 4622 days ago
Hi,

In one of my projects, I need to run scrapers against tens of websites to collect the rows and columns of tables (<table>, <ul>, <div>). The tables are well formatted. I have written several scrapers in Python, which basically use CSS selectors and then do some simple transformation with regular expressions. I wonder whether there is a scraper generator that takes a URL and a sample of the target output as input, and produces a scraper automatically?

Any suggestions are welcome. Thanks in advance.
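
For context, a hand-written scraper of the kind described above might look roughly like this. It is a minimal sketch, not the poster's actual code: it uses the standard library's html.parser in place of a CSS-selector library so that it runs without extra packages, and the TableRows class, the clean and scrape_table helpers, and the sample HTML are all hypothetical.

```python
import re
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by sloppy markup


def clean(cell):
    """Simple regex transformation: collapse whitespace runs and trim."""
    return re.sub(r"\s+", " ", cell).strip()


def scrape_table(html):
    """Return the table in `html` as a list of rows of cleaned cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [[clean(c) for c in row] for row in parser.rows]


# Hypothetical sample input, standing in for a fetched page.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td> Widget </td><td>$4.50</td></tr>
  <tr><td>Gadget</td><td>$12.00</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in scrape_table(SAMPLE_HTML):
        print(row)
```

The part that has to be redone by hand for each site is the choice of which elements to collect, which is what the question is about.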

5 comments

Have you looked at PhantomJS?

The webintro example here (https://github.com/ariya/phantomjs/wiki/Examples) scrapes a specific element.

I was using mechanize + Beautiful Soup in Python before, but it seems that this one also needs a human to read the HTML source and pick a CSS selector instead of automating it...
I would take a look at the Mac App FakeApp. It does a lot of what you are describing, especially with regard to CSS and XPath selectors. I have been using it and have been able to do some really great stuff.
If you don't want to build it yourself, check out import.io. They turn any website into an API. They did a demo at SV Newtech a couple months ago.
Thanks Johnie. It is almost what I want, except that it is neither open source nor free...
Have you taken a look at the Scrapy framework for Python?

http://scrapy.org/

Thanks. I used Beautiful Soup for the parser, and have actually written a crawler framework for my scenario. But I was wondering whether there is any tool that could automate the selection of CSS selectors or XPath expressions.
I wrote a few scrapers and found Scrapy to be my best option.
I've used BeautifulSoup to do stuff like this.
Yeah. Me too. CSS selectors are quite convenient. The only problem is that I need to pick the selector set for each website I need to scrape, and there are tens of them, which makes the work time-consuming...
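
One way to approach the selector-picking problem is to infer the selector from a few sample values of the target output. The following is a rough sketch of that idea only, not an existing tool and not a robust implementation: it records a tag-and-class path for every text node using the standard library's html.parser, picks the path that matches the most sample values, and then reuses that path to extract the remaining values. All names here (PathRecorder, infer_selector, extract, the sample HTML) are hypothetical, and it assumes the target values sit in elements that share the same tag and class path.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathRecorder(HTMLParser):
    """Record a (selector_path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self._stack = []  # one "tag.class1.class2" step per open element
        self.found = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self._stack.append(tag + "".join("." + c for c in classes))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed children.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i].split(".")[0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.found.append((" > ".join(self._stack), text))


def record(html):
    parser = PathRecorder()
    parser.feed(html)
    parser.close()
    return parser.found


def infer_selector(html, samples):
    """Return the path shared by the most sample values, or None if none match."""
    wanted = set(samples)
    counts = Counter(path for path, text in record(html) if text in wanted)
    return counts.most_common(1)[0][0] if counts else None


def extract(html, selector):
    """Return the text of every node whose path equals `selector`."""
    return [text for path, text in record(html) if path == selector]


# Hypothetical sample page and sample target output.
SAMPLE_HTML = """
<div class="listing">
  <ul class="items">
    <li class="item"><span class="name">Widget</span><span class="price">$4.50</span></li>
    <li class="item"><span class="name">Gadget</span><span class="price">$12.00</span></li>
    <li class="item"><span class="name">Doohickey</span><span class="price">$7.25</span></li>
  </ul>
</div>
"""

if __name__ == "__main__":
    selector = infer_selector(SAMPLE_HTML, ["Widget", "Gadget"])
    print(selector)
    print(extract(SAMPLE_HTML, selector))
```

The inferred path uses the child combinator, so it could in principle be passed to a CSS-selector library afterwards. Real pages would likely need more than this, for example ids, positional selectors, or fuzzy matching when the sample text has been transformed.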