Ask YC: Blog parsing (WordPress, Typepad, Blogger)


2 points by samson 6460 days ago

I'm trying to develop a crawler that knows when the page being sent to it is actually a post page, and not an index, search, tag, or calendar (e.g. November 2008) page.

I want pages like this: http://1vibe.net/music/jim-jones-ft-lil-wayne-noe-twista-jackin-swagga-from-us/

Not this: http://1vibe.net/category/behind-the-scenes/
Not this: http://1vibe.net/2008/11/
Not this: http://1vibe.net/tag/50-cent/

From the blog post page I want to grab the title and date of that post.

The way I was trying to do it was to look through the DOM of the site and look for consistency. I found consistency in Blogger and Typepad, but WordPress was all over the place in its formatting from site to site.

So I figure I must have been doing it wrong, and that there is a more intelligent way of doing it: XML, RDF, feeds.

I'd appreciate it if anyone could help (also, I'm doing it in PHP).

2 comments

raquo 6460 days ago

If you are interested only in new posts, you can look in blogs' RSS feeds. They are nearly always in default locations.

Or you could parse the URL - I had a similar task some time ago, and I went with URLs - Blogger and Typepad are consistent; WordPress depends on the blog, of course, but you could figure out the several most popular patterns (e.g. /yyyy/mm/dd/posttitle, /id-posttitle) and get something like 90% of all blogs right.
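A rough sketch of the URL-pattern idea, written in Python for brevity (the same regexes should port to PHP's preg_match). The pattern lists and the classify function are illustrative assumptions, not a complete catalogue of what blogs use; listing pages are checked first so that a date archive like /2008/11/ is not mistaken for a post.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns for pages that are NOT single posts.
ARCHIVE_PATTERNS = [
    re.compile(r"^/(category|tag|author|search)/"),  # taxonomy and search listings
    re.compile(r"^/\d{4}(/\d{1,2}){0,2}/?$"),         # date archives: /2008/, /2008/11/, /2008/11/05/
    re.compile(r"^/page/\d+/?$"),                      # paginated index
    re.compile(r"^/?$"),                               # home page
]

# Hypothetical patterns for single-post permalinks.
POST_PATTERNS = [
    re.compile(r"^/\d{4}/\d{1,2}(/\d{1,2})?/[^/]+/?$"),  # /yyyy/mm/dd/posttitle or /yyyy/mm/posttitle
    re.compile(r"^/\d+-[^/]+/?$"),                        # /id-posttitle
    re.compile(r"^/[^/]+/[^/]+/?$"),                      # /section/posttitle
]


def classify(url):
    """Return 'archive', 'post', or 'unknown' based only on the URL path."""
    path = urlparse(url).path
    # Check listing pages first so date archives never fall through to the post patterns.
    if any(p.search(path) for p in ARCHIVE_PATTERNS):
        return "archive"
    if any(p.search(path) for p in POST_PATTERNS):
        return "post"
    return "unknown"


if __name__ == "__main__":
    for u in [
        "http://1vibe.net/music/jim-jones-ft-lil-wayne-noe-twista-jackin-swagga-from-us/",
        "http://1vibe.net/category/behind-the-scenes/",
        "http://1vibe.net/2008/11/",
        "http://1vibe.net/tag/50-cent/",
    ]:
        print(classify(u), u)
```

Anything that comes back as "unknown" is a candidate for a new pattern, which is roughly how you would grow the list toward that 90%.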

Or maybe, just maybe, you could use some third parties that have already figured it out via RSS - maybe Technorati?


samson 6459 days ago

Yeah, I think that's the route I'll end up going. I've already started developing a pattern system, and overnight I thought of a few ways that might make it easier to get the title and date of a page.

There's only one thing I'm still stumped on, and that's simply how you tell when you're on the original article page and not an index/tag/search page that sometimes contains the same content as the article page.


Raphael 6460 days ago

Just parse the URL. Or you can pull in the RSS feed, although that usually only goes back 20 posts.
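For the feed route, a minimal sketch of pulling title, date, and permalink out of an RSS 2.0 feed, again in Python (PHP's SimpleXML can do the equivalent). The sample feed and the posts_from_rss name are made up for illustration, and Atom feeds use different element names, so this covers RSS 2.0 only.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up sample feed; a real crawler would download the blog's feed instead.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/2008/11/05/first-post/</link>
      <pubDate>Wed, 05 Nov 2008 10:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def posts_from_rss(xml_text):
    """Return one dict per <item> with its title, link, and parsed date."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        posts.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            # pubDate uses the RFC 822 date format; leave None if it is missing.
            "date": parsedate_to_datetime(pub) if pub else None,
        })
    return posts


if __name__ == "__main__":
    for post in posts_from_rss(SAMPLE_RSS):
        print(post["date"], post["title"], post["link"])
```

Each item's link is normally the post's permalink, so any URL that shows up in the feed can be treated as a post page rather than an index or tag page.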
