Ask YC: Blog parsing (WordPress,Typepad,Blogger)
2 points
by samson
6414 days ago
I'm trying to develop a crawler that can tell when the page it is given is actually a post page, and not an index, search, tag, or calendar (e.g. November 2008) page.
I want pages like this -> http://1vibe.net/music/jim-jones-ft-lil-wayne-noe-twista-jackin-swagga-from-us/
Not this -> http://1vibe.net/category/behind-the-scenes/
Not this -> http://1vibe.net/2008/11/
Not this -> http://1vibe.net/tag/50-cent/
From the blog post page I want to grab the title and date of that post. The way I was trying to do it was to look through the DOM of the site and look for consistency.
I found consistency in Blogger and Typepad, but WordPress was all over the place in its formatting from site to site. So I figure I must have been doing it wrong and that there is a more intelligent way of doing it, via XML, RDF, or feeds. I'd appreciate it if anyone could help (also, I'm doing it in PHP).
|
Or you could parse the URL. I had a similar task some time ago, and I went with URLs: Blogger and Typepad are consistent; WordPress depends on the blog, of course, but you could figure out the several most popular patterns (e.g. /yyyy/mm/dd/posttitle, /id-posttitle) and get something like 90% of all blogs right.
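A rough sketch of the URL approach, in Python for brevity (the regexes should carry over to PHP's preg_match more or less unchanged). The pattern lists are assumptions for illustration: the /section/posttitle pattern is taken from the example URL in the question, the listing prefixes are guesses at common WordPress layouts, and `looks_like_post` is a hypothetical helper name. Listing patterns are checked first, because a date archive like /2008/11/ would otherwise pass as /section/posttitle.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed patterns for pages that are NOT single posts.
NON_POST_PATTERNS = [
    re.compile(r"^/$"),                                    # blog index
    re.compile(r"^/(category|tag|author|page|search)/"),   # listing pages
    re.compile(r"^/\d{4}/(\d{1,2}/)?(\d{1,2}/)?$"),        # date archives
]

# Assumed patterns for single-post permalinks.
POST_PATTERNS = [
    re.compile(r"^/\d{4}/\d{2}/\d{2}/[^/]+/?$"),  # /yyyy/mm/dd/posttitle
    re.compile(r"^/\d+-[^/]+/?$"),                # /id-posttitle
    re.compile(r"^/[^/]+/[^/]+/?$"),              # /section/posttitle
]


def looks_like_post(url):
    """Guess from the URL alone whether it points at a single post."""
    parsed = urlparse(url)
    path = parsed.path or "/"
    # WordPress search results use the ?s=... query parameter.
    if "s" in parse_qs(parsed.query):
        return False
    # Rule out listing pages before trying the post patterns.
    if any(p.match(path) for p in NON_POST_PATTERNS):
        return False
    return any(p.match(path) for p in POST_PATTERNS)


if __name__ == "__main__":
    for u in [
        "http://1vibe.net/music/jim-jones-ft-lil-wayne-noe-twista-jackin-swagga-from-us/",
        "http://1vibe.net/category/behind-the-scenes/",
        "http://1vibe.net/2008/11/",
        "http://1vibe.net/tag/50-cent/",
    ]:
        print(looks_like_post(u), u)
```

This only classifies the URL; the title and date would still have to come from the page itself or from a feed.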
Or maybe, just maybe, you could use some third parties that have already figured it out via RSS - maybe Technorati?
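If you go the feed route, the blog's own feed may be enough before reaching for a third party: each item in an RSS 2.0 feed carries the post's link, title, and publication date. A minimal sketch in Python, assuming a plain RSS 2.0 feed; the sample feed and the `posts_from_rss` helper are made up for illustration, and Atom or RDF (RSS 1.0) feeds use namespaced elements that would need different lookups.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up sample feed, standing in for a downloaded RSS 2.0 document.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://blog.example/</link>
    <item>
      <title>First post</title>
      <link>http://blog.example/2008/11/05/first-post/</link>
      <pubDate>Wed, 05 Nov 2008 10:30:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def posts_from_rss(xml_text):
    """Map each post URL in an RSS 2.0 feed to a (title, date) pair."""
    root = ET.fromstring(xml_text)
    posts = {}
    for item in root.iter("item"):
        link = item.findtext("link")
        if not link:
            continue  # an item without a link cannot be matched to a page
        title = item.findtext("title")
        pub = item.findtext("pubDate")
        try:
            # pubDate uses the RFC 822 date format.
            date = parsedate_to_datetime(pub) if pub else None
        except (TypeError, ValueError):
            date = None  # malformed date: keep the post, drop the date
        posts[link.strip()] = (title.strip() if title else None, date)
    return posts


if __name__ == "__main__":
    for url, (title, date) in posts_from_rss(SAMPLE_RSS).items():
        print(url, title, date)
```

Any URL that shows up as an item link is a post page by construction, which sidesteps the index/tag/calendar problem for whatever the feed still lists. Feeds usually include only recent posts, so older pages would still need the URL or DOM approach.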