Webscraping with Rvest (programmingr.com)
44 points by hanginghyena 3517 days ago
8 comments

Rvest works fine with tabular data. If, however, you are working with data outside of Wikipedia, you will find that website data is very rarely available in a <table> and is instead part of a hierarchical tree, which is a pain to process/clean in R.

In such cases, working with Python/BeautifulSoup4 and importing the clean and normalized data into R will save frustration over time, even offsetting the overhead of using two languages.
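
As a rough illustration of that workflow, here is a minimal sketch that flattens nested markup into rows and writes CSV that R could read with read.csv. It uses the standard library's xml.etree.ElementTree in place of BeautifulSoup4 so it runs without installing anything, which only works because the sample is well-formed; real pages usually need a forgiving parser. The markup, class names, and field names are all made up for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup: the data sits in nested <div>s, not a <table>.
HTML = """
<div class="listing">
  <div class="item">
    <h2>Widget</h2>
    <span class="price">9.99</span>
    <ul class="tags"><li>tools</li><li>metal</li></ul>
  </div>
  <div class="item">
    <h2>Gadget</h2>
    <span class="price">24.50</span>
    <ul class="tags"><li>electronics</li></ul>
  </div>
</div>
"""

def flatten_items(markup):
    """Turn each nested item into one flat row (a dict)."""
    root = ET.fromstring(markup.strip())
    rows = []
    for item in root.findall("./div[@class='item']"):
        rows.append({
            "name": item.findtext("h2").strip(),
            "price": float(item.findtext("span[@class='price']")),
            # Collapse the nested list into a single delimited column.
            "tags": ";".join(li.text for li in item.findall("ul/li")),
        })
    return rows

def to_csv(rows):
    """Serialize the rows as CSV text for import into R."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price", "tags"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = flatten_items(HTML)
csv_text = to_csv(rows)
```

The point is only that the tree-walking and cleanup happen once, before the data reaches R as a plain rectangular file.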

It will work with any data, as long as it can be easily retrieved with a CSS selector. Otherwise you would have problems using any web scraping tool.
JSON is pretty easy to unpack, if you can figure out the call back that gets the data.
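
For what "unpacking" can look like once you have the response body, here is a minimal sketch. It assumes one possible reading of the comment, namely that the endpoint wraps its JSON in a JSONP-style callback; the body, the callback name handleData, and the field names are hypothetical.

```python
import json
import re

# Hypothetical response body; some endpoints wrap JSON in a callback (JSONP).
BODY = ('handleData({"results": [{"id": 1, "user": {"name": "ann"}}, '
        '{"id": 2, "user": {"name": "bob"}}]});')

def strip_jsonp(body):
    """Return the JSON text inside name(...); or the body unchanged if unwrapped."""
    match = re.fullmatch(r"\s*[\w$.]+\s*\((.*)\)\s*;?\s*", body, flags=re.DOTALL)
    return match.group(1) if match else body

def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

payload = json.loads(strip_jsonp(BODY))
rows = [flatten(record) for record in payload["results"]]
```

Each row ends up as a flat dict, which maps directly onto a data frame row.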
The primary use case for web scraping tools like Rvest is data that has no JSON endpoint, where everything is rendered server-side or the page is static.
> In such cases, working with Python/BeautifulSoup4

But Rvest is a BeautifulSoup-inspired library and works pretty much the same way?

The reason so many people were mixing Python code with R was specifically this sort of task. Web scraping in R has meant I haven't touched another tool outside of R for a few years now, and it is great.

Well done Hadley Wickham being inspired by libraries like Beautiful Soup and bringing a great tool to R.

It really looks as easy as it can get. The good part of R is that many R packages are designed in a similar way (highly specialized methods, doing a good job). Combining that with %>% makes you really efficient.
This seems like a really intuitive way of getting the tables. What would be the most similar library in Python for those cases where R isn't available on the system (with the permissions on some lab machines, unfortunately it takes weeks or forever to get R installed)?
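
If installing packages is also a problem on those machines, a table can be pulled out with nothing but the Python standard library. This is a minimal sketch, not a replacement for a real scraping library: it assumes simple, non-nested tables, and the sample HTML is made up. Where packages can be installed, pandas.read_html and BeautifulSoup are the usual choices.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

# Hypothetical sample page fragment.
HTML = """
<table>
  <tr><th>Language</th><th>Library</th></tr>
  <tr><td>R</td><td>rvest</td></tr>
  <tr><td>Python</td><td>BeautifulSoup</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(HTML)
table = parser.tables[0]
```

The first row of the result holds the header cells, so it can be split off before writing the rest to CSV.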
I've written a few blog posts where I used Rvest to get data and R's great visualization tools to visualize it. R has a ton of issues as a platform and language but this is a fantastic package and it has a great ecosystem for small data (the majority of data).
All of these web scraping frameworks, doesn't that tell you that the web needs more widespread semantic markup?

<jobs-list> <job> <employer> YCombinator </employer> <position> ... </position> </job> </jobs-list>

Something like that?

I know what everyone will say, it is so terse and convoluted, but maybe something like

<ul semantic-markup="jobs"> <li semantic-markup="job"> <p semantic-markup="job-employer"> YCombinator </p> <p semantic-markup="job-position"> ... </p> </li> </ul>

Seems like a lot of work though...maybe I take that back.
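
For what it's worth, consuming the attribute-based version would be short. A minimal sketch, assuming the page is well-formed enough for the standard library's XML parser and that the semantic-markup attribute convention above (itself hypothetical) is followed; the position text is a placeholder, since the original elides it.

```python
import xml.etree.ElementTree as ET

# The markup from the comment above; "Example Position" is a made-up placeholder.
MARKUP = """
<ul semantic-markup="jobs">
  <li semantic-markup="job">
    <p semantic-markup="job-employer"> YCombinator </p>
    <p semantic-markup="job-position"> Example Position </p>
  </li>
</ul>
"""

def read_jobs(markup):
    """Return one dict per element labelled semantic-markup="job"."""
    root = ET.fromstring(markup.strip())
    jobs = []
    for job in root.iter():
        if job.get("semantic-markup") != "job":
            continue
        fields = {}
        for node in job.iter():
            label = node.get("semantic-markup", "")
            # "job-employer" becomes the key "employer", and so on.
            if label.startswith("job-"):
                fields[label[len("job-"):]] = (node.text or "").strip()
        jobs.append(fields)
    return jobs

jobs = read_jobs(MARKUP)
```

No site-specific selectors are needed, which is the whole appeal; the cost is on the publisher's side.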

If websites wanted to make their data accessible, they would create APIs. Data on websites is inaccessible for a reason.
I've had great success with Ruby/Mechanize for regular HTML scraping and PhantomJS for dynamic page scraping.
Rvest is for web scraping newbs. A more seasoned R person would still use PhantomJS and RSelenium, as that actually collects all of the page's information, whereas Rvest only collects a portion of it. Try it on washingtonpost.com and you will see.
> Rvest is for web scraping newbs

Downvoted for calling people newbs. Also, it always depends on which tool works best.