Webscraping with Rvest

	Webscraping with Rvest (programmingr.com)
	44 points by hanginghyena 3517 days ago

8 comments

minimaxir 3517 days ago

Rvest works fine with tabular data. If, however, you are working with data outside of Wikipedia, you will find that website data is very rarely available in a <table> and is instead part of a hierarchical tree, which is a pain to process/clean in R.

In such cases, working with Python/BeautifulSoup4 and importing the clean and normalized data into R will save frustration over time, even offsetting the overhead of using two languages.
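To illustrate the "hierarchical tree" case described above, here is a minimal sketch that flattens nested, non-table markup into rows using only Python's standard library (the comment itself suggests BeautifulSoup4, which is not used here). The sample HTML, the class names `item`, `name` and `price`, and the helper `extract_items` are all hypothetical, made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: each record is a nested <div>, not a table row.
SAMPLE_HTML = """
<div class="listing">
  <div class="item"><h2 class="name">Widget</h2><span class="price">9.99</span></div>
  <div class="item"><h2 class="name">Gadget</h2><span class="price">19.50</span></div>
</div>
"""


class ItemParser(HTMLParser):
    """Collect the text of elements whose class is in FIELDS, one dict per item."""

    FIELDS = {"name", "price"}  # assumed class names, not from any real site

    def __init__(self):
        super().__init__()
        self.rows = []       # finished and in-progress records
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "item" in classes:
            self.rows.append({})  # a new record starts here
        for cls in classes:
            if cls in self.FIELDS:
                self._field = cls

    def handle_data(self, data):
        # Only keep text that sits directly inside a field element.
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None


def extract_items(html):
    parser = ItemParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_items(SAMPLE_HTML)
for row in rows:
    print(row)
```

The resulting list of flat dicts can be written to CSV and read into R, which is the two-language workflow the comment describes.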

haddr 3517 days ago

It will work with any data, as long as it can be easily retrieved with some CSS selector. Otherwise you would have problems using any web scraping tool.

sixtypoundhound 3517 days ago

JSON is pretty easy to unpack, if you can figure out the callback that gets the data.
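As a rough sketch of what "unpacking" can look like once the data call has been found: the snippet below parses a JSON body and flattens nested objects into one flat row per record. The payload in `RAW`, its `results` key, and the `flatten` helper are hypothetical stand-ins, since every site's response has its own shape.

```python
import json

# Hypothetical response body, standing in for whatever the page's data call returns.
RAW = """
{"results": [
  {"id": 1, "user": {"name": "ann"}, "tags": ["r", "web"]},
  {"id": 2, "user": {"name": "bob"}, "tags": []}
]}
"""


def flatten(record, prefix=""):
    """Turn nested dicts into dotted keys; join lists into comma-separated strings."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            flat[name] = ",".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat


rows = [flatten(r) for r in json.loads(RAW)["results"]]
for row in rows:
    print(row)
```

Joining lists into strings is a simplification chosen here to keep each record on one row; lists of nested objects would need their own handling.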

minimaxir 3516 days ago

The primary use case for web scraping tools like Rvest is data that doesn't have a JSON endpoint, where everything is rendered server-side or the page is static.

baldfat 3516 days ago

> In such cases, working with Python/BeautifulSoup4

BUT Rvest is a BeautifulSoup-inspired library and works pretty much the same way?

baldfat 3517 days ago

The reason so many people were mixing Python code with R was specifically this sort of task. Web scraping in R has meant I haven't had to touch another tool outside of R for a few years now, and it is great.

Well done to Hadley Wickham for taking inspiration from libraries like Beautiful Soup and bringing a great tool to R.

haddr 3517 days ago

It really looks as easy as it can get. The good part of R is that many R packages are designed in a similar way (highly specialized methods, doing a good job). Combining that with %>% makes you really efficient.

jbmorgado 3515 days ago

This seems a really intuitive way of getting the tables. What would be the most similar library in Python for those cases where R isn't available on the system (with the permissions on some lab machines, unfortunately it takes weeks-to-forever to get R installed)?
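For a locked-down machine where installing packages is also a problem, a plain `<table>` can be read with nothing but Python's standard library. The sketch below is not a claim about which library is closest to Rvest; it only shows the table-to-records step. The `TableParser` class, the `read_table` helper, and the sample table are hypothetical, and the code assumes a single table with a header row and no nested tables or `colspan`/`rowspan` cells.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect cell text row by row from a single, non-nested HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the <tr> being read
        self._cell = None   # text fragments of the <td>/<th> being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def read_table(html):
    """Return one dict per body row, keyed by the first row's cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample table for illustration.
SAMPLE = (
    "<table>"
    "<tr><th>Language</th><th>Library</th></tr>"
    "<tr><td>R</td><td>rvest</td></tr>"
    "<tr><td>Python</td><td>BeautifulSoup</td></tr>"
    "</table>"
)

records = read_table(SAMPLE)
for record in records:
    print(record)
```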

josep2 3517 days ago

I've written a few blog posts where I used Rvest to get data and R's great visualization tools to visualize it. R has a ton of issues as a platform and language but this is a fantastic package and it has a great ecosystem for small data (the majority of data).

gabrielcsapo 3517 days ago

All of these web scraping frameworks, doesn't that tell you that the web needs more widespread semantic markup?

  <jobs-list>
    <job>
      <employer> YCombinator </employer>
      <position> ... </position>
    </job>
  </jobs-list>

Something like that?

I know what everyone will say, it is so terse and convoluted, but maybe something like

  <ul semantic-markup="jobs">
    <li semantic-markup="job">
      <p semantic-markup="job-employer"> YCombinator </p>
      <p semantic-markup="job-position"> ... </p>
    </li>
  </ul>

Seems like a lot of work though...maybe I take that back.
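To show what such markup would buy a scraper, here is a small sketch that reads the attribute-based variant with Python's standard XML parser. The `semantic-markup` attribute is the commenter's proposal, not an existing standard, and the `read_jobs` helper is hypothetical. The position text is left as the elided `...` from the comment, and the fragment is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# The markup proposed in the comment; "..." is elided there and kept as-is here.
MARKUP = """
<ul semantic-markup="jobs">
  <li semantic-markup="job">
    <p semantic-markup="job-employer"> YCombinator </p>
    <p semantic-markup="job-position"> ... </p>
  </li>
</ul>
"""


def read_jobs(markup):
    """Return one dict per element labelled 'job', keyed by its 'job-*' children."""
    root = ET.fromstring(markup.strip())
    jobs = []
    for job in root.iter("li"):
        if job.get("semantic-markup") != "job":
            continue
        record = {}
        for field in job:
            label = field.get("semantic-markup", "")
            if label.startswith("job-"):
                # "job-employer" -> "employer"
                record[label[len("job-"):]] = (field.text or "").strip()
        jobs.append(record)
    return jobs


jobs = read_jobs(MARKUP)
print(jobs)
```

With labels like these, the scraper needs no site-specific selectors, only the agreed attribute names.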

nathancahill 3516 days ago

If websites wanted to make their data accessible, they would create APIs. Data on websites is inaccessible for a reason.

ankimal 3516 days ago

I've had great success with Ruby/Mechanize for regular HTML scraping and PhantomJS for dynamic page scraping.

data_spy 3516 days ago

Rvest is for web scraping newbs. A more seasoned R person would still use PhantomJS and RSelenium, as that actually collects all of the page's information, whereas Rvest only collects a portion of it. Try it on washingtonpost.com and you will see.

baldfat 3516 days ago

> Rvest is for web scraping newbs

Downvoted for calling people newbs. Also, it always depends on which tool works best.
