by minimaxir 3517 days ago
rvest works fine with tabular data. If, however, you are working with data outside of Wikipedia, you will find that website data is very rarely available in a <table>; it is usually spread across a hierarchical tree of nested elements, which is a pain to process and clean in R.

In such cases, working with Python/BeautifulSoup4 and importing the clean and normalized data into R will save frustration over time, even offsetting the overhead of using two languages.
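
To make the "hierarchical tree" point concrete, here is a rough sketch of flattening nested markup into rows and emitting a CSV that R could load with read.csv. It uses Python's standard-library html.parser rather than BeautifulSoup4, purely so it runs without installing anything; the markup, the class names (product, name, price) and the helper scrape_to_csv are all made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: each record is a nested <div>, not a <table> row.
SAMPLE_HTML = """
<div class="product">
  <h2 class="name">Widget</h2>
  <div class="meta"><span class="price">$9.99</span></div>
</div>
<div class="product">
  <h2 class="name">Gadget</h2>
  <div class="meta"><span class="price">$24.50</span></div>
</div>
"""

FIELDS = ("name", "price")  # assumed class names that mark the fields we want


class ProductParser(HTMLParser):
    """Collect one dict per <div class="product"> block."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # field whose text we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "product" in classes:
            self.rows.append({})  # a new record starts here
        for field in FIELDS:
            if field in classes and self.rows:
                self._field = field

    def handle_data(self, data):
        if self._field and data.strip():
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None  # any closing tag ends the current field


def scrape_to_csv(html):
    """Parse the nested markup and return normalized CSV text."""
    parser = ProductParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in parser.rows:
        row["price"] = float(row["price"].lstrip("$"))  # "$9.99" -> 9.99
        writer.writerow(row)
    return out.getvalue()


csv_text = scrape_to_csv(SAMPLE_HTML)
print(csv_text)
```

With BeautifulSoup4 the parser class would shrink to a few selector calls, but the shape of the job is the same: walk the tree, pull out the fields, normalize them, and hand R a flat file.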

2 comments

It will work with any data, as long as the data can be easily retrieved with a CSS selector. Otherwise you would have problems with any web scraping tool.
JSON is pretty easy to unpack, if you can figure out the request that fetches the data.
The primary use case for web scraping tools like rvest is data that doesn't have a JSON endpoint, where everything is rendered server-side or the page is static.
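
For what "easy to unpack" looks like in practice, a minimal sketch in Python, assuming you have already captured the response body of that request. The payload shape, the "results" key and the flatten helper are invented for the example, not taken from any real endpoint.

```python
import json

# Hypothetical response body from a page's background data request.
RAW = (
    '{"results": ['
    '{"id": 1, "user": {"name": "ann"}, "tags": ["r", "py"]},'
    '{"id": 2, "user": {"name": "bob"}, "tags": []}'
    ']}'
)


def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted keys; join lists of scalars with ';'."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            flat[name] = ";".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat


# One flat dict per record, ready to be written out as a table.
rows = [flatten(item) for item in json.loads(RAW)["results"]]
print(rows)
```

Each record ends up as a flat dict with keys like user.name, which maps directly onto a data frame column.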
> In such cases, working with Python/BeautifulSoup4

But rvest is a BeautifulSoup-inspired library and works pretty much the same way?