Our R environment – the tidyverse

Ceci n'est pas une pipe

The tidyverse

Within R, we’re going to be using a set of packages that are part of what is called the “tidyverse”. These used to be distinguished by a special symbol called a “pipe”: %>%. The tidyverse pipe was so successful that the same functionality is now available in “base R” – it just looks slightly differently: |>. How does the pipe work? It passes whetever is on the left side of the pipe as the first argument to the right side of the pipe:
If we want to combine two strings, we can use paste0("Data", "Science). With a pipe, we can write the same thing as "Data" |> paste0("Science")
- If this isn’t completely clear, don’t worry about it. The example later on make this very intuitive.

Setting up our R environment

Let’s load up some helpful packages:

library(rvest) # the basic scraping library
library(dplyr) # a key part of the "tidyverse" that helps us manipulate Data
library(stringr) # a useful library that helps us clean up scraped text

You might need to install these packages if you don’t have them.

R’s webscraping commands

Within R, webscraping is best done with a package called rvest
Here are the functions you need to know:
- read_html(url) will read a webpage into memory from the given url
- html_node(page, css selector) will return the first node of a given page matched by the CSS selector.
- html_nodes(page, css selector) works identically to html_node but selects all nodes matched by the css selector
- html_text(node) extracts all text from a nodes
- html_attr(node, attr) extracts the content of a given attribute from a node.
- html_table(node) extracts an entire table into an R data frame
Go to the next page