- We talk about web scraping when we retrieve structured information from content on a website.
- One example many of you will know is Zotero, which scrapes information from websites and turns it into bibliographic metadata.
- Web scraping is effective when we have:
  - A large amount of information
  - Presented in the same structure on a page or across multiple pages
  - And we’re interested in retrieving that information
- This is distinct from “web crawling”, where we try to download a large number of whole webpages. R can be used for that, too, using a different package.
- It’s also different from using APIs (Application Programming Interfaces), which are specifically designed to allow you to request structured information.
Our Example
- We will work on scraping some basic information for all Assembly Members of the New York State Assembly: https://nyassembly.gov/mem/
The Structure of the Web
- A webpage is made up of a number of elements, called “nodes,” that hold its content. Every node begins with an opening tag such as <div> and ends with a closing tag such as </div>. Nodes are nested inside each other.
Here’s what a very simple web page may look like:
<html>
  <head>
    <title>Website Title</title>
  </head>
  <body>
    <h1>Headline</h1>
    <div>Some text</div>
    <div>More text</div>
  </body>
</html>
- You’ll see above some elements that occur in virtually every webpage:
  - The head is not visible in the browser window, but often contains metadata, including the “title” that’s displayed in your open browser tab.
  - Elements like h1, which are headings. There should only be one h1 element, but there can be multiple h2, h3, etc.
  - Elements like div, span, and p, which structure how text and images are displayed on the page.
- Modern websites, however, have one more feature that adds to their structure. Every element can have one or more “attributes” that are included in the opening tag of the node like so:
<div class="article-text">Some text</div>
<div class="article-comment">Some comment</div>
- Attributes can be any string, but there are two attributes that are particularly common and well defined: “class” and “id”.
- The main difference between the two to keep in mind is that the same class can occur multiple times on a single webpage. A class is often used to mark content of a certain type, such as part of the text of the main article, e.g.
<div class="article-text">Some text</div>
<div class="article-text">Some more text</div>
- On the other hand, in a decently well-written webpage, every id should occur only once. An id is used to label (“identify”) specific content, e.g.
<div id="abstract">Abstract for the text</div>
<div id="text-body">The body of the article</div>
<div description="special-content">This is somehow special</div>
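To make the nesting and the attributes concrete, here is a minimal sketch in R. It assumes the rvest package, which these notes do not name, and it parses a string that combines the simple page and the attribute examples from above instead of a live website.

```r
library(rvest)  # assumption: rvest is installed; the notes do not name a package

# The simple page from above, with some of the attribute examples added to the body
page <- read_html('
<html>
  <head>
    <title>Website Title</title>
  </head>
  <body>
    <h1>Headline</h1>
    <div id="abstract">Abstract for the text</div>
    <div class="article-text">Some text</div>
    <div class="article-text">Some more text</div>
  </body>
</html>')

# Nesting: the root <html> node contains <head> and <body>
html_name(html_children(page))

# The nodes nested directly inside <body>, and the text they hold
body_nodes <- html_children(html_element(page, "body"))
html_name(body_nodes)
html_text2(body_nodes)

# Attributes: NA marks nodes that do not carry the attribute
html_attr(body_nodes, "class")
html_attr(body_nodes, "id")
```

The class column repeats ("article-text" appears twice), while the id appears on a single node, which mirrors the distinction described above.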
Looking at Webpage Structure
You can look at how an element of a webpage is represented in the HTML code by using the “Inspect” function of your browser. Go ahead and try this right now!
Selecting Nodes: Cascading Stylesheets (CSS)
- What makes scraping modern webpages easy is that their look is determined by “Cascading Stylesheets” (CSS). How a node on a webpage is displayed depends on its tag and its attributes.
- We can take advantage of the language that CSS uses to match these elements to scrape from websites. This language is called “CSS selectors”.
- We’ll stick to the basics. You can find all CSS selector rules here:
CSS Selector rules
- To select a node by its name, simply use the name:
div
- To select a node anywhere within a different node, use the outer node followed by a space, then the inner node:
body h1
- To select an element with a given id use # followed by the id:
#abstract
- To select all elements of a given class, use . followed by the class name:
.article-text
Remember: There can be (and often are) multiple elements with the same class.
- To select a specific kind of node with a given class (or id), simply join the node name and the class (or id) selector together:
div.article-text
- To select an element based on a different attribute, use [attribute="value"], e.g.
[description="special-content"]
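The rules above can be tried out on a small in-memory page. This is a sketch that assumes the rvest package; select_text() is a hypothetical helper defined only for this illustration, and the extra p node is added to show the difference between .article-text and div.article-text.

```r
library(rvest)  # assumption: rvest is installed

# A small page built from the examples in these notes, plus one added <p> node
page <- minimal_html('
  <h1>Headline</h1>
  <div id="abstract">Abstract for the text</div>
  <div class="article-text">Some text</div>
  <div class="article-text">Some more text</div>
  <p class="article-text">A paragraph with the same class</p>
  <div description="special-content">This is somehow special</div>
')

# Hypothetical helper for this sketch: the text of all nodes matching a selector
select_text <- function(selector) {
  html_text2(html_elements(page, selector))
}

select_text("div")                               # by node name: every div
select_text("body h1")                           # an h1 anywhere inside body
select_text("#abstract")                         # by id
select_text(".article-text")                     # by class: two divs and the p
select_text("div.article-text")                  # node name and class: only the divs
select_text('[description="special-content"]')   # by another attribute
```

html_elements() returns every match, so a class selector usually yields several nodes, while an id selector should yield one.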
Exercise
Looking at a specific Assembly Member, Khaleel Anderson, what would the CSS selector be for his district?
A Helpful Tool
- There are many tools that can help you generate CSS selectors. I like the CSS Selector Gadget, which comes as a Chrome extension and a bookmarklet.
- By letting you click elements into and out of the selection, it makes it easy to generate a CSS selector.
Exercise
- Using the CSS Selector Gadget on the directory of the Assembly, find a CSS selector that includes all email addresses, but nothing else.
- Caution: The CSS Selector Gadget is wonderfully helpful, but it can lead you astray by producing overly specific selectors.
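As an illustration of what “overly specific” can mean, here is a sketch that assumes the rvest package. The markup, class names, and addresses are invented for this example and do not reflect how the Assembly directory is actually structured.

```r
library(rvest)  # assumption: rvest is installed

# Invented markup for illustration only; the real directory looks different
page <- minimal_html('
  <ul>
    <li class="member">
      <span class="name">Member One</span>
      <a class="email" href="mailto:one@example.org">one@example.org</a>
    </li>
    <li class="member">
      <span class="name">Member Two</span>
      <a class="email" href="mailto:two@example.org">two@example.org</a>
    </li>
  </ul>
')

# Overly specific: tied to the first list item, so it misses everyone else
too_specific <- html_elements(page, "li.member:nth-child(1) > a.email")

# Simpler: matches the email link of every member
general <- html_elements(page, "a.email")

length(too_specific)
length(general)
html_text2(general)
```

A quick check of how many nodes a selector returns, compared with how many you expect, is a simple way to catch a selector that is too narrow.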