Introduction: Webscraping and APIs
Webscraping is the process of collecting ("scraping") information available on Internet websites. This includes the content of web pages, but it may also include information about the pages themselves (data about data, also called "metadata"). For example, you might be interested in collecting the content of a series of Wikipedia pages, and you might also want to collect the Wikimedia user account names that contributed to building those pages, as well as when the pages were last updated. Webscraping is a broad term and covers many related tools and approaches that enable users to "scrape" content from various web-based sources.
Application Programming Interfaces (APIs) are web-based tools that facilitate access to data hosted on various platforms. Companies and organizations use APIs to enforce controlled access to large amounts of information. For example, there might be a limit on the number of "calls," or requests for information, during a certain period of time. Further, you might only be able to access a subset of a website's information rather than all of its content. The New York Times APIs, for example, will only deliver metadata about some articles, rather than providing both article metadata and full-text articles.
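To make this concrete, here is a minimal sketch of an API call in Python using the requests library. It queries Wikipedia's public MediaWiki API for the most recent revisions of a page, returning the kind of metadata described above (contributor names and edit timestamps); the page title and revision limit are arbitrary examples:

```python
import requests

# Ask the MediaWiki API for the five most recent revisions of a page,
# including each contributor's user name and the edit timestamp.
response = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "titles": "Web scraping",   # example page title
        "prop": "revisions",
        "rvprop": "user|timestamp",
        "rvlimit": 5,
        "format": "json",
    },
    timeout=30,
)
response.raise_for_status()

# The response nests pages under "query" -> "pages", keyed by page ID.
for page in response.json()["query"]["pages"].values():
    for revision in page["revisions"]:
        print(revision["user"], revision["timestamp"])
```

Note that many APIs, including the New York Times APIs mentioned above, additionally require you to register for an API key and to stay within published rate limits.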
✨Getting started and getting help✨
For getting started with web scraping and APIs, try "Data Collection (Web Scraping, APIs, Social Media)" by Melanie Walsh, in Introduction to Cultural Analytics and Python, ver. 1 (2021).
If you need help with webscraping or APIs and don't know who to ask, contact the Cornell Center for Social Sciences with troubleshooting questions, or the Digital CoLab at Cornell University Library for help navigating sources, licenses, copyright and other access-related questions.
Stay in the loop on upcoming workshops presented by the Cornell Center for Social Sciences and Cornell University Library.
Ethics and Privacy in Scraping Content
First things first: Before downloading a chunk of content from an online source, pause and think about the information you're collecting. Even where scraping publicly available information is legal under U.S. federal law, that doesn't mean all public information should be scraped. Consider: What restrictions does the website place on the content it hosts? Did the authors or creators intend for their content to be used in this way? Is there any way that collecting this information might inadvertently harm someone?
These questions often have ambiguous answers depending on the sources you are looking at, but that doesn't mean you shouldn't consider the impact of scraping web content that reflects human experiences and insights.
The following resources lay out key considerations for privacy and ethics of gathering web-based information:
"Users' Data: Legal & Ethical Considerations,"
from Introduction to Cultural Analytics & Python, ver. 1 (Walsh, 2021):
Receive quick guidance on topics such as IRB reviews, citations, collaborating with online users when using their posts, and models for ethically using social media data.
Building Legal Literacies for Text Data Mining
(eds. Samberg & Vollmer, 2021):
Explore more in-depth explanations on legal, ethical and privacy-related issues in data mining.
Understanding HTML and Web Page Structures
HTML stands for HyperText Markup Language. It is the standard language for structuring web content. When you download any amount of content from the web, you will encounter HTML!
Familiarize yourself with the language by exploring any of the following resources; a short annotated fragment follows the list:
- HTML tags, elements and attributes (GeeksforGeeks)
- What are HTML Attributes (GeeksforGeeks)
- List of all HTML Attributes (W3Schools)
- List of all HTML Tags (W3Schools)
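To make those terms concrete, here is a small, hypothetical HTML fragment. `<p>` and `<a>` are tags; a start tag, its content and its end tag together form an element; class and href are attributes:

```html
<!-- <p> and <a> are tags; each tag pair plus its content is an element;
     class and href are attributes attached to the start tags. -->
<p class="intro">
  Read more on the
  <a href="https://example.com/about">about page</a>.
</p>
```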
Understanding JSON Files
Once you have scraped data or collected it via an API, you will often end up with a group of files in JSON format.
JSON files present data as name-value pairs nested within different layers, or containers, called objects. JSON stands for JavaScript Object Notation; its syntax is based on JavaScript, the language many web platforms use to develop their sites.
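For example, a small, hypothetical JSON file describing a scraped page might look like the following, with name-value pairs nested inside objects (the curly-brace containers) and a list of values inside square brackets:

```json
{
  "title": "Example page",
  "last_updated": "2023-11-02T14:05:00Z",
  "metadata": {
    "contributors": ["user_one", "user_two"],
    "word_count": 1250
  }
}
```

In Python, the built-in json module (json.load) reads such a file into nested dictionaries and lists.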
Programming Packages for Webscraping
Python
Prerequisites:
Before getting started with webscraping in Python, learn how the programming language works:
- Python Basics
- Python tutorials from W3 Schools
Beautiful Soup
- Beautiful Soup Documentation
- Python Beautiful Soup Workshop Recording (Cornell Center for Social Sciences, Fall 2023)
- Sample BeautifulSoup Python Script File and Guide created by Cornell Center for Social Sciences
- "Scraping Open Data from the Web with BeautifulSoup" blog post tutorial created by Jajwalya Karajgikar of Penn Libraries at the University of Pennsylvania
Selenium
- Python Selenium
- Sample Selenium Python Script File and Guide created by Cornell Center for Social Sciences
- Python Selenium Workshop Recording (Cornell Center for Social Sciences, Fall 2023)
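Selenium differs from Beautiful Soup in that it drives a real browser, which makes it useful for pages that only render their content with JavaScript. A minimal sketch in Python follows; the URL and CSS selector are placeholders, and running it requires Chrome (Selenium 4 can manage the browser driver for you):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a headless Chrome browser.
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

try:
    # Load the page; Selenium renders it like a real browser,
    # including content injected by JavaScript.
    driver.get("https://example.com/articles")  # placeholder URL

    # Select elements with a CSS selector (the selector is hypothetical).
    for heading in driver.find_elements(By.CSS_SELECTOR, "h2.headline"):
        print(heading.text)
finally:
    driver.quit()
```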
R
rvest
- "Web Scraping in R" tutorial by Antoine Soetewey (2023)
RSelenium
Freely available resources and tools
MIT Libraries developed a handy list of tools and portals that help facilitate data mining, which we share below. Each linked resource connects to a page on the MIT Libraries guide site that includes an explanation of, and access links for, that resource:
Creative Commons license: The content on this guide is under an Attribution-NonCommercial Creative Commons license (abbreviated CC BY-NC 4.0). You may use, share and adapt this content for noncommercial purposes as long as you give appropriate credit, indicate if any changes were made, and do not place stricter restrictions than the CC BY-NC 4.0 license allows. Learn more about this license on the Creative Commons website.