LibGuides: ILRST 2110 Statistical Methods for the Social Sciences II : "Fun" Data Sources

Summer Olympics Medalist Data

Giorgio Comai, from the European Data Journalism Network, created a dataset of Summer Olympic medalist info using Wikipedia data. There is also an interesting spatial data component. Data are primarily from the 2020 and 2024 Olympics with some data available for other past games as well.

Olympic Medalist Data

"Mutant Moneyball"

"Anderson Evans’s Mutant Moneyball project uses comic book market data to explore the financial value of individual X-Men characters. The project’s dataset provides decade-by-decade statistics for 26 members of the team, drawn from sales histories and pricing guides, as well as a matrix indicating the issues in which each character appeared." (From Data is Plural.)

Mutant Moneyball

Database on Ideology, Money in Politics, and Elections (DIME)

"The Database on Ideology, Money in Politics, and Elections (DIME) provides a general resource for the study of campaign finance and ideology in American politics. The database was developed as part of an on-going effort to construct a comprehensive ideological mapping of political elites, interest groups, and donors. Constructing the database required a large-scale effort to compile, clean, and process data on contribution records, candidate characteristics, and election outcomes from various sources. The current database contains over 500 million itemized political contributions made by individuals and organizations to local, state, and federal elections covering from 1979 to 2022. A corresponding database of candidates and committees provides additional information on state and federal elections."

Database on Ideology, Money in Politics, and Elections (DIME): Public version 3.1

Iowa Liquor Sales Explorer

The state of Iowa tracks spirits purchase information for grocery stores, convenience stores, and similar establishments, and makes that information available to the public. Users can download sales information down to the bottle level dating back to 2012, for a fascinating picture of commerce in Iowa. The data can be filtered (any filter you set applies to every chart shown), and a CSV file can be downloaded by clicking the three dots in the top right-hand corner of each chart's box.

Iowa Liquor Sales Explorer Data

Billboard Hot 100 Hits

"For his upcoming book, Uncharted Territory: What Numbers Tell Us about the Biggest Hit Songs and Ourselves, Chris Dalla Riva has compiled a dataset of all 1,100+ Billboard number one hits from 1958 to early 2025. For each song, the dataset includes information about the artists, songwriters, producers, and label; genre, time signature, keys, BPM, the presence of various instruments; song structure and lyrics; whether the song was entered into Eurovision; and much more." (Description from Data Is Plural)

Billboard Hot 100 Hits Database (Google Sheet)

National Database of Childcare Prices

From Data Is Plural: "The National Database of Childcare Prices, launched in January by the Department of Labor’s Women’s Bureau, “is the most comprehensive federal source of childcare prices at the county level.” For each county and year from 2008 to 2018, the dataset provides estimates of the median and 75th-percentile weekly cost, disaggregated by provider type and child age. The estimates are calculated from the market surveys the federal Child Care and Development Fund requires participating states to conduct."

National Database of Childcare Prices
Direct link to data download
Technical Guide
Data dictionary starts on page 34 (Appendix D)

Movie and TV Data

IMDb provides several downloadable datasets for personal and non-commercial use. Datasets can be used individually or combined.

IMDb Datasets
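
If you want to combine two of these datasets, the short pandas sketch below shows one way to do it. It assumes that the title and ratings files are tab-separated and share a `tconst` title identifier column; check the dataset documentation before relying on these column names. The rows are invented placeholders, not real IMDb records.

```python
import io

import pandas as pd

# Placeholder rows standing in for two tab-separated files.
# The IDs, titles, and values here are made up for illustration only.
basics_tsv = (
    "tconst\tprimaryTitle\tstartYear\n"
    "tt_example_1\tExample Title A\t2001\n"
    "tt_example_2\tExample Title B\t2005\n"
)
ratings_tsv = (
    "tconst\taverageRating\tnumVotes\n"
    "tt_example_1\t7.5\t1200\n"
)

# With downloaded files you would pass a file path instead of StringIO.
basics = pd.read_csv(io.StringIO(basics_tsv), sep="\t")
ratings = pd.read_csv(io.StringIO(ratings_tsv), sep="\t")

# A left join keeps every title, including titles that have no rating row.
combined = basics.merge(ratings, on="tconst", how="left")
print(combined)
```

The left join is a deliberate choice here: titles without a rating stay in the result with missing values, which you can then count or drop explicitly.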

Amber Thomas also collected a dataset of the ages of actors and the characters they portray on teen television shows, in order to analyze the age differences between actors and their characters.

Age of Characters and Actors in Teen TV Shows

Tour de France

The Tour de France provides rider results dating back to 1903. You can't download the data from their site. However, Thomas Camminady, an applied mathematician, scraped their data to build a series of CSV files containing Tour de France rider and stage data. (Note: He includes some easy instructions on how to import the data into Python using pandas, but does not include similar instructions for R.)

Thomas Camminady's Tour de France dataset
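
For a sense of what working with a CSV file in pandas looks like, here is a minimal sketch. It is not taken from the dataset's own instructions: the column names (`year`, `rider`, `stages_won`) and the rows are hypothetical placeholders, so substitute the real file and its actual columns.

```python
import io

import pandas as pd

# Hypothetical CSV content; the real files will have different columns.
csv_text = """year,rider,stages_won
1903,Rider A,3
1903,Rider B,1
1904,Rider A,2
"""

# With a downloaded file you would pass its path instead of StringIO.
riders = pd.read_csv(io.StringIO(csv_text))

# Total stage wins per rider across all years, largest first.
wins_by_rider = (
    riders.groupby("rider")["stages_won"].sum().sort_values(ascending=False)
)
print(wins_by_rider)
```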

European Cross-Border Rail Data

This dataset from OBC Transeuropa details passenger train routes in Europe that cross national borders. The spreadsheet linked below has an excellent readme in its first tab.

Here are two articles that use this data to tell a story about European rail travel:

Four ways of looking at European cross-border rail links

More and more trains crossing European borders

European Cross-Border Rail Data
Data from OBC Transeuropa identifying 271 passenger train routes that cross Europe's national borders. The data are available in Google Sheets.

Wait Wait...Don't Tell Me!

Linh Pham provides a wealth of structured data about NPR's popular quiz show, dating back to 2007. The data are also available via API.

Wait Wait...Don't Tell Me!

Katherine Dunham

Dunham’s Data: Katherine Dunham and Digital Methods for Dance Historical Inquiry is a digital humanities project that uses 20th-century African-American choreographer Katherine Dunham as a case study. The data is curated from a large body of undigitized primary source materials. Interesting visualizations are included, but the raw data are also available for you to explore and conduct your own analysis.

Dunham's Data - Datasets

USPS Performance Data

As part of Jones v. United States Postal Service, a federal lawsuit filed in August, USPS must submit weekly performance reports that indicate, at a national and district level, the percentage of mail that was processed (though not necessarily delivered) on time. The agency files these reports as PDFs; Save the Post Office, a decade-old website run by a retired English professor, has been collecting those PDFs and converting them into spreadsheets. Related: Aaron Gordon’s pre-election analysis of the USPS data

(Data description from data-is-plural.com)

USPS Weekly Service Performance Reports
Data submitted in Jones v USPS

Transit Costs

“Why do transit-infrastructure projects in New York cost 20 times more on a per kilometer basis than in Seoul?” With the aim of answering questions like these, the NYU-based Transit Costs Project is building a dataset that already spans more than 500 urban rail projects around the world. For each project, the dataset specifies the city, start year, end year, rail length, number of stations, total cost, and more.

(Data description via data-is-plural.com)

Transit Costs Project Data
What does the data say?
Data visualizations provided by the Transit Costs Project to give you an idea of the story the data are telling and some of the ways they have been analyzed so far.

Opera Performance Data

Operabase has gathered information about more than 500,000 opera performances staged since 1996. A dataset on six full seasons of opera stagings across hundreds of cities is available. The data were originally collected to support a study of how copyright affects opera performance frequency. (Linked below.)

Link to Operabase replication data
Grand rights and opera reuse today - Alexander Cuntz
Article that uses the Operabase data to examine copyright availability and the staging of operas.
