ILRST 2130 - Applied Regression Analysis (Fall 2022): "Fun" Data Sources

What you'll find on this page

This page contains data sets that don't fit into any particular category. They are often simply interesting or entertaining, or they use data to tell a great story. Many of these data sets were found via Data Is Plural, a (mostly) weekly newsletter that curates interesting or curious data.

European Cross-Border Rail Data

This dataset from OBC Transeuropa details passenger train routes in Europe that cross national borders. The file linked below includes an excellent readme in its first tab.

Here are two articles that utilize this data to tell a story about European rail travel:

Four ways of looking at European cross-border rail links

More and more trains crossing European borders

LFB Animal Rescue Incidents

The London (UK) Fire Brigade publishes a dataset of the animal rescue incidents it has attended. The data are updated monthly. The Brigade notes that the data were published because it is routinely asked to provide data on the 'special services' (i.e., non-fire incidents) it attends. The data are available on London's public data repository.

Bay Area Rental Prices

Economist Kate Pennington collected two decades' worth of rental listings to create a dataset of rental prices in the San Francisco Bay Area, scraping Craigslist posts via the Wayback Machine. Both the raw and the cleaned data are available for analysis. The cleaned data are linked below.

Movie and TV Data

IMDb provides several downloadable datasets for personal and non-commercial use. The datasets can be used individually or combined.
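As a rough sketch of what "combined" can look like in practice, the example below joins two small inline tables on a shared title identifier using pandas. It assumes the tab-separated layout IMDb documents for its title files, with `tconst` as the key and `\N` marking missing values; the rows themselves are made up for illustration, and the helper `read_imdb` is a hypothetical name. Check the column names against IMDb's current documentation before relying on them.

```python
import io

import pandas as pd

# Tiny inline stand-ins for two IMDb files. The rows are invented for
# illustration; only the tab-separated layout and "\N" marker are assumed
# to match the real downloads.
basics_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0000001\tmovie\tExample Film\t1999\n"
    "tt0000002\ttvSeries\tExample Show\t\\N\n"
)
ratings_tsv = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t7.5\t1200\n"
)


def read_imdb(source):
    """Read a tab-separated table, treating the literal \\N as missing."""
    return pd.read_csv(source, sep="\t", na_values="\\N")


basics = read_imdb(io.StringIO(basics_tsv))
ratings = read_imdb(io.StringIO(ratings_tsv))

# A left join keeps every title, including those without a rating.
combined = basics.merge(ratings, on="tconst", how="left")
print(combined)
```

With the real downloads, the same helper could be pointed at a local file path instead of an in-memory string; pandas infers gzip compression from a `.gz` extension.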

Amber Thomas also collected a dataset of the ages of actors and the characters they portray on teen television shows, in order to analyze the age differences between actors and their characters.

Tour de France

The Tour de France provides rider results dating back to 1903, but you can't download the data from the race's site. However, Thomas Camminady, an applied mathematician, scraped the data to build a series of CSV files full of Tour de France rider and stage data. (Note: He includes some easy instructions on how to import the data into Python using pandas, but does not include similar instructions for R.)
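For readers who have not used pandas before, here is a minimal sketch of loading CSV data and summarizing it. It is not taken from Camminady's instructions: the column names (`year`, `stage`, `distance_km`) and the values are hypothetical placeholders, and the data is read from an in-memory string so the example runs without any downloaded files.

```python
import io

import pandas as pd

# Hypothetical stand-in for a stage-level CSV file. Column names and
# values are placeholders, not real Tour de France data.
stages_csv = (
    "year,stage,distance_km\n"
    "1903,1,100\n"
    "1903,2,200\n"
    "1904,1,150\n"
)

# With a downloaded file, pass its path to read_csv instead of StringIO.
stages = pd.read_csv(io.StringIO(stages_csv))

# Total distance per year, as an example of a simple grouped summary.
distance_by_year = stages.groupby("year")["distance_km"].sum()
print(distance_by_year)
```

Once the real files are downloaded, inspecting `stages.columns` is a quick way to see which variables are actually available.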

Venture Studios and Startups

The Venture Studio Index is a free, public database of venture studios and their startups intended for founders, operators and investors working with venture studios.

Transit Costs

“Why do transit-infrastructure projects in New York cost 20 times more on a per kilometer basis than in Seoul?” With the aim of answering questions like these, the NYU-based Transit Costs Project is building a dataset that already spans more than 500 urban rail projects around the world. For each project, the dataset specifies the city, start year, end year, rail length, number of stations, total cost, and more.

(Data description via data-is-plural.com)
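The per-kilometer comparison described above is total cost divided by rail length. The sketch below shows that arithmetic in pandas on two invented projects; the column names (`length_km`, `cost_musd`) and all figures are hypothetical and are not drawn from the Transit Costs Project data.

```python
import pandas as pd

# Two made-up projects; names and numbers are placeholders only.
projects = pd.DataFrame(
    {
        "city": ["City A", "City B"],
        "length_km": [10.0, 20.0],
        "cost_musd": [5000.0, 1000.0],  # total cost, millions of USD
    }
)

# Cost per kilometer = total cost / rail length.
projects["cost_per_km"] = projects["cost_musd"] / projects["length_km"]

# How many times more expensive per km is City A than City B?
ratio = projects.loc[0, "cost_per_km"] / projects.loc[1, "cost_per_km"]
print(projects)
print(ratio)
```

With the real dataset, check the units and currency of the cost column before comparing projects across countries or years.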

Opera Performance Data

Operabase has gathered information about more than 500,000 opera performances staged since 1996. A dataset on six full seasons of opera stagings across hundreds of cities is available. The data were originally collected to support a study of how copyright affects opera performance frequency. (Linked below.)