LibGuides: Collections as Data: Collecting and Visualizing: Gathering Data

Finding datasets

For collections-as-data work, your dataset can be anything from a group of objects in a particular museum collection to a list of films produced during a certain period of time. The only criterion is that the data must form a collection or collections. While we often work with visual media like images and video as the digital objects in a dataset, your dataset can be a collection of texts as well. Some places to find datasets include:

Responsible Datasets in Context
Academic library digital collections
Large museum collections
Library of Congress Selected Datasets
Awesome Public Datasets
Open Knowledge Labs
Any public CollectionBuilder project or GitHub repository

While there are some places you may find pre-made datasets, it's often necessary to create your own multimedia dataset so that you can explore and interpret previously unexamined trends, patterns, and absences within collections. One of the differences between text analysis and collections-as-data analysis is that the latter can include materials across different media and formats using a single metadata scheme.

For any individual digital objects and data that you work with, consider copyright and license restrictions, which vary depending on where you collect the objects. For example, some images available in Cornell University Library’s digital collections have restrictions on what you can download and how you can use those images.

Creating your own dataset

Before you can begin to create your own dataset, you need to understand metadata. Metadata is the backbone that drives search and discovery across many public-facing digital collections and exhibits. Metadata is what distinguishes digital objects from one another and is what the machine can read to find patterns across digital objects and collections. If you do not have a strong grasp of metadata, please see the understanding metadata section before continuing.

Metadata for digital projects is often stored in spreadsheets, though it can be stored and hosted anywhere that maintains its regularity and standards, that lets it be downloaded or read directly in a machine-readable format, and that lets it be easily tied to the digital object it describes. Spreadsheets are a way to create tabular data, a data type that most software can parse. Many spreadsheet tools are also free or open-source, freely hosted or usable without hosting, and accessible, making spreadsheets perhaps the best option for putting your datasets together.
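To illustrate what "tabular data that software can parse" looks like in practice, here is a minimal sketch using Python's standard csv module. The sample rows and the field names (identifier, title, format) are hypothetical and are not taken from any particular collection.

```python
import csv
import io

# Hypothetical sample: three rows of metadata exported as CSV text.
SAMPLE_CSV = """identifier,title,format
item001,Harbor at dusk,image
item002,Oral history interview,audio
item003,Field notes,text
"""

def read_metadata(csv_text):
    """Parse CSV text into a list of dicts, one per row, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

records = read_metadata(SAMPLE_CSV)
for record in records:
    print(record["identifier"], "->", record["format"])
```

Each row becomes a dictionary whose keys are the field names in the first row, which is why a consistent header row matters.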

Picking a Spreadsheet Tool

Which spreadsheet tool you use should depend on a few factors: reliability, ease of use, and budget. Some tried-and-true spreadsheet tools are:

The tool doesn't particularly matter, as long as it can export files in formats commonly used for computational work, like CSV and TSV. If the program you want to run your dataset through uses a file format that your spreadsheet tool does not export to, you'll have to use an external file converter like CloudConvert, which is not always easy or secure.
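For the common case of moving between CSV and TSV, a short local script can stand in for an online converter. The following is a minimal sketch using only Python's standard library; the function name csv_to_tsv and the sample row are illustrative assumptions.

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Convert comma-separated text to tab-separated text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    # Write the same rows back out with a tab as the delimiter.
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()

# A quoted value containing a comma survives the conversion intact.
print(csv_to_tsv('identifier,title\nitem001,"Harbor, at dusk"\n'))
```

Because the csv module handles quoting, values that contain commas are read as single fields rather than being split apart.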

Filling in your spreadsheet

In the Understanding Metadata section, we discussed how to devise a metadata scheme. If you haven't already, go see that section first so you know what metadata fields you'll be adding to your sheet.

Below is the basic structure of a metadata spreadsheet. The first row is a special row that lists all of your fields across the top of the spreadsheet. Each row after that (rows 2, 3, 4, etc.) is for an individual thing or piece of data. The first column (column A on most spreadsheets) holds the unique identifier for each object, with the name of the identifier field in cell A1. Each column is a different field.

identifier	field 2	field 3	field 4	field 5
thing1
thing2
thing3
thing4
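Since the identifier column only works if every value in it is unique, it can help to check for repeats before loading the sheet into a tool. This is a minimal sketch; the function name and the default field name "identifier" are assumptions that mirror the structure above.

```python
from collections import Counter

def find_duplicate_identifiers(rows, id_field="identifier"):
    """Return a sorted list of identifier values that appear more than once."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

# Hypothetical rows, as produced by reading a spreadsheet into dicts.
rows = [
    {"identifier": "thing1"},
    {"identifier": "thing2"},
    {"identifier": "thing1"},  # accidental repeat
]
print(find_duplicate_identifiers(rows))
```

An empty list means every identifier is unique.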

For an example of how this works, see the sample spreadsheet below. Row 1 is for my metadata fields. Each metadata field then has its own column (A, B, C, D, etc.) and each thing or piece of data has its own row (2, 3, 4, 5, etc.).

Green-highlighted fields are those required by the software I want to put my data into. Most software and tools have required fields that you must include for the tool to process your spreadsheet properly. It's easiest to look these up before you create your metadata spreadsheet, but you can always add them in later.
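A check for required fields can be sketched in a few lines. The required field names below are hypothetical; the actual list depends entirely on the tool you are using, so consult its documentation.

```python
def missing_required_fields(header, required):
    """Return the required field names that are absent from the header row.

    Comparison ignores case and surrounding whitespace.
    """
    present = {name.strip().lower() for name in header}
    return [name for name in required if name.lower() not in present]

# Hypothetical header row and hypothetical required fields.
header = ["identifier", "Title", "date"]
required = ["identifier", "title", "format"]
print(missing_required_fields(header, required))
```

Any names this returns are columns to add to the top row before uploading.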

To sum up, the basic steps for filling out your spreadsheet are:

List out your metadata fields across the top row of the spreadsheet. Make sure to check whether the software/tool you want to put your data into has any required fields.
In the column that has the identifier field, list the identifier for each item, one per row, down the column (A2, A3, A4, etc.). You’ll be creating a new row for each item.
Fill out the rest of the values for each item, according to the field that’s in each column.
Download your file in whatever format your software/tool needs. Most collections tools, for example, accept comma-separated values (.csv) files. A good spreadsheet tool will be able to download and share your metadata spreadsheet in a bunch of different file formats.
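The same steps can also be carried out in code rather than in a spreadsheet application. The following is a minimal sketch using Python's csv module; the field names and items are hypothetical examples, not a recommended scheme.

```python
import csv
import io

# Step 1: the metadata fields for the top row (hypothetical scheme).
FIELDS = ["identifier", "title", "date", "format"]

# Steps 2 and 3: one dict per item, starting with its identifier.
ITEMS = [
    {"identifier": "item001", "title": "Harbor at dusk", "date": "1921", "format": "image"},
    {"identifier": "item002", "title": "Oral history interview", "date": "1978", "format": "audio"},
    {"identifier": "item003", "title": "Field notes", "format": "text"},  # date unknown
]

def build_csv(fields, items):
    """Return CSV text with a header row followed by one row per item."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)  # fields missing from an item are left blank
    return out.getvalue()

# Step 4: produce the CSV text.
csv_text = build_csv(FIELDS, ITEMS)
print(csv_text)
```

To save the result as a file, write csv_text to a path of your choosing, for example with open("metadata.csv", "w", newline="", encoding="utf-8"); the file name here is only a placeholder.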