Introduction

Are you currently using Python or R to manage, clean, and/or analyze your data? Would you like to craft a narrative of your research process that includes a mixture of text, interactive code, and dynamic visualizations?

If so, Python and R are both excellent fits. Each language has a distinct ecosystem of data science tools that integrates well with data visualization. Furthermore, the growing trend of sharing code and research as a narrative (such as in a "Notebook") relies upon these languages.

Choosing a language: If you already use Python or R, consider sticking with the language and package ecosystem you already know. If you're new to programming, keep in mind that the two languages have very similar capabilities. R has historically been used more for statistical and quantitative analysis, while Python is a general-purpose programming language used in everything from text analysis to astronomy. Consider what languages and packages people in your discipline tend to use, but there's really not a wrong choice!

Python packages

Python is a general-purpose programming language that is used widely in the social sciences, physical sciences, digital humanities, etc.

To add data visualization functionality to your code, you must install a Python visualization package (e.g., with pip or an environment manager like Anaconda) and import it into your script or program. (Read more: installing Python packages with pip; installing an Anaconda distribution)
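
For example, installing and importing seaborn might look like the following (a minimal sketch, assuming you work from a command line with pip available; Anaconda users can install through conda instead):

    # Install the package once from the command line:
    #     pip install seaborn
    # (or, with Anaconda:  conda install seaborn)

    # Then import it at the top of your script or notebook:
    import seaborn as sns   # "sns" is the conventional alias for seaborn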

List of widely used Python visualization packages:

  • seaborn: "Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics." seaborn is an excellent choice for high-level visualization that produces aesthetically pleasing results in relatively few lines of code; see the short example after this list. (seaborn website)
  • plotly: "plotly.py is an interactive, open-source, and browser-based graphing library for Python." Like seaborn, plotly is high-level, and it adds interactivity features. (more about plotly on GitHub)
  • ggplot: ggplot is a port of the R library ggplot2 that, like seaborn, leverages the Grammar of Graphics theoretical approach to visualization (see more in Dataviz Best Practices).
  • pandas: Technically pandas is not a data visualization package so much as a fully featured data science library. pandas provides essential features like the DataFrame, a robust and well-optimized data structure built upon the NumPy library. On top of all of this, pandas has native support for data visualization (as with seaborn, pandas visualizations are built on top of the foundational matplotlib library). Consider pandas if you'd like to learn a powerful data analysis package along the way! (pandas website)
  • bokeh: "Bokeh is an interactive visualization library that targets modern web browsers for presentation." bokeh is useful for creating dashboards (similar to Tableau's dashboard functionality). (bokeh website)
  • datashader: "Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data." datashader is a more specialized tool than the ones above. You may find it helpful when working with geospatial data analysis. (more about datashader on GitHub)
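
To get a sense of how little code these high-level libraries require, here is a minimal sketch that builds a small pandas DataFrame and plots it with seaborn (the column names and values are invented for illustration):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt   # seaborn draws its figures with matplotlib

    # A tiny, made-up dataset for illustration
    df = pd.DataFrame({
        "year": [2018, 2019, 2020, 2021, 2022],
        "observations": [120, 135, 150, 160, 180],
    })

    # A single call produces a styled line plot
    sns.lineplot(data=df, x="year", y="observations")
    plt.title("Observations per year (sample data)")
    plt.show()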

R packages

R is a programming language widely used in statistical analysis and the sciences. R has served as an open-source alternative to proprietary statistical analysis packages like SPSS, SAS, and MATLAB. Over time, R has developed a broader and more robust set of features for data science and computational analysis generally, helped in particular by the tidyverse ecosystem of data science packages.

As with Python, you will need to augment R with additional packages to add data visualization support. Most users interact with R through an Integrated Development Environment (IDE) such as RStudio; additional packages can be installed via each environment's install features or with the install.packages() function. Read more: R packages: a beginner's guide

List of widely used R data visualization libraries:

  • ggplot2: "ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details." ggplot2 is the essential R data visualization package. It is integrated into the tidyverse ecosystem. (read more on the tidyverse website)
  • leaflet: "Leaflet is one of the most popular open-source JavaScript libraries for interactive maps. It’s used by websites ranging from The New York Times and The Washington Post to GitHub and Flickr, as well as GIS specialists like OpenStreetMap, Mapbox, and CartoDB. This R package makes it easy to integrate and control Leaflet maps in R." (read more via RStudio)
  • More packages via Awesome-R

Shiny for R

"Shiny is an open source R package that provides an elegant and powerful web framework for building web applications using R. Shiny helps you turn your analyses into interactive web applications without requiring HTML, CSS, or JavaScript knowledge."

To try out Shiny, here is a walkthrough for three example applications.

Jupyter Notebooks (Azure Notebooks)


Figure 1: Python code in a Jupyter Notebook with the resulting map visualization of United States population data, using the datashader library. View the Figure 1 source Notebook.

Figure 2: Introduction to Machine Learning via flower classification (distributions of sepal length across flower classes), using seaborn & plotly. View the Figure 2 source Notebook.

As you write code to generate visualizations, you may also wish to include more interactivity, transparency, and user control in your process. One way to accomplish this is to compose and share your work as a Jupyter Notebook.

A Jupyter Notebook is a single file that may include code, narrative/explanatory text (formatted as Markdown), and the outputs of running that code. You can share a notebook as an .html page, a .pdf file, or an interactive notebook that others can run and manipulate on the fly.

(If you'd like to learn more about using Notebooks for effective storytelling, read: "The Story in the Notebook: Exploratory Data Science using a Literate Programming Tool")

With a Jupyter Notebook, you can write Python code that uses packages like pandas and seaborn to generate visualizations. But unlike code run from a .py file on your computer, these visualizations render within the notebook itself. You can also run code written in other languages, including R!
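
For instance, running a notebook cell like the sketch below renders its figure directly beneath the cell. This example uses seaborn's example-data helper, which downloads a small sample dataset (the iris flower measurements shown in Figure 2); your own notebooks would load your own data:

    # Run inside a Jupyter Notebook cell; the plot appears inline below the cell
    import seaborn as sns

    # load_dataset() fetches a small example dataset (here, the classic iris
    # flower measurements used in Figure 2)
    iris = sns.load_dataset("iris")

    # Distribution of sepal length for each flower class
    sns.histplot(data=iris, x="sepal_length", hue="species")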

For Cornell students, researchers, and staff: You have several options for running Jupyter Notebooks. If you wish to run them entirely in the cloud (i.e., without downloading files to your computer, sharing them with others via URLs), we have free access to Microsoft Azure's Notebook service. This is an excellent option for sharing work with others, using code and visualization in classes and workshops, and so on. Alternatively, you can run Jupyter Notebook locally on your machine; for instance, it comes pre-installed with the Anaconda distribution. More local install information here.