Philip Guo

First impressions of the IPython Notebook

I've been using the IPython Notebook for less than a week, and I can already tell that there's no way I'm going back to my former data analysis workflow. Here are my first impressions.

My former workflow

Before discovering the IPython Notebook, I analyzed data with Python for many years using the following workflow.

I first create a new directory for each experiment and then enter the following loop:

  1. Write a Python script to do a particular analysis.

  2. Execute that script, which produces either textual output to the terminal or a diagram such as a bar graph. If the output looks reasonable, I save it to a file.

  3. Inspect the output, maybe show some colleagues, and then take notes in a text file to reflect on my analysis results.

  4. Think about how to adjust my Python script for the next round of analysis, and then loop back to Step 1.

This cycle repeats dozens of times per day, and thousands of times throughout the course of a project. My experiment directory gets littered with hundreds of files containing:

  • Python scripts,

  • outputs of those scripts, such as text files captured from terminal output and image files containing diagrams,

  • and notes taken to reflect on my analysis results.

There are no semantic links between those files, so I can't easily tell, say, which output diagram was produced by which script (running with which set of command-line arguments), or which part of a particular note refers to which output.

The best I can do to impose some order amidst this chaos is to give each file a sensible name and organize all files into a well-groomed directory hierarchy. Attempting to do so either leads to cryptic long filenames such as the ones shown in this monstrosity of a screenshot:

[Screenshot: an experiment directory crammed with long, cryptic filenames]

or I just give up and name my files something useless like big_awesome_graph.png.

The extreme overhead of organizing my code, output, and notes takes time away from doing real work.

Data analysis with the IPython Notebook

Here is how I now do data analysis with the IPython Notebook.

I first create a new notebook file for each experiment and then enter the following loop:

  1. Write Python analysis code in a cell (code block) within the notebook.

  2. Execute that cell, which produces either textual output or a diagram such as a bar graph. Both kinds of output are displayed directly in the notebook (see the sketch right after this list).

  3. Inspect the output, maybe show some colleagues, and then take notes directly beneath that output in the notebook.

  4. Think about how to adjust my analysis code for the next round, start a new cell in the notebook (or modify an existing cell), and then loop back to Step 1.
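
To make this concrete, here is a minimal sketch of what a single analysis cell might look like (the condition names and numbers are made up purely for illustration); with inline plotting enabled, the bar graph appears directly beneath the cell:

import matplotlib.pyplot as plt

condition_names = ['baseline', 'variant_a', 'variant_b']
mean_runtimes = [12.4, 9.8, 7.1]  # hypothetical results, in seconds per trial

# Textual output shows up right below the cell ...
print('Mean runtimes:', dict(zip(condition_names, mean_runtimes)))

# ... and so does the figure when inline plotting is turned on.
plt.bar(range(len(mean_runtimes)), mean_runtimes)
plt.xticks(range(len(condition_names)), condition_names)
plt.ylabel('mean runtime (seconds)')
plt.title('Mean runtime by condition')
plt.show()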

This workflow looks eerily similar to my original one, with one seemingly superficial but massively important difference: Everything related to my analysis is located in one unified place. Instead of saving dozens or hundreds of code, output, and notes files for an experiment, everything is bundled together in a single file.

The IPython Notebook drastically reduces the overhead of organizing code, output, and notes files, which allows me to spend more time doing real work.

I no longer have to annoy myself by creating files with names like analysis_Y_results.alpha_200.beta_500.x_mode.png and then trying to remember the exact conditions that produced each one. If I want to see which code produced a particular output image, all I do is look at the cell right above the image in the notebook.

Also, if I show an output diagram to my colleagues and they give me a suggestion, I can write it down as a note right below that diagram rather than putting it in a totally separate note file.

Details

Setup: I got IPython Notebook up and running very quickly thanks to the wonderful Enthought Canopy distribution. I just downloaded and installed the distribution, launched the Canopy IDE, and created a new IPython Notebook within there.

Update in Dec 2013: I found that the Canopy IDE on Mac is a bit laggy. What works better instead is installing Canopy as usual and then launching IPython Notebook directly in the Web browser by switching to my experiment directory and running:

ipython notebook --pylab=inline

Notebook organization: I typically keep one notebook file for each experiment, structured in the following way:

  • Title of experiment
  • Summary of major findings (which I update as the experiment progresses)
  • Setup code such as module imports
  • Code cell for first analysis
  • Output of first analysis
  • Notes reflecting on first analysis and jotting down ideas for what I should try to analyze next
  • Code cell for second analysis
  • Output of second analysis
  • Notes reflecting on second analysis
  • ...

I keep repeating the pattern of Code -> Output -> Notes for subsequent rounds of analysis until my notebook gets too big, and then I usually start a new notebook file.

Separating analysis and plotting cells: One useful idiom is to keep separate cells for analysis and plotting code, especially if the analysis takes a long time to run. The basic idea is to assign the results of an analysis to a global variable, and then, in a separate cell, parse the contents of that global variable to generate graphs. That way, you can experiment with many different types of plots without re-executing the (long-running) analysis code.
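
As a rough sketch of that idiom (the analysis function and variable names here are invented for illustration), the first cell does the slow work once and leaves the result in a global variable, and the second cell only reads that global:

# --- Cell 1: long-running analysis; execute this once ---
import random

def run_expensive_analysis(num_trials):
    # Stand-in for the real analysis, which might take minutes to finish.
    return [random.gauss(10, 2) for _ in range(num_trials)]

analysis_results = run_expensive_analysis(100000)  # global shared with later cells

# --- Cell 2: plotting only; tweak and re-run this freely ---
import matplotlib.pyplot as plt

plt.hist(analysis_results, bins=50)
plt.xlabel('measured value')
plt.ylabel('count')
plt.title('Distribution of analysis results')
plt.show()

Re-running only the second cell lets me try different bin counts, labels, or plot types without paying for the expensive analysis again.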

Refactoring into Python modules: The IPython Notebook is great for writing many small snippets of analysis and plotting code, but it's not a full replacement for traditional Python source files. Thus, after prototyping in the notebook, I usually refactor some of my code into individual Python files so that I can edit more comfortably using my preferred text editor. I then import those modules in the topmost code cell of each notebook.
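
As a tiny example (data_loading.py and plot_helpers.py are hypothetical file names that would sit in the same directory as the notebook), the topmost cell then shrinks to just a few imports:

# Topmost cell: import code that was refactored out of the notebook
# into ordinary Python files next to it (hypothetical module names).
import data_loading
import plot_helpers

One caveat: a plain import won't pick up later edits to those files until the kernel is restarted or the module is explicitly reloaded.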

Beware of global namespace pollution: Because all cells in a notebook execute in a single global namespace within one Python process, all top-level variables are shared across cells. While this is a major convenience most of the time, it can also lead to problems if you're reusing the same global variable in multiple cells and forget which cell was most recently executed. The subtlest of these bugs involve temporary variables defined within loops, since those variables are still globally scoped and persist even after the loop exits. When in doubt, just restart your IPython kernel and re-execute all of your cells from scratch.
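
Here is a contrived two-cell sketch of how that kind of leak can bite:

# --- Cell 1 ---
totals = []
for x in [1, 2, 3]:
    totals.append(x * 10)
# The loop is finished, but x is still defined (x == 3) in the shared global namespace.

# --- Cell 2, written later ---
# This only "works" because x leaked from Cell 1; instead of raising a NameError,
# it silently uses the stale value 3.
scaled = [x * value for value in totals]
print(scaled)  # [30, 60, 90] -- probably not what was intended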

Wish list: Here are some potential features that would improve the user experience of this already wonderful tool:

  • A more salient visual indicator of which cells have already executed, and in which order.

  • A more salient indicator of which global variables are currently in scope, and when they were last updated (e.g., by which cell).

  • To avoid global namespace pollution, add a toggle option to delete all global variables within a cell after it finishes executing, except for special variables marked as “persistent.”

  • Display timestamps for creation and most recent edit times for each cell, and also the last time each code cell was executed. Think of each cell as a “mini file” with its own metadata.

  • Dependency tracking both between cells and with external code and data files, so that I can tell exactly when each cell output becomes stale and needs to be re-generated.

  • A “quick diff” mode where I can fork off a single cell beside the current version, update its code to try out a new idea, and then execute and see the results side-by-side with the old results to do a quick comparison.

  • More detailed undo/redo history for each individual cell. Imagine a simple slider-based UI where you can slide back and forth to see the cell's former contents.

Created: 2013-07-25
Last modified: 2013-12-02