Philip Guo (Phil Guo, Philip J. Guo, Philip Jia Guo, pgbovine)

Ph.D. Dissertation Summary

I try to answer the question, “What was your Ph.D. dissertation topic?” In short, I created five new tools to help people who write computer programs to extract insights from data.

In April 2012, I defended my Ph.D. dissertation, "Software Tools to Facilitate Research Programming." Here I'll summarize what it was about.

(You can also read The Ph.D. Grind to learn how I came up with my dissertation topic and carried out the research behind it.)

So, what did I do for my Ph.D. dissertation?

One-sentence answer

I created five new tools to help people who write computer programs to extract insights from data.

One-paragraph answer

Tens of millions of people in fields such as science, engineering, business, finance, public policy, and journalism write computer programs to extract insights from data. By some estimates, these people far outnumber professional software engineers, yet few researchers have investigated the unique kinds of problems they face while programming. My Ph.D. dissertation describes a few technical challenges that these people often encounter and presents five new tools to address those challenges.

My official thesis statement

My thesis is that by understanding the unique challenges faced during research programming, it becomes possible to apply techniques from dynamic program analysis, mixed-initiative recommendation systems, and OS-level tracing to make research programmers more productive.

Overview of tools that I built

  1. Proactive Wrangler is an interactive graphical tool that makes semi-automated suggestions to help people reformat and clean their data prior to analysis.

  2. IncPy is a custom Python interpreter that speeds up the data analysis scripting cycle and helps programmers manage code and data dependencies. IncPy is the first attempt to integrate automatic memoization and persistent dependency management into a general-purpose programming language.

  3. SlopPy is a custom Python interpreter that automatically makes existing scripts error-tolerant, thereby also speeding up the data analysis scripting cycle. SlopPy supports fail-soft semantics, tracking provenance of code and data errors, and incremental re-processing of error-inducing records.

  4. Burrito is a Linux-based activity monitoring and in-context note-taking tool that helps people organize, annotate, and recall past insights about their experiments. It combines OS-level provenance tracking with HCI techniques from personal information management.

  5. CDE is a software packaging tool that makes it easy to deploy, archive, and share experimental code. CDE eliminates the problems of “dependency hell” for a large class of Linux-based software. So far, over 10,000 people have downloaded and used it in a variety of settings. It's my favorite project from my dissertation :)
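
To make two of the ideas above more concrete, here is a rough Python sketch of on-disk memoization (the idea behind IncPy) and fail-soft record processing (the idea behind SlopPy). This is not how either tool is implemented: both are custom interpreters that do this work automatically, and IncPy also tracks code and data dependencies, which this sketch ignores. The names memoize_to_disk and fail_soft_map are hypothetical helpers made up for illustration.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def memoize_to_disk(cache_dir):
    """Hypothetical decorator: cache results on disk, keyed by function name and arguments.

    Assumes positional, picklable arguments and a function with no side effects.
    Unlike IncPy, it does not notice when the function's code or input files change.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            key = hashlib.sha256(pickle.dumps((func.__name__, args))).hexdigest()
            path = os.path.join(cache_dir, key + ".pickle")
            if os.path.exists(path):
                # Cache hit: reuse the saved result instead of re-running the function.
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator


def fail_soft_map(func, records):
    """Hypothetical helper: apply func to every record and keep going on errors.

    Returns (results, errors). Each error entry records the index, the offending
    record, and the exception text, so the bad records can be re-processed later.
    """
    results, errors = [], []
    for index, record in enumerate(records):
        try:
            results.append(func(record))
        except Exception as exc:
            errors.append((index, record, repr(exc)))
    return results, errors
```

With helpers like these, a script could decorate a slow parsing function with memoize_to_disk(tempfile.mkdtemp()) so that re-running the script skips work already done, and could call fail_soft_map(int, lines) so that one malformed line does not crash an hour-long run.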

Created: 2013-06-18
Last modified: 2013-06-20