
10th anniversary of coming up with my first independent research project idea

Today is July 24, 2019, and exactly ten years ago I came up with the idea for IncPy. By that point I had been working in research labs for over six years as both an undergraduate and a graduate student, but this was the very first research project idea that I came up with on my own. I recounted that fateful moment at the end of the Intermission chapter of The Ph.D. Grind:

“And then, on July 24, 2009—halfway through my internship—inspiration suddenly struck. In the midst of writing computer programs in my MSR [Microsoft Research] office to process and analyze data, I came up with the initial spark of an idea that would eventually turn into the first project of my dissertation. I frantically scribbled down pages upon pages of notes and then called my friend Robert to make sure my thoughts weren't totally ludicrous. At the time, I had no sense of whether this idea was going to be taken seriously by the wider academic community, but at least I now had a clear direction to pursue when I returned to Stanford to begin my fourth year.”

I described the IncPy project in detail in Year Four of The Grind. (Unrelated: Robert and I now co-host a podcast.)

I don't want to get too sentimental, but IncPy launched the research career I've built over the past decade. Amongst other things, it:

  • became the first project of my Ph.D. dissertation.
  • got me talking to lots of computational researchers and data scientists (long before that term was popular) about their technical workflows, which gave me early intuitions about many of the populations that I study in my current research.
  • gave me the confidence that I could conceive of, implement, and write a research paper about my own independent project (my advisor Dawson provided great moral support and high-level feedback but was not involved hands-on in this project) ...
  • ... presenting a talk on that paper at ISSTA 2011 enabled me to meet keynote speaker John Regehr; afterward he wrote a trip-report blog post that mentioned IncPy, which blew me away: a well-known professor had paid attention to a project that an unknown grad student (me!) came up with, unrelated to any larger faculty research agenda.
  • also enabled me to meet Fernando Perez (co-creator of Jupyter) and get exposed to the challenges of performing reproducible scientific research ...
  • ... which then went on to inspire two other Ph.D. dissertation projects, CDE and Burrito, which I cover in Year Five and Year Six, respectively; both projects came directly out of what I learned from working on IncPy.
  • also enabled me to attend the TaPP 2010 workshop, where I met Margo Seltzer, whom I would later work with in Year Six and who inspired me to become a professor.
  • inspired the inception of Python Tutor, which went on to launch my current research in scalable online learning tools.
  • sparked my desire to have people actually use the research tools that I built, even though there were no direct career rewards for this in academia; only 3 strangers from the internet ended up using IncPy and telling me about it, but that was still encouraging. CDE would go on to get tens of thousands of downloads and be included in several major Linux package managers, and Python Tutor has gotten millions of users so far.
  • got me deeply interested in the pains of software dependency management and the challenges of deploying research prototype software, since it was really hard to get IncPy working on potential users' computers; these frustrations directly inspired CDE and a series of later projects (e.g., CodePilot, DS.js, Mallard, Fusion, Torta, Porta) that lower the barriers to novices getting started on programming and data science without wrestling with software environment issues.

I could go on and on ... but basically without IncPy I wouldn't have a research career at the moment. I could've probably found some way to finish up my Ph.D. by contributing to other people's research agendas, but I wouldn't have had the skills or motivation to continue doing this for a living after graduating.

Initial Project Notes

I thought it'd be fun to share my notes from that summer day ten years ago when “I frantically scribbled down pages upon pages of notes and then called my friend Robert to make sure my thoughts weren't totally ludicrous.” Looking back, I'm amazed at how clear the idea was in my head even from that inception moment; the final ISSTA paper largely followed this initial pitch. Here it is, unedited from the top of my IncPy project notes file:


Idea conceived on: 2009-07-24

Problem: People who write ad-hoc data analysis scripts (in, say, Python) often need to explicitly save intermediate results to disk in order to have their scripts run quickly when making incremental changes.

Proposed solution: Hack the Python interpreter so that it monitors how long each function executes for and then selectively memoizes the results of expensive but side-effect-free functions to disk, and then uses those cached values on subsequent runs (until the underlying data changes).
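
To make that mechanism concrete, here is a minimal sketch of function-level memoization to disk, written as an ordinary Python decorator. It only illustrates the core idea and is not how IncPy itself is implemented (IncPy does this transparently inside a modified interpreter, times each call to decide what is worth caching, and invalidates stale entries when the underlying data changes); the cache file name is made up, and the decorator blindly assumes the function it wraps is pure.

import functools
import hashlib
import pickle
import shelve

CACHE_FILE = 'memo_cache'  # illustrative on-disk store, not IncPy's actual cache format

def memoize_to_disk(func):
    """Cache a (presumed pure) function's results on disk, keyed by its arguments."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        arg_blob = pickle.dumps((args, sorted(kwargs.items())))
        key = func.__name__ + ':' + hashlib.sha1(arg_blob).hexdigest()
        with shelve.open(CACHE_FILE) as cache:
            if key in cache:
                return cache[key]           # reuse result from a previous run
            result = func(*args, **kwargs)  # expensive call happens only once
            cache[key] = result
            return result
    return wrapper

Decorating an expensive pure function with @memoize_to_disk makes repeated runs hit the shelf instead of recomputing; the point of the proposed interpreter change is to get this behavior without requiring programmers to annotate or restructure anything.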

Target user audience:

  • Bioinformatics researchers
  • Natural language processing researchers
  • Data mining researchers
  • Business data analysts
  • Scientists who do programming for research (e.g., NumPy, SciPy)
  • Anybody who has to prototype script variants while processing datasets

What this project is NOT:

  • A methodology for parallel programming
  • An attempt to speed up interpretation of Python bytecodes using a JIT (this approach can transparently be combined with a JIT)
  • A new programming language or variant

Claims of effectiveness:

  • Does not require any re-writing or re-compilation of existing code
  • Does not require scientific programmers (who are not uber-hacker-wizards) to learn a new language
  • My hypothesis is that it will work fairly well in practice since most data analysis code is written in a functional manner
  • It will REDUCE THE ITERATION TIME for scientific programmers and also make their code more maintainable since they don't need to worry about explicitly saving intermediate data

Challenges:

  • How to effectively implement the static purity analysis (since Python is a dynamically-typed language)
  • How to evaluate this so that it can be publishable
    • Perhaps this is appropriate for a scientific computing TOOLS conference (e.g., where SciPy and NumPy and friends are published)

Possible next steps:

  • Talk to bioinformatics people and other scientists who write analysis scripts in Python and see their code to see what other idioms I can exploit and what their needs are ...
    • e.g., Marc, Imran, Cory McLean

---

Some details:

Motivation: I've written lots of ad-hoc data analysis scripts with basically the following workflow:

[input file] -> parse file -> do processing -> [output result]

Now the 'do processing' step might be sophisticated, so oftentimes I want to serialize intermediate results to disk so that when I re-run my script, I don't have to process the entire base file all over again, e.g.:

[input file] -> parse file -> process 1 -> [intermediate file] -> process 2 -> [output result]

Now I might have multiple input files, multiple intermediate stages, etc., and soon this starts getting really annoying. What I would really like to do is to write a straight-up procedural Python program that does ALL the processing from input to final output and NOT HAVE TO EXPLICITLY SAVE AND RESTORE INTERMEDIATE FILES. The interpreter should be smart enough to realize when it can memoize intermediate results to disk (and later read them back). In the general case, this is a HARD problem, but I think that it's quite doable if we restrict our domain.
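
For contrast, here is roughly the manual save-and-restore boilerplate that this workflow forces on script authors today. It is a sketch only: the file names and the toy pipeline stages are made up, standing in for 'parse file', 'process 1', and 'process 2' above.

import os
import pickle

def load_or_compute(cache_path, compute):
    """Reuse an intermediate result saved by a previous run, else compute and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result

# Toy stand-ins for the pipeline stages in the diagrams above.
def parse_file(path):
    return [line.strip() for line in open(path)]

def process_1(records):
    return [r.upper() for r in records]   # pretend this is the slow step

def process_2(records):
    return len(records)

records = load_or_compute('parsed.pkl', lambda: parse_file('input.txt'))
stage1 = load_or_compute('stage1.pkl', lambda: process_1(records))
output = process_2(stage1)

Nothing here notices when 'input.txt' or the code changes, so stale cache files can silently produce wrong results; that bookkeeping is exactly what the interpreter should take off the programmer's hands.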

Let's imagine memoizing at a function level:

def foo():
  a = bar()
  b = baz()
  return a + b

def bar():
  ...  # <pure, no dependencies>

def baz():
  for line in open('data.txt'):
    ...
  return ...  # <something> computed from data.txt

When the interpreter executes 'x = foo()' for the first time, foo(), bar(), and baz() must all be run. The return values are all memoized to disk (if they had parameters, their parameter values would also be memoized). Now if I execute the program again and data.txt has changed (i.e., I updated my original dataset), then when I run 'x = foo()', it will NOT need to run bar() again since it can use the memoized value, but it WILL need to run baz() again since data.txt has changed.

(Note that it's only worth memoizing if we observe that a function takes a LONG TIME to run. How long? Maybe longer than 10 seconds? Otherwise, memoizing isn't really worthwhile and can introduce extra disk read/write overhead, which can easily dominate actual execution time.)

If I modify any function, then its result will have to be recomputed.
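
One way to picture these invalidation rules together with the timing threshold above, as a sketch only: key each cache entry on a hash of the function's source code plus the modification times of the files it reads, and only bother caching calls that ran longer than some threshold. IncPy does this inside the interpreter rather than with a decorator and persists the cache to disk; the decorator name, the threshold constant, and the in-memory dictionary standing in for an on-disk store below are all illustrative.

import functools
import hashlib
import inspect
import os
import time

MIN_SECONDS_TO_CACHE = 10   # the 'longer than 10 seconds' threshold mentioned above
_cache = {}                 # in-memory stand-in for a persistent on-disk store

def memoize_if_slow(*dep_files):
    """Cache a pure function's result; invalidate when its code or input files change."""
    def decorator(func):
        # Hash the function's source so that editing the function invalidates old entries.
        src_hash = hashlib.sha1(inspect.getsource(func).encode()).hexdigest()
        @functools.wraps(func)
        def wrapper(*args):
            # A changed input file gets a new mtime, which changes the key and forces a re-run.
            dep_stamp = tuple(os.path.getmtime(f) for f in dep_files)
            key = (func.__name__, src_hash, dep_stamp, args)
            if key in _cache:
                return _cache[key]
            start = time.time()
            result = func(*args)
            if time.time() - start > MIN_SECONDS_TO_CACHE:
                _cache[key] = result   # only cache calls that were slow enough to matter
            return result
        return wrapper
    return decorator

@memoize_if_slow('data.txt')
def baz():
    return sum(len(line) for line in open('data.txt'))

With a persistent store in place of _cache, a second run of 'x = foo()' would skip bar() but re-run baz() whenever data.txt changed, which is the behavior described above.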

This is really like using Makefiles to minimize the number of source files that need to be compiled and linked, except that we are operating at the granularity of individual Python functions. We must solve the sequence of data dependency constraints to execute a conservative over-approximation of the functions.

Note that this formulation requires the programmer to at least use the function as a level of abstraction. If he/she simply wrote all the computation together in one large function, then we can't memoize anything.

Also, note that we can't memoize unless we can prove that the function is pure. That might be tricky to do, but I suspect that lots of functions written for data analysis scripts are pure (or nearly pure).

