Philip Guo (Phil Guo, Philip J. Guo, Philip Jia Guo, pgbovine)

Tips for performing computationally-based experimental research

In this article, I describe tips for researchers, especially Ph.D. students, who perform any type of research that involves using computers to run experiments. I aim to offer concrete tips to help make the daily grind of performing computationally-based experimental research more productive, effective, and enjoyable.

In December 2017, I dug up this incomplete article draft from my old documents archive. I wrote this draft at the start of my 3rd year of Ph.D. nearly a decade ago. I think the overall principles remain relevant today, even if some of the technical details are old-ish. If I were writing this article today, I'd talk more about using Jupyter Notebooks, the modern R tidyverse ecosystem, GitHub for version control, and Dropbox as an always-on backup and version control system, but those technologies weren't in common use in 2008.

On a meta note, I found it striking that so early on in grad school, I was already reflecting heavily on the challenges of research programming that I was personally facing. These struggles would directly inform the design of the five tools that eventually formed my Ph.D. dissertation a few years later.

Finally, a lot of this is relevant to data science, except of course that term wasn't in common use back in 2008!

[incomplete draft]

Created: 2008-09-22
Last updated: 2009-03-19

TODO: link to more concrete examples (give at least one for each bullet point)

Motivation for this article

In many areas of modern science and engineering, researchers who are not trained as computer scientists or programmers find themselves having to use computer-based tools to perform their research. While they are often experts in their respective fields, they are not experts at computer systems or programming (nor should anyone expect them to be), so they often do not know about the most effective ways to leverage the power of computers to aid in their research. This article describes a collection of tips I have accumulated from doing computationally-based experimental research over the past few years, which aims to help other researchers make better use of computational resources in their work.

While doing web searches for related articles prior to writing this one, I actually found none that provided the type of advice that I wanted to present. I found many online advice guides for how to excel as a Ph.D. student or researcher, but those mostly focused on high-level 'career advice' such as how to develop research ideas, apply for grants, publish papers, collaborate in teams, communicate and sell projects, etc. This article is definitely not one of these 'Ph.D. student advice' guides (it would be presumptuous and non-credible for me to write such a guide since I'm still in the midst of my Ph.D.). Instead, I aim to focus solely on helping people optimize aspects of the daily grind spent sitting in front of the computer.

Taking notes

Experimental research involves lots of trial-and-error --- exploration of paths that are mostly unfruitful. Thus, it's important to document lessons learned during trials in order to maximize chances of success in the future and to reduce the probability of repeated mistakes. Even more important than writing down notes is the ability to find the relevant notes later when you need them.

  • You can use whatever software you'd like to take notes, but make sure that it can save your notes in files that you can easily back up (more on backing up your files later [TODO: link to section]). I store most of my notes in plain text files, and I put some notes online in a wiki format. You can use word processors, spreadsheets, or other more specialized note-taking software as well.

  • Develop an organized note-taking system with multiple locations for your notes. Don't just have one giant file where you cram all of your research notes, because you will either:

    • write too much and bury the significant insights amidst daily drudgery

    • write too little, failing to document enough of what you do for fear of burying the significant insights

  • I keep a todo list in a text file; whenever I want to queue up a task that I need to do for work but don't have time to get around to it immediately, I add an entry in my todo list.

  • I keep a work log in a text file; at the end of each day I append an entry with a few bullet points, each briefly describing one concrete task I accomplished at work that day.

  • I keep a wiki where I write more permanent research-related information such as experimental procedure instructions and high-level insights from results. A wiki format allows me to have slightly more structure than text files, to add hyperlinks between wiki pages (and to other pages on the Web), and to share my results with others.

  • I like taking more detailed notes in locations nearest to where they will be most useful:

    • I write notes on specific experiments in text files within the directories holding the results of those experiments.

    • I write notes directly relevant to specific programs or scripts as comments at the top of their source code files (where they are most easily read upon first opening the file).

  • I carry around a physical research notebook to make hand-drawn sketches and to take informal notes. I simply append entries by date to the notebook and transfer my notes to electronic form only when they seem useful enough. Most entries in my notebook never get transferred anywhere else (or even read again), but the act of sketching helps me to develop and refine my ideas.

  • Don't over-organize. Do something reasonable that works for you, but don't spend too much time agonizing over optimizing your note-taking habits. The point is to have your notes benefit you, not to overwhelm you with maintaining and organizing them. Unfortunately, most of your notes won't be useful directly, so don't go overboard in making them perfect.

  • Try to write down what trials did not work as well as what worked; this might be difficult to do (due to the sheer amount of things that simply don't work in research), but it will help you avoid repeating the same mistakes again.
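As a concrete illustration, the work-log habit described above takes only a few lines to automate. This is just a sketch; the filename and entry format here are arbitrary choices, not a prescription:

```python
from datetime import date

def append_log_entry(path, bullets):
    """Append today's date plus one line per accomplished task
    to a plain-text work log."""
    with open(path, "a") as f:
        f.write("\n== %s ==\n" % date.today().isoformat())
        for item in bullets:
            f.write("  * %s\n" % item)

# Example usage (the filename is just an example):
# append_log_entry("worklog.txt", ["ran experiment 42", "fixed parser bug"])
```

Because the log is plain text, it stays easy to search, back up, and read years later.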

Automating routine tasks

A sight that greatly frustrates me is seeing my friends who are extremely smart people spend lots of time doing repetitive grungy tasks on the computer, especially when grinding on research. I know that their time is far more valuable than the computer's time. Computers are great at performing well-specified, boring, repetitive tasks, and don't grow jaded from the drudgery. Humans are meant to do smart things, and computers are meant to do dumb things.

If you find yourself repeating the same tasks on the computer often enough, it's time to seriously consider learning how to write programs (commonly called scripts) to automate those tasks. If you aren't sure whether your particular task can be automated, chances are that it can, so ask the nearest computer expert around you about how to do it.

  • The up-front fixed cost of learning to write scripts to automate tasks is more than worth it in the long run. You can learn incrementally on an as-needed basis. You don't need to master a particular programming language to begin writing useful small scripts using it.

    • The excuse of 'oh well I'll tough it out and do this by hand because it will take me too much effort to learn to write a script' isn't gonna cut it!

    • Learn to automate ASAP; those skills will reap dividends many-fold the longer you stay in research.

  • Keep actively learning to improve your scripting skills. Read your colleagues' scripts to learn new tips and tricks appropriate for your particular area of research or the tools that you employ.

  • Research is all about exploring down different paths, so the easier it is for you to pursue a new path, the more likely you are to actually pursue it. Automation lowers the barrier to exploring new paths, which in the long run will yield better results (exploring many paths + gaining insights from failed trials = better results).

  • Don't be afraid of your scripts being a bit messy at first, because you can often hack up something quick-and-dirty much faster than something clean and pristine.

    • In research (unlike in, say, professional software development), the exact requirements or specifications for scripts are not particularly clear up-front, since by definition, you don't precisely know which path you are exploring.

    • Therefore, it is often futile to design your scripts in a monolithic, top-down manner.

    • Instead, what works far better is to implement some basic functionality quickly, actually use the script in your experiments, look at the results, and use the insights you gain to add more functionality on an as-needed basis.

  • Don't try to over-engineer your scripts. Stay agile and lightweight, implementing simple functionality first and then additional functionality on an as-needed basis. You are not in the business of engineering a consumer product filled with lots of neat features; your scripts are meant to facilitate your research, so don't over-engineer them! Resist the temptation to make your script cooler or more feature-filled, because nobody cares about your script; people only care about the results you get from your research.

  • However, learn to recognize when it becomes worthwhile to refactor your code so that it is easier to extend later. Refactoring is the process by which you make your code 'cleaner' while preserving functionality identical to that of the original code. Refactoring is usually only worthwhile for scripts that you actually want to maintain. Here is my refactoring methodology:

    • Before refactoring, run some representative test cases to generate a few outputs (e.g., in the form of text files) and store them in a directory to use as regression tests. These outputs represent the behavior of your code before refactoring.

    • Now begin refactoring your code bit-by-bit, and occasionally re-run your refactored program on your original representative test cases. Compare these outputs against your original regression versions and make sure that nothing has changed. You can automate these comparisons using, say, textual diffs. Get in the habit of constantly running refactored versions of your code on your regression test suite to give you confidence that the changes you are making are not altering the behavior of your script.

  • Instead of writing a complicated, long-running script to process your whole dataset, can you first do something simpler to just process a few selected samples? Be wary when you have to do a lot of work before you start seeing results. There are many more dead-end roads than fruitful roads in research, and the sooner you can see that your current path is not fruitful, the less time you'll waste.

  • Learn to debug your scripts effectively. You will inevitably run into bugs, and if you don't know how to debug well, then you will be fuming in frustration! Inserting print statements is a fine starting point, but you should eventually learn to use a debugger if one is available for your language/environment of choice.
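The regression-testing habit described above can be sketched in a few lines of Python: after re-running your refactored script, diff its fresh outputs against the saved pre-refactoring 'golden' versions. The file paths in the usage comment are hypothetical:

```python
import difflib

def check_regression(new_path, golden_path):
    """Return a (possibly empty) unified diff between a freshly
    generated output file and its saved 'golden' version."""
    with open(new_path) as f:
        new_lines = f.readlines()
    with open(golden_path) as f:
        golden_lines = f.readlines()
    return list(difflib.unified_diff(golden_lines, new_lines,
                                     fromfile=golden_path, tofile=new_path))

# An empty diff means the refactoring preserved behavior, e.g.:
# assert check_regression("out/run1.txt", "golden/run1.txt") == []
```

Running this comparison after every small refactoring step gives you constant feedback that behavior has not changed.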

Long-running computational jobs

  • Figure out whether you can run long-running jobs on machines other than your own, preferably in parallel on several machines. Machines are super cheap compared to your time! It might be worth it for you to bug your advisor to spend $500 buying you a new computer (no monitor/keyboard/peripherals needed) if it will improve your efficiency. 4 machines only cost $2000, which is like your monthly salary!

  • Learn to run several jobs overnight so that you can have fresh results when you come in in the morning; a good rule to follow is that your machines should always be cranking to produce new results for you. Always keep them busy, even overnight; that's their job!

  • When you're running a long-lasting computational task, make sure you can see intermediate results (e.g., output to a text file) so that you can monitor its progress and make sure everything is going according to plan.

  • Even better, plan for your long-running tasks to be easily restartable in case something goes wrong in the middle; structure your task such that you don't have to start all over again but rather can resume where you left off, with as little wasted work as possible. This might not be such a big deal on an overnight run, but repeating a week-long run from scratch can be painful!

  • When writing these programs, consider preferring fail-soft over fail-fast software design --- i.e., prefer getting incomplete results over crashing with no results. For instance, when a particular part of the input is invalid, skip it and log an error message rather than crashing the entire application.
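The restartable, fail-soft pattern above can be sketched in a few lines of Python. Everything here (the `process` callback, the `done.txt` and `errors.log` filenames) is a hypothetical stand-in for your own task:

```python
import os

def run_all(inputs, process, done_path="done.txt", log_path="errors.log"):
    """Process each named input exactly once, surviving both restarts
    and bad inputs."""
    done = set()
    if os.path.exists(done_path):
        with open(done_path) as f:
            done = set(line.strip() for line in f)
    for name in inputs:
        if name in done:
            continue  # already handled in an earlier run; skip it
        try:
            process(name)
        except Exception as e:
            # fail-soft: log the error and move on instead of crashing
            with open(log_path, "a") as log:
                log.write("%s: %s\n" % (name, e))
        # record the input as handled (even on error) so that a
        # restart resumes right where this run left off
        with open(done_path, "a") as f:
            f.write(name + "\n")
```

On a week-long run, the `done.txt` bookkeeping means a crash halfway through costs you hours rather than the whole week.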

Organizing experimental data

  • Organize your results directories in some systematic way, and more importantly, record enough meta-data within those directories such that you can reproduce those results later and know what parameters went into creating those results.

  • A directory of results is useless without the accompanying information about what you used to generate those results. This is especially important in research because research usually involves tweaking lots of parameters (knobs), so you need to precisely record what settings you used for those knobs.

  • Get comfortable with storing your results in various file formats, especially intermediate results of long-running computations:

    • simplest format: plain text (can be easily compressed)

    • trees: XML, serialized data structures (JSON, Python pickle, Java serialization, etc.)

    • tables: comma-separated values (.csv), relational databases (MySQL, SQLite)
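One way to record those knob settings, sketched below, is to write the parameter dictionary as JSON into the results directory itself, right next to the data it produced. The directory layout, parameter names, and values are all made up for illustration:

```python
import json
import os

def save_results(out_dir, params, rows):
    """Write results as a CSV file plus a params.json that records
    every knob setting used to generate them."""
    os.makedirs(out_dir)
    with open(os.path.join(out_dir, "params.json"), "w") as f:
        json.dump(params, f, indent=2, sort_keys=True)
    with open(os.path.join(out_dir, "results.csv"), "w") as f:
        for row in rows:
            f.write(",".join(str(x) for x in row) + "\n")

# Hypothetical usage:
# save_results("results/2009-03-19-run1",
#              {"threshold": 0.5, "iterations": 100},
#              [(1, 0.92), (2, 0.87)])
```

Months later, the params.json file tells you exactly how to reproduce the run, with no guessing.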

Visualizing data

  • Figure out easy and effective ways to visualize and view your data. Plain text is oftentimes not ideal. For richer formats, consider HTML or Excel tables, or graphs and other visualizations. Being able to get an intuition about complex datasets is often key in guiding your research; learn more about data visualization and how to integrate it into your workflow.

    • An advantage of using HTML (perhaps made dynamic using JavaScript) for making simple data tables and visualizations is that it's quite easy to share your results with your colleagues. After all, everyone has a web browser!

    • Another advantage of HTML is the ability to place hyperlinks between pages of data.

    • Of course, you will most likely want to write scripts to generate your visualizations, since you'll likely tweak your experiments many times and re-graph the results.
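Generating such an HTML table from a script takes only a few lines; this sketch assumes your data is a simple header plus a list of rows:

```python
def rows_to_html(header, rows):
    """Render a header row plus data rows as a minimal HTML table
    that any web browser can display."""
    parts = ["<table border='1'>"]
    parts.append("<tr>" + "".join("<th>%s</th>" % h for h in header) + "</tr>")
    for row in rows:
        parts.append("<tr>" + "".join("<td>%s</td>" % c for c in row) + "</tr>")
    parts.append("</table>")
    return "\n".join(parts)

# Hypothetical usage: write the table to a file you can email around.
# open("results.html", "w").write(
#     rows_to_html(["trial", "accuracy"], [(1, 0.92), (2, 0.87)]))
```

Because the output is a single self-contained file, re-generating it after every experiment tweak is painless.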

Asking your local computer experts for help

  • Strive to find the best software tools available to do what you want to do. Unfortunately, it can be hard to know what you're looking for if you don't have much experience, so the best way is to find a friend who is a computer scientist, talk to them about the computational challenges you're facing, and ask for their recommendations.

    • If the desired tools are commercial, then talk your advisor into buying you a license (maybe academics get a discount).

    • Fortunately, there is a plethora of free and open-source tools for research and engineering available on UNIX-based platforms, which is why Linux and Mac OS are so attractive for researchers.

  • Ask people for help when you don't know how to do something. Recognize that there is probably a better way to do things than you are currently doing, and try to ask people for technical tips.

    • Lots of grinding knowledge is implicit, so people don't write it down.

    • People can be far more helpful than Google searches.

    • Learn to formulate your questions precisely in order to elicit the most effective responses.

Optimizing your use of time

  • Minimize your edit-interpretResults-debugAndRefine cycle; make it so short that you can quickly make many tight iterative loops. The more times you can try, the more you can learn from your mistakes and the more insights you will gain.

  • Recognize your own inefficiencies and always strive to make yourself more efficient, up to a point only, of course. It's also bad to become obsessed with optimizing your workflow to the point where you're spending more time optimizing than actually working ;)

  • Monitor how you spend your time on the computer during the course of your work days and strive to optimize what you spend the most time doing. E.g., if you edit text most frequently (programming, writing, editing), then learn to use a text editor DAMN WELL. If you use spreadsheets most frequently, then learn to use your spreadsheet program DAMN WELL. Optimize where you can get the most gains.

  • You should spend as much of your time as possible doing things that require your intellect, and minimize the amount of time spent wrestling with your computer and tools; all of that wrestling time is not only wasted but also demoralizing. Sure, you'll still have to do grunge work to build and debug your tools, but those steps cannot be automated, so you are still working at 100% efficiency.

  • Multiple monitors - screen real estate is key for grinding! Ask your advisor to buy a second monitor. It's worth it!

  • Multiple virtual desktops - you can keep email and other distractions on a separate desktop, which you switch to only occasionally when you want to procrastinate; keep each realm separate ... see no evil, do no evil.

Data backup and version control

  • Back up your data, because nothing is more frustrating than losing the fruits of your hard grinding labors! Disk space is cheap.

  • TODO: Talk about version control repository for code and even notes - more heavyweight than backups so use them less often

    • VERSION CONTROL FOR YOUR CODE IS ABSOLUTELY NECESSARY!!! (... even for solo work!!! link to my GIT article)

  • TODO: ask to see your friends' or colleagues' shell configuration files (.bashrc, .bash_profile), text editor customization files (.emacs, .vimrc), or common scripts they use to speed up their work. Lots of this knowledge consists of 'tricks' that you pick up along the way and that are hard to learn in a vacuum; it's far better to pick up bits of other people's experience, because their .vimrc files and such contain what already works for them.

Example of a computational tool suite

These are computational tools that I often use in my own research.

  • UNIX command-line utilities and programs for computationally-intensive tasks

  • Python for analysis and glue code

  • R for numerical processing and data visualization

  • HTML and JavaScript for creating and sharing tabular data

  • GIT for version control

  • Unison for data backup

Last modified: 2017-12-29