Philip Guo (Phil Guo, Philip J. Guo, Philip Jia Guo, pgbovine)

Why Wrangle Data?

I wrangle data to make it easy to write fast-running analysis scripts that can answer the research questions that my colleagues and I pose.

Over the past week, I've been doing a lot of data wrangling as the first phase of my current research project. I just wrote an article called Data Wrangling with MongoDB to discuss some technical details. In this article, I want to get a bit more abstract.

Why wrangle data in the first place?

In short, I wrangle data to make it easy to write fast-running analysis scripts that can answer the research questions that my colleagues and I pose.

Let's now unpack that sentence.

Say you're a researcher with questions that can be answered by analyzing a given data set. Also, assume that the raw data you've been given contains all of the information required to answer your question (i.e., it's complete). In theory, you should be able to write a script to analyze the data and get your answer. Simple, right?

Well, here are two common problems that might arise:

  1. It takes you a long time to write and debug your script.
  2. Your script takes a long time to run.

If you haven't encountered either of those problems, then congrats! Stop reading now.

But if you're doing non-trivial analyses on non-trivial amounts of data (e.g., over a gigabyte), you'll know exactly what I mean.

The producers of the raw data set you're given probably didn't think about your particular use case when they came up with the schema. Thus, you have no choice but to wrangle the data into a format that's more amenable to your analyses. This process often involves creating a bunch of derived data sets containing a reshaped subset of the raw data, each one optimized for a particular class of analyses.
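To make "derived data set" concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the record fields (user, ts, action, payload), the helper name build_derived, and the choice to group by user. It assumes the analyses you care about ask per-user questions and only need two of the raw fields.

```python
from collections import defaultdict


def build_derived(raw_records, fields):
    """Group raw records by user, keeping only the requested fields.

    Hypothetical helper: the "user" key and the field names are
    assumptions for this sketch, not part of any real schema.
    """
    derived = defaultdict(list)
    for rec in raw_records:
        # Drop everything the analyses don't need (e.g. bulky payloads).
        derived[rec["user"]].append({f: rec[f] for f in fields})
    return dict(derived)


# Made-up raw data: one wide record per event.
raw = [
    {"user": "alice", "ts": 1, "action": "run", "payload": "large blob"},
    {"user": "bob", "ts": 2, "action": "edit", "payload": "large blob"},
    {"user": "alice", "ts": 3, "action": "edit", "payload": "large blob"},
]

# Derived data set: a reshaped subset, indexed for per-user questions.
by_user = build_derived(raw, fields=("ts", "action"))
```

With the derived structure in hand, a per-user question becomes a single lookup such as by_user["alice"] rather than a scan over every raw record. A different class of questions (say, per-action counts) would likely get its own derived data set.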

To alleviate the aforementioned pair of problems, wrangle your data in such a way that when someone comes up with a new research question:

  1. You can quickly write and debug a script to answer it.
  2. Your script terminates quickly with a result.

In short, make your iteration cycle as fast as possible, so that you can run more experiments per day.

For instance, if your colleague poses a question and your head explodes for an hour writing a script, and your script takes 4 hours to terminate, at which point you discover a bug and need to fix and re-run for another 4 hours, then you'll get only one iteration per day. That majorly sucks!

On the other hand, if you've wrangled your data in such a way that when your colleague poses a question, it takes you only 10 minutes to write a script to process the appropriate derived data set, and your script takes only 3 minutes to run, then you can get at least 4 iterations per hour. That's over 25 iterations per day.
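The arithmetic behind those two scenarios can be sketched in a few lines of Python. The function name cycles_in and the 8-hour workday are assumptions for illustration; one cycle here means writing the script once and running it once.

```python
def cycles_in(budget_min, write_min, run_min):
    """Number of complete write-then-run cycles that fit in a time budget."""
    return budget_min // (write_min + run_min)


WORKDAY_MIN = 8 * 60  # assumption: an 8-hour working day

# Unwrangled data: 1 hour to write, 4 hours to run.
slow_per_day = cycles_in(WORKDAY_MIN, write_min=60, run_min=240)

# Wrangled data: 10 minutes to write, 3 minutes to run.
fast_per_hour = cycles_in(60, write_min=10, run_min=3)
fast_per_day = cycles_in(WORKDAY_MIN, write_min=10, run_min=3)

print(slow_per_day, fast_per_hour, fast_per_day)  # prints: 1 4 36
```

Under these assumptions the wrangled setup fits 36 full cycles into a day, comfortably above the 25 quoted above, while the unwrangled one fits a single cycle.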

The faster you can iterate, the more likely you'll discover insights about research questions that you, your colleagues, your boss, or other stakeholders pose. That's why it's incredibly important to wrangle your data into a format where you can minimize both the time it takes you to write scripts and the time it takes for your scripts to execute. Data wrangling might take time away from doing your actual research, but I guarantee that you'll be more productive and insightful when your data is in the proper format for your analyses.

(However, don't waste time prematurely optimizing. Try working with the raw data first to see if your iteration cycle is fast enough, and wrangle only when you see a real need to do so. Suffer a bit first, so that you can have a better sense of how exactly to wrangle your data to minimize your suffering. In my case, if I can't get my analysis scripts to terminate in less than 5 minutes, then I need to do some wrangling.)


Created: 2013-06-29
Last modified: 2013-06-29