Philip Guo

Data Wrangling with MongoDB

Summary
Here are some notes on wrangling data stored in MongoDB prior to doing meaningful analyses.

I'm spending my summer analyzing data stored in a ~1 terabyte MongoDB database (version 2.2.3). However, before doing any meaningful analyses, I need to first wrangle (i.e., clean, filter, integrate) the raw data into a format that's well-suited for my analyses. I'm not a MongoDB expert by any means, so here are some notes from my data wrangling adventures as I learn its tool set.

(Some MongoDB haters might argue that I should just export the necessary data into another format, but I'll stick with MongoDB since it seems fine for my use case: a single user performing data analysis on a single desktop Linux machine using Python, which connects to MongoDB using PyMongo.)

Initial exploration

Since I didn't want to prematurely optimize, I first played around with the full (~1 terabyte) raw data set (stored in one giant MongoDB database) to see if I could do what I needed directly on it.

A few introductory web pages were very helpful in getting my feet wet.

I ran some initial basic commands over the full data set -- e.g., .find(), .aggregate(), .ensureIndex() -- and saw that anything non-trivial took a few hours to finish. Obviously that wasn't acceptable.
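For reference, the kinds of commands I started with looked roughly like this (collection and field names here are hypothetical placeholders, not my actual schema):

// rough sketch of initial exploration in the mongo shell;
// 'collectionX' and 'type' are hypothetical placeholder names
use main_db
db.collectionX.count()      // how many documents are in this collection?
db.collectionX.findOne()    // peek at the shape of a single document
db.collectionX.find({type: 'some_event'}).limit(5).pretty()  // spot-check a filter
db.collectionX.ensureIndex({type: 1})  // build a single-key index (slow on ~1 TB!)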

Even after the proper indices were built, running meaningful queries still took anywhere from 30 minutes to 1 hour. (Of course, I'm not an indexing ninja; for instance, I didn't try any multi-key indices. But whatevers.)

Thus, I clearly needed to extract a subset.

Subsetting into a separate database

Here's what I ran to extract a subset of the data (from main_db) into a separate database (called philip_guo_db):

// start with the main database containing the raw data set:
use main_db

// connect to my own separate database:
var philip_guo_db = connect('localhost:27017/philip_guo_db')

// find a subset of the data, iterate through it, and insert
// each document into a collection in my own database:
db.collectionX.find(/* subset filtering condition goes here */)
  .forEach(function (p) {
    philip_guo_db.subset_Y.insert(p);
  });

(I created a few subsets by running this command with different subset filtering conditions.)
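As a concrete illustration, a filtering condition might look something like this (the field names here are made up, not from my actual data set):

// hypothetical filter: copy only documents that have a 'username' field
// and a timestamp from 2013 onward
db.collectionX.find({username: {$exists: true},
                     timestamp: {$gte: ISODate('2013-01-01')}})
  .forEach(function (p) {
    philip_guo_db.subset_Y.insert(p);
  });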

I purposely put my subset into a separate database rather than just a new collection in the same database, since MongoDB stores each database as its own separately named set of files on disk. Doing so provides a bit more isolation and fault tolerance against mishaps, and also makes it easier to back up my own data subset. Also, since locks are often acquired per-database (I think?), my colleagues can now run write operations on the raw data (e.g., appending, cleaning, adding new indices) without interfering with my analyses.

Creating indices

After subsetting, I created single-key indices using .ensureIndex() and checked using .getIndexes(). Thankfully, index creation now only took a few minutes for my data sets.
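For example, building and then inspecting a single-key index looks like this (the field name is hypothetical):

// switch to my own database, then build a single-key ascending index
// on the (hypothetical) 'username' field of the subset collection
use philip_guo_db
db.subset_Y.ensureIndex({username: 1})

// list all indices on the collection to confirm the new one was built
db.subset_Y.getIndexes()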

Creating derived collections

Even after subsetting the data and storing it in my own database, I still inevitably need to derive additional data sets to facilitate specific analyses. That's because certain sorts of analyses are much easier when the data has been "reshaped" into a more convenient form.

The MongoDB aggregation framework is very useful for deriving auxiliary data sets. Using this framework, I can keep all of my derived data within MongoDB rather than as a bunch of loose plain-text, .csv, JSON, or (gasp!) Python pickle files littered throughout my hard drive.

For instance, say I have a giant web app event log in my database (called db.event_log), where each event has a username field. I want to create a derived collection (called db.event_log.username_to_eventcount) that maps each username to the total number of events for that user (eventcount). Here's what I ran:

var out = db.event_log.aggregate([
  {$group: {_id: "$username",
            eventcount: {$sum: 1}}}
])

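// in MongoDB 2.2, aggregate() returns a single document whose 'result'
// field holds the output array, so iterate over out.result: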
out.result.forEach(function(p) {
  db.event_log.username_to_eventcount.insert(p);
})

Prototyping tip: When writing aggregators, put {$limit: <int>} at the front of your pipeline to test your aggregator on a small subset before running on the full collection, which might take hours or even days (see docs for limit).
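For instance, prepending a {$limit: 1000} stage to the earlier aggregator tests it on just the first 1000 documents of event_log (the cutoff of 1000 is arbitrary):

// prototype run: aggregate over only the first 1000 documents
var test_out = db.event_log.aggregate([
  {$limit: 1000},
  {$group: {_id: "$username", eventcount: {$sum: 1}}}
])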

If your head is exploding trying to write a single giant aggregator, consider splitting your job up into multiple aggregator calls. For instance, you can iterate over the results of one aggregator run (using .forEach()) and feed each entry into another aggregator. Doing so might run slower in some cases, but it sure as heck beats scratching your head for hours, stuck on how to craft the perfect aggregator pipeline.
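Here is a rough sketch of that pattern, building on the earlier example; the second pipeline, its 'date' field, and the output collection name are made up purely for illustration:

// first pass: one result document per username, as before
var per_user = db.event_log.aggregate([
  {$group: {_id: "$username", eventcount: {$sum: 1}}}
]);

// second pass: for each user, run another (hypothetical) aggregator over
// just that user's events, instead of one giant combined pipeline
per_user.result.forEach(function (u) {
  var per_day = db.event_log.aggregate([
    {$match: {username: u._id}},
    {$group: {_id: "$date", eventcount: {$sum: 1}}}
  ]);
  per_day.result.forEach(function (d) {
    db.event_log.username_to_daily_eventcount.insert(
      {username: u._id, date: d._id, eventcount: d.eventcount});
  });
});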

Finally, creating derived collections often takes a long time, so instead of working in the console, I write scripts in JavaScript files and then execute them from the terminal. I add progress counters, error indicator variables, and print() statements throughout my JavaScript code so that I can view progress from the terminal. Also, I can monitor progress and sanity-check by logging into a mongo shell session while my long-running job is executing; then I can run commands such as count() to check that my derived data set is indeed growing as expected, or find() to view partial results.
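A minimal sketch of such a script, reusing the earlier (partly hypothetical) example and run from the terminal with something like mongo philip_guo_db myscript.js, might look like:

// myscript.js -- derive username_to_eventcount, with progress reporting
var out = db.event_log.aggregate([
  {$group: {_id: "$username", eventcount: {$sum: 1}}}
]);

var numInserted = 0;
var numErrors = 0;   // error indicator variable

out.result.forEach(function (p) {
  db.event_log.username_to_eventcount.insert(p);
  var err = db.getLastError();   // non-null if the previous insert failed
  if (err != null) {
    numErrors++;
  } else {
    numInserted++;
  }
  // progress counter, visible from the terminal
  if (numInserted % 10000 === 0) {
    print('inserted ' + numInserted + ' documents so far ...');
  }
});

print('done: ' + numInserted + ' inserted, ' + numErrors + ' errors');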

Recording informal provenance

I also try to "document" my data by adding a simple provenance string field to each collection with a short description of how that collection was created.

For instance, in the above example, I would write something like:

db.event_log.username_to_eventcount.provenance.save(
  {prov:
   "db.event_log.aggregate([{$group: {_id: '$username', eventcount: {$sum: 1}}}])"
  })

to record the fact that the username_to_eventcount collection was derived from event_log with the given aggregator.

Of course, this is simply informal documentation and not meant as some kind of fully-automated, machine-verifiable provenance solution. Still, it's better than no documentation! And MongoDB's schemaless nature makes it easy to add documentation directly into collections.
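For example, whenever I forget how a derived collection was made, I can read that informal documentation back with an ordinary query:

// look up how username_to_eventcount was derived
db.event_log.username_to_eventcount.provenance.find().pretty()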

Misc. useful commands

  • Appending .pretty() to the end of a query pretty-prints results into a more human-readable JSON format.

  • Running mongostat and mongotop from the command line gives a high-level overview of current server activity (mongostat shows per-second operation counts; mongotop shows which collections are busiest).

  • Within the mongo shell, db.currentOp() shows a list of currently running operations, and db.killOp(<opid>) kills a specific operation given its opid (like UNIX top and kill -9, respectively); see the example below.
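For example, a typical monitor-then-kill session looks roughly like this (the opid value is made up):

// inspect currently running operations and note the 'opid' of the culprit
db.currentOp()

// kill that operation by its opid (12345 here is a hypothetical value)
db.killOp(12345)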

Created: 2013-06-28
Last modified: 2013-06-28