Philip Guo (Phil Guo, Philip J. Guo, Philip Jia Guo, pgbovine)

Email analysis scripts for mbox mailbox files

Many email programs (e.g., Mozilla Thunderbird, Eudora, Horde Webmail) store messages in files conforming to the mbox format. I have written some Python scripts to extract and organize email header information (senders, recipients, subjects, dates, and so on) from mbox files. I used these scripts to collect data for the graphs in my article, MIT: A Life in Emails. I have only used them for this one application, so they have not been thoroughly tested. Use them at your own risk!

Because I analyzed large mbox files containing thousands of messages, I broke up my analysis into two stages to improve efficiency:

  • Stage 1: Creating an XML summary of an mbox file

    create-mbox-summary.py - Run this on an mbox file to create an XML summary (printed to stdout).

    mbox-summary.dtd - The generated XML summary file should conform to this DTD.

    For large mbox files, the XML summary file is much smaller than the original because it only contains the header information. The smaller size (and greater hierarchical organization) of the XML summary file makes it easier and more efficient to run analyses on it rather than on the mbox file.

  • Stage 2: Analysis of the XML summary file

    MboxSummary.py - Classes that are instantiated from XML summary files and whose fields support analyses involving dates, times, senders, recipients, and subjects. Also contains some auxiliary functions for filtering and building histograms.

    For an example of how to use MboxSummary.py, see MITEmailAnalysis.py, which contains functions to generate data for my article, MIT: A Life in Emails.
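
As a rough illustration of the two-stage idea (this is not the code in the scripts listed above), here is a minimal Python 3 sketch that uses only the standard library. The element names (mbox-summary, message, sender, recipient, subject, date) and the function names summarize_mbox, messages_per_month, and top_senders are made up for this example; they do not necessarily match mbox-summary.dtd or MboxSummary.py.

```python
import mailbox
import xml.etree.ElementTree as ET
from collections import Counter
from email.utils import getaddresses, parsedate_to_datetime


def summarize_mbox(mbox_path):
    """Stage 1: build an XML tree holding only header fields of each message."""
    root = ET.Element("mbox-summary")  # hypothetical element names throughout
    for msg in mailbox.mbox(mbox_path):
        node = ET.SubElement(root, "message")
        ET.SubElement(node, "subject").text = str(msg.get("Subject", ""))
        ET.SubElement(node, "date").text = str(msg.get("Date", ""))
        # Store bare, lowercased addresses; skip empty parse results.
        for _name, addr in getaddresses([str(msg.get("From", ""))]):
            if addr:
                ET.SubElement(node, "sender").text = addr.lower()
        fields = msg.get_all("To", []) + msg.get_all("Cc", [])
        for _name, addr in getaddresses([str(f) for f in fields]):
            if addr:
                ET.SubElement(node, "recipient").text = addr.lower()
    return root


def messages_per_month(root):
    """Stage 2: histogram of message counts keyed by 'YYYY-MM'."""
    counts = Counter()
    for node in root.iter("message"):
        try:
            when = parsedate_to_datetime(node.findtext("date", ""))
        except (TypeError, ValueError):
            continue  # missing or unparseable Date header
        counts[when.strftime("%Y-%m")] += 1
    return counts


def top_senders(root, n=10):
    """Stage 2: the n most frequent sender addresses with their counts."""
    return Counter(node.text for node in root.iter("sender")).most_common(n)
```

With these definitions, ET.ElementTree(summarize_mbox("inbox.mbox")).write("summary.xml") would save the summary to disk (both file names are placeholders), and ET.parse("summary.xml").getroot() could later be passed to the Stage 2 functions without rereading the mbox file.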

I'm not enthusiastic enough right now to provide more detailed explanations and examples, because I figure that if you want to use this code, you probably have a pretty good idea of what you are doing. Best of luck in mbox hacking, and email me if you come up with any interesting analyses based on the framework provided by these scripts!

Created: 2006-09-16
Last modified: 2006-09-16