In my continuing look at best-in-class investigation capabilities, I am constantly on the hunt for new ways that technology can be applied to investigative work. There is a limitless supply of data, as we all know, and bringing together massive amounts of data from a variety of sources is key to allowing an investigator to do her job. More than ever, however, the analysis of simple email messages is becoming part and parcel of analyzing, investigating, and adjudicating criminal behavior.
Email is one of a number of common communications and social media formats that are becoming an increasingly important source of investigative data for the intelligence analyst. While many analysts have experience in acquiring and analyzing telephone call data, email is often seen as a job for forensic specialists because of the additional challenges in converting and modeling this type of data. In this article, we will focus on a small number of techniques for the processing and analysis of email data that can be adopted by the crime or intelligence analyst. We will consider:
- The components of an email message and how they lend themselves to different analytical techniques
- How to prepare the data for analysis
- An example analytical workflow in which different tools and techniques are employed to create and test different types of hypotheses.
Email Message Components
First, it is worthwhile to examine the components that make up an email message, as these offer different opportunities for analysis. It will be apparent when reading an email message that it is composed of different components: the header, the body, and optionally, a set of attachments.
The header is largely structured data with fields such as From, To, CC, Date/Time Sent, etc. There are other important pieces of metadata hidden in the header as well (many email clients provide the option to view this data, but most users will never want to see it), but we will focus on the fields described above. The header also contains a Subject field, the contents of which will be a free text description of what the message is about.
The body of the message consists of free text. You may also see images embedded in the body of a message, but these are essentially attachment files. Attachments can include pictures, documents and other files, each of which will possess various properties that may be of interest to some investigations.
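To make the three components concrete, here is a minimal sketch that splits one message into header fields, body text, and attachment names using Python's standard `email` package. The sample message, the addresses, and the `describe` helper are all invented for illustration; real mailboxes will contain many more header fields and messier structure.

```python
from email import message_from_string
from email.policy import default
from email.utils import getaddresses, parsedate_to_datetime

# A made-up message with one plain-text body and one attachment.
RAW = """\
From: Alice Example <alice@example.com>
To: bob@example.com, Carol <carol@example.com>
Cc: dave@example.com
Date: Mon, 05 Mar 2001 09:30:00 -0600
Subject: Quarterly figures
Message-ID: <1@example.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain

Figures attached.
--XX
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="q1.xls"
Content-Transfer-Encoding: base64

AAEC
--XX--
"""


def describe(raw):
    """Split a raw message into structured header data, body text and attachments."""
    msg = message_from_string(raw, policy=default)

    def addresses(field):
        # A header may list several accounts; keep only the address part.
        return [addr.lower() for _, addr in getaddresses(msg.get_all(field, []))]

    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": addresses("From"),
        "to": addresses("To"),
        "cc": addresses("Cc"),
        "sent": parsedate_to_datetime(msg["Date"]),
        "subject": msg["Subject"],
        "body": body.get_content().strip() if body else "",
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }


info = describe(RAW)
print(info["from"], info["to"], info["attachments"])
```

The header yields clean, structured values (accounts and a timestamp), while the subject and body remain free text, which is why the two halves lend themselves to different analytical techniques.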
Forwarded mails and other duplications (e.g. the same message held in several user mailboxes) add further complexity to any email data set.
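One common way to handle exact copies is to key each message on its Message-ID header and record every mailbox it was found in. The sketch below assumes messages have already been reduced to dictionaries with hypothetical keys (`message_id`, `mailbox`, and so on), and falls back to a content hash when no Message-ID is available. Forwarded mails usually receive a new Message-ID, so they would not be merged by this approach and need separate treatment.

```python
import hashlib


def message_key(msg):
    """Return the Message-ID if present, otherwise a hash of the main fields."""
    mid = (msg.get("message_id") or "").strip()
    if mid:
        return mid
    parts = [msg.get("from", ""), msg.get("sent", ""),
             msg.get("subject", ""), msg.get("body", "")]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def deduplicate(messages):
    """Keep one copy of each message and list the mailboxes that held it."""
    seen = {}
    for msg in messages:
        entry = seen.setdefault(message_key(msg), {"message": msg, "mailboxes": []})
        entry["mailboxes"].append(msg.get("mailbox"))
    return list(seen.values())


copies = [
    {"message_id": "<1@example.com>", "mailbox": "alice.pst"},
    {"message_id": "<1@example.com> ", "mailbox": "bob.pst"},
    {"message_id": "", "from": "carol@example.com", "sent": "2001-03-05",
     "subject": "hi", "body": "x", "mailbox": "carol.pst"},
]
unique = deduplicate(copies)
print(len(unique), unique[0]["mailboxes"])
```

Keeping the list of mailboxes rather than discarding the copies preserves the fact that several people held the same message, which can itself be of investigative interest.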
Most people’s experience of email is through the email client applications they use on a daily basis to compose and read their email messages. The two main types of client are the local (or fat) client such as Microsoft Outlook and web-based clients which include Yahoo Mail and Gmail. Where the actual email data is stored can vary between email clients. In the case of “web mail”, the data is stored and controlled by the hosting organization, so legal steps are needed to acquire such information. Users of local clients can download copies of email from a local email server, or alternatively access their mailbox on the server.
In the context of crime analysis, we can usually assume that the analyst approaches an email data set with one or more specific questions in mind: a particular email account, a slice of time, a key word or attachment. These questions may be even more abstract in nature – are there groups of people that are working closely together, do individuals assume roles (gatekeeper, leader), are there differences between email communications and the organizational chart?
In the examples that we will work through in this article, we will start with a particular Organization in mind, identify a set of key messages, and then turn our attention to the wider picture of how these discussions evolve over time, identifying some key players along the way. We will use two pieces of software to complete the analysis – Vound Software’s Intella®, and Xanalys Limited’s Link Explorer®.
We’ll start our analysis by identifying a set of email messages that mention a certain company. The software application used for this first phase is Intella, a “forensic search” tool. It allows the analyst to load a range of different email data sources ready for analysis (thus solving our first challenge of how to represent the email data), and it provides a powerful search mechanism that lets an analyst with little or no computer forensic experience quickly identify emails of interest.
Intella can identify and interrogate not only the text content of an email, but also its metadata, binary content, encrypted items and various attachment types.
Associational and Temporal Analysis
Having identified a set of mails of interest, we will then move the data into a database where it can be modelled as a set of objects and links. Email, like telephone data, has both transactional (From/To) and temporal (Date/Time Sent) elements, so representing the data as a graph provides the analyst with the opportunity to apply a range of link analysis and social network analysis techniques to the dataset. Before you start your analysis, make sure to establish your investigative goals and tasks, as these will largely determine the tools and techniques you use and will keep your lines of enquiry focused.
As a data source, I have used a small set of the 148 PST files belonging to the publicly released Enron data set. Having acquired these PST files, the first step is to convert the native email files into a common format; this is especially important if we are dealing with a range of sources from different email clients.
Using Intella, I select the two PST source files, and Intella loads and converts them ready for analysis. I have randomly selected the company “F&J” as the start of my inquiry. When I enter the string “F&J” as my first search parameter, Intella returns the set of matching messages as a cluster in a visualization of the search results. Clicking on this cluster displays the messages in a grid view.
As further search strings are added, the cluster map shows how these parameters interact. In the following picture, three search terms have been entered and a single mail is shown to contain all three.
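The idea behind the cluster map can be approximated in a few lines: each search term yields a set of matching messages, and the overlaps between those sets show how the terms interact. This is only a sketch of the concept, not how Intella works internally. The message dictionaries and the terms other than “F&J” are hypothetical.

```python
def search(messages, terms):
    """Map each term to the ids of messages whose subject or body contains it."""
    hits = {}
    for term in terms:
        needle = term.lower()
        hits[term] = {
            m["id"] for m in messages
            if needle in (m["subject"] + " " + m["body"]).lower()
        }
    return hits


def in_all(hits):
    """Return the ids of messages matched by every term."""
    sets = list(hits.values())
    return set.intersection(*sets) if sets else set()


sample = [
    {"id": "m1", "subject": "F&J contract", "body": "Please treat as urgent."},
    {"id": "m2", "subject": "Lunch", "body": "F&J are buying."},
    {"id": "m3", "subject": "contract", "body": "urgent"},
]
hits = search(sample, ["F&J", "contract", "urgent"])
print(hits["F&J"], in_all(hits))
```

Simple substring matching is used here for brevity; a real search tool would tokenize and index the text rather than scan every message for every term.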
The items in the grid can be opened, sorted, and captured in unique categories for future reference.
Having identified a set of messages-of-interest, we will widen the scope of our inquiry to consider:
- The social networks involved in the discussions, including highlighting the senders and receivers of messages in these conversation threads
- How the network evolves over time – who joins the conversation and who leaves it
- The nature of particular user activity prior to and after key email messages concerning F&J were sent.
In addition to providing import functions for a wide range of email formats, Intella can also export sets of the converted email data into a structured CSV format to be viewed in Microsoft Excel and/or loaded into link analysis or other analytical tools. For this example, we will export not only the set of messages that we were searching for, but the entire data set. This is needed to extend our search beyond the starting set of “F&J” data.
The second application used in this example, as noted above, is Xanalys Link Explorer, which can be used to query, visualize, and analyze massive data sets.
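For readers without access to such a tool, a structured export of this kind can be sketched with Python's `csv` module. The column names below are my own assumption and do not reflect Intella's actual export layout; multiple recipients are packed into one cell with a semicolon separator.

```python
import csv
import io

# Hypothetical column layout: one row per message, header fields only.
FIELDS = ["id", "sent", "from", "to", "cc", "subject"]


def to_csv(messages):
    """Serialize message header data as CSV text, one row per message."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for m in messages:
        row = dict(m)
        row["to"] = ";".join(m.get("to", []))
        row["cc"] = ";".join(m.get("cc", []))
        writer.writerow(row)
    return buf.getvalue()


export = to_csv([{
    "id": "m1",
    "sent": "2001-03-05T09:30:00",
    "from": "alice@example.com",
    "to": ["bob@example.com", "carol@example.com"],
    "cc": [],
    "subject": "F&J, update",
    "body": "not exported",
}])
print(export)
```

The `csv` module quotes any field containing a comma, so a subject such as “F&J, update” survives the round trip into a spreadsheet or link analysis tool.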
The first task in Link Explorer is to import the Intella CSV files into a database (Link Explorer also comes with its own database). This import process not only transfers and loads the data, but also matches and unifies common elements within the data (e.g. if an email address is mentioned in the header of multiple messages, it is treated as a single entity linked to the many message entities).
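The matching and unifying step can be illustrated with a small sketch that collapses every mention of an address into a single account entity linked to the messages it appears in. The row layout (`id`, `from`, and a semicolon-separated `to`) is hypothetical, and lower-casing the whole address is a simplifying assumption. This is not a description of Link Explorer's own matching rules.

```python
def unify(rows):
    """Build one entry per distinct address, listing (message id, role) links."""
    accounts = {}
    for row in rows:
        mentions = [("from", row["from"])]
        mentions += [("to", addr) for addr in row["to"].split(";") if addr.strip()]
        for role, addr in mentions:
            key = addr.strip().lower()  # normalize so variants match
            accounts.setdefault(key, []).append((row["id"], role))
    return accounts


rows = [
    {"id": "m1", "from": "Alice@Example.com", "to": "bob@example.com;carol@example.com"},
    {"id": "m2", "from": "bob@example.com", "to": " alice@example.com"},
]
accounts = unify(rows)
print(accounts["alice@example.com"])
```

Here the two differently written forms of Alice's address resolve to one entity with links to both messages, which is what makes the later network views possible.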
Once loaded, the user can view the data not as columns and rows in a grid, but as sets of entities and links – in fact, two different models are created:
Although these two models essentially represent the same information, in the first model the message is represented by a directional link between the two email accounts, while in the second, the message is represented by an actual object. Each model has certain benefits when querying and viewing data. Clearly, the main focus is on the data stored in the “header” of the email: date/time sent, From, To, CC, etc. However, Link Explorer also supports free text search, which can be applied to the Subject field and, if the body of the document is also imported into the database, to the main text content of the email.
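The difference between the two models is easy to see when both are generated from the same data. In this sketch, with hypothetical field names and relationship labels, the first function treats each message as a directed link between accounts, and the second treats the message as an object in its own right.

```python
def as_links(messages):
    """Model 1: each message is a directed link (sender, recipient, message id)."""
    return [(m["from"], rcpt, m["id"]) for m in messages for rcpt in m["to"]]


def as_objects(messages):
    """Model 2: each message is a node connected to its sender and recipients."""
    edges = []
    for m in messages:
        edges.append((m["from"], "sent", m["id"]))
        edges.extend((m["id"], "received_by", rcpt) for rcpt in m["to"])
    return edges


mail = [{"id": "m1", "from": "alice", "to": ["bob", "carol"]}]
print(as_links(mail))
print(as_objects(mail))
```

The link model gives a cleaner picture of who talks to whom, while the object model keeps each message available as something that can carry its own properties, such as the subject or the sent date.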
Having completed the export/import process, we recreate the “F&J” message set in Link Explorer.
Next, we apply the Explore function on these messages, nominating to return only the senders of these messages. “Explore” is a data mining feature that returns linked objects at one or more degrees of association to our starting object(s). An associated feature that we won’t use in this example is “Find Links” which attempts to identify pathways between two objects in the database. Both of these features are extremely useful in determining associations between email accounts.
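Conceptually, both features are graph traversals. The sketch below is my own approximation using a breadth-first search over an undirected adjacency map. The function names `explore` and `find_path` are illustrative and are not Link Explorer's API, and the real features offer filtering options that are not modelled here.

```python
from collections import deque


def build_adjacency(edges):
    """Turn (a, b) pairs into an undirected adjacency map."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def explore(adj, start, degrees=1):
    """Return all nodes within `degrees` hops of the starting nodes."""
    seen = set(start)
    frontier = set(start)
    for _ in range(degrees):
        frontier = {n for node in frontier for n in adj.get(node, ())} - seen
        seen |= frontier
    return seen - set(start)


def find_path(adj, source, target):
    """Return one shortest path between two nodes, or None if unconnected."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for n in sorted(adj.get(path[-1], ())):
            if n not in seen:
                seen.add(n)
                queue.append(path + [n])
    return None


adj = build_adjacency([("m1", "alice"), ("m1", "bob"), ("m2", "bob"), ("m2", "carol")])
print(explore(adj, ["m1"], degrees=1))
print(find_path(adj, "alice", "carol"))
```

In this toy data, Alice and Carol never exchange a message directly, but the path search shows they are connected through Bob.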
Having identified the senders of the mails, we tag, or “capture”, these objects so that they can be easily selected in future charts.
Next, Explore is used again to return the full set of email accounts associated with these messages. The resulting diagram shows us a network of accounts and messages, with the red links indicating the message had been sent from that account.
Here it is useful to view this data set using the alternative model – by removing the message entities, we obtain a clearer picture of the social network involved in these threads of conversation. The number of messages “bundled” under each link is indicated on the links.
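The bundling of messages under each link amounts to counting messages per sender and recipient pair. A minimal sketch, assuming messages are dictionaries with hypothetical `from` and `to` keys:

```python
from collections import Counter


def bundle(messages):
    """Count how many messages travel along each (sender, recipient) link."""
    counts = Counter()
    for m in messages:
        for rcpt in m["to"]:
            counts[(m["from"], rcpt)] += 1
    return counts


thread = [
    {"from": "alice", "to": ["bob"]},
    {"from": "alice", "to": ["bob", "carol"]},
    {"from": "bob", "to": ["alice"]},
]
links = bundle(thread)
print(links.most_common())
```

The links are kept directional here, so traffic from Alice to Bob is counted separately from traffic from Bob to Alice.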
We can gain quite a different perspective of this network if we take into account the temporal aspects of the data. Having displayed the chart as a timeline, the 3 senders of the messages are selected (using the “tag” we created earlier). Note that there are definite patterns in how participants in the F&J threads join and leave the message threads.
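A simple way to surface these join and leave patterns outside a charting tool is to record the first and last time each account appears in the thread. The sketch below uses invented accounts and dates, and counts both sending and receiving as participation.

```python
from datetime import datetime


def participation(messages):
    """Map each account to the (first, last) time it appears in the thread."""
    spans = {}
    for m in sorted(messages, key=lambda m: m["sent"]):
        for account in [m["from"], *m["to"]]:
            first, _ = spans.get(account, (m["sent"], None))
            spans[account] = (first, m["sent"])
    return spans


timeline = [
    {"sent": datetime(2001, 3, 1), "from": "alice", "to": ["bob"]},
    {"sent": datetime(2001, 3, 5), "from": "bob", "to": ["alice", "carol"]},
    {"sent": datetime(2001, 3, 9), "from": "carol", "to": ["alice"]},
]
for account, (first, last) in participation(timeline).items():
    print(account, first.date(), last.date())
```

In this toy thread Bob drops out after 5 March while Carol only joins on that date, the kind of pattern that a timeline chart makes visible at a glance.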
Zooming in on the end of the thread, we see the conversation filter down to just two participants, one of whom was not involved earlier. We also note that the subject of the message now has the added label of “urgent”.
Can we learn more about this person and what their role in the organization is? Selecting the icon representing the account, we create a new chart containing just that entity. We then apply Explore again, this time returning only the messages they sent or received around the time of those “urgent” messages. We do this by constraining our Explore operation with a start and end date.
Our final chart reveals a set of messages, some of which concern the Company President. Double-clicking on the icon will open a copy of the message that can be reviewed.
In this article, we have introduced a number of simple analytical techniques, some of which are commonly used on other communication data such as telephone call records, that allow us to query and visualize a corpus of email data. For the analyst, while it is important to gain a practical knowledge of these techniques, it is just as important to understand the different components of email data, and which of these should be focused on to reveal the information relevant to their investigation.
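The date-constrained query can be expressed as a filter on two conditions: the account is a sender or recipient, and the sent time falls inside the chosen window. The account names, dates, and the `messages_around` helper below are hypothetical.

```python
from datetime import datetime


def messages_around(messages, account, start, end):
    """Return messages sent or received by `account` between start and end."""
    return [
        m for m in messages
        if start <= m["sent"] <= end
        and (m["from"] == account or account in m["to"])
    ]


corpus = [
    {"id": "m1", "sent": datetime(2001, 3, 1, 9, 0), "from": "dave", "to": ["alice"]},
    {"id": "m2", "sent": datetime(2001, 3, 8, 14, 0), "from": "alice", "to": ["dave"]},
    {"id": "m3", "sent": datetime(2001, 3, 9, 10, 0), "from": "bob", "to": ["carol"]},
    {"id": "m4", "sent": datetime(2001, 3, 20, 8, 0), "from": "dave", "to": ["bob"]},
]
window = messages_around(corpus, "dave", datetime(2001, 3, 7), datetime(2001, 3, 10))
print([m["id"] for m in window])
```

Narrowing the window in this way keeps the resulting chart small enough to read each message, rather than returning the account's entire history.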