
DNC Bernie Sanders Emails

WikiLeaks has provided people like myself with an abundance of material to download, analyze, and visualize, and ultimately to share insights on the behavior of elites; in this case, the emails from the Democratic National Committee (DNC). With this data source, we can mine specific aspects of the dataset using a simple search term on the WikiLeaks site. For this post and the accompanying visualizations, I have chosen to examine the DNC's treatment of Bernie Sanders, who emerged as a serious contender for the Democratic nomination.

It was revealed through many of these emails that the DNC was consciously favoring Hillary Clinton over the upstart Sanders. In this post, I will examine the linkages between both insiders at the DNC and outside contacts such as reporters and campaign personnel. To do this, I'll employ Gephi, the open source network analysis tool, followed by Sigma.js for visualizing the final networks on the web. The initial goal will be to understand the relationships in the network, using a variety of analytic measures such as centrality, modularity, connected components, and degrees. Using these measures, we will be able to better understand how data flowed both into and out of the DNC via the email channel.

What we'll wind up with is essentially a meta-view of the DNC's email activities. Our initial pass at the data using network analysis will not focus on the content of the emails; for that, we'll do some subsequent text mining to help us understand both the content and tone of the email exchanges. I hope to be able to tie these two pieces together, so that we may ultimately understand who was saying what about Sanders, and who it was being communicated to. Let's get started with the network analysis by providing some background on the graph statistics to be employed.

Let's start our overview of graph stats with the centrality measures: closeness, harmonic closeness, and betweenness. Each of these is designed to communicate information about an individual node and its relationship to the entire network. In our case, each node represents an individual email address in the dataset, a proxy for a real person inside or outside the DNC. Therefore, determining centrality by email address will give us a strong indication of who the information flows to, through, or from. The two closeness measures are read here as a proxy for the effort required to reach all other nodes within the network, so a lower score indicates greater centrality. Generally, we would anticipate nodes placed in the center of the graph to have lower closeness scores, while nodes located at the far edges typically require more effort to traverse the network and will thus have higher scores. Be aware that many tools report the reciprocal of the average distance instead, in which case higher values indicate greater centrality; check which convention your software uses before comparing numbers.
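To make the closeness measures concrete, here is a minimal sketch in plain Python. It assumes a small, made-up undirected graph (the node names are hypothetical, not addresses from the DNC data) and is not how Gephi computes its statistics internally. It reports the average shortest-path distance (the "effort" reading used in this post, where lower means more central) alongside its reciprocal and a normalized harmonic closeness, where higher means more central.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def closeness_measures(adj, node):
    """Return (average distance, reciprocal closeness, harmonic closeness)."""
    dist = bfs_distances(adj, node)
    others = [d for n, d in dist.items() if n != node]
    if not others:
        return 0.0, 0.0, 0.0
    avg_distance = sum(others) / len(others)
    # Harmonic closeness: mean of 1/distance over all other nodes.
    harmonic = sum(1.0 / d for d in others) / (len(adj) - 1)
    return avg_distance, 1.0 / avg_distance, harmonic

# Hypothetical toy graph: a simple path a - b - c - d - e
toy = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
scores = {n: closeness_measures(toy, n) for n in toy}
```

In this toy path, the middle node `c` has the smallest average distance and the end node `a` the largest, which is the center-versus-edge pattern described above.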

Betweenness centrality takes a different approach; nodes with a high betweenness score are not necessarily at the center of the action, but may provide a direct connection (also known as a bridge) between two otherwise unconnected groups. These are thus very important people within any sort of connected structure, as they facilitate communication across the network.
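The bridging idea can be illustrated with a short sketch of unnormalized betweenness, using Brandes' accumulation scheme on an undirected toy graph. The graph and node names are invented for illustration; the DNC network itself is directed, and Gephi's own implementation may differ in details.

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness for an unweighted, undirected graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: b / 2 for v, b in score.items()}

# Hypothetical toy graph: two triangles joined through a single bridge node "d"
toy = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f", "g"],
    "f": ["e", "g"],
    "g": ["e", "f"],
}
bridge_scores = betweenness(toy)
```

Node `d` has only two connections, yet every path between the two triangles runs through it, so it receives the highest score in the toy graph.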

Let's look at a few examples from the graph to understand these measures better. First, by Closeness Centrality, as scaled between 0 and 1, with lower scores indicative of greater centrality:

Node Closeness
MirandaL@dnc.org .670886
PaustenbachM@dnc.org .602273
BrinsterJ@dnc.org .666667

Interestingly, these three nodes have relatively high closeness scores in spite of their apparent importance within the network as adjudged by their high degree levels. However, when we view the graph, it becomes apparent that none of these are near the geographic center of the network. Instead, each is somewhat central to their own class, but not the graph as a whole. Of the three, Paustenbach is more central in terms of requiring shorter average paths to reach all other nodes in the network, but there are many other members of the graph with lower closeness scores.

If we view Betweenness Centrality scores, a different story emerges. These scores are not normalized, and higher is indicative of greater influence. In simple terms, betweenness tells us who is most important as a connector between other nodes in the network. Here, it is Luis Miranda who exercises great influence; many others in the network must pass through Miranda to connect with one another.

Node Betweenness
MirandaL@dnc.org 1862.07619
PaustenbachM@dnc.org 660.359524
BrinsterJ@dnc.org 418.0

It appears that Miranda is roughly three times as important as Paustenbach as a connector (bridge) between the people in the graph. Brinster plays even less of a role as a conduit between people. In fact, these are the three most critical people as measured by their ability to link members of the network, with Miranda as a clear number one.

We can also look at network influence based on the sheer volume of connections, as measured by degrees. In the case of a directed network, such as the DNC email instance, we can also examine in-degrees (receiving emails) and out-degrees (sending emails). This information will be extremely helpful in determining the roles of the various players in the network. First, let's look at the total degrees for our three principals:

Node Degrees
MirandaL@dnc.org 74
PaustenbachM@dnc.org 32
BrinsterJ@dnc.org 28

Miranda has clearly formed the most connections for a combination of inbound and outbound emails mentioning Sanders. What about just in-degrees?

Node In-Degrees
MirandaL@dnc.org 39
PaustenbachM@dnc.org 12
BrinsterJ@dnc.org 23

Now we see that Brinster, despite having fewer total connections, has nearly double the in-degree count of Paustenbach. A majority of the emails for Brinster have him on the receiving end. This situation reverses itself for outbound communications, where Paustenbach plays a larger role. Of course, Miranda still has the highest number of connections.

Node Out-Degrees
MirandaL@dnc.org 35
PaustenbachM@dnc.org 20
BrinsterJ@dnc.org 5

These numbers tell us that the emails referencing Sanders are both coming and going for Miranda, while Paustenbach is mostly on the sending side and Brinster on the receiving end. These patterns likely correspond with their specific roles within the DNC.
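As a rough sketch of how the three degree counts relate, the snippet below tallies in-degree, out-degree, and total degree from a directed edge list. The edge list and names are hypothetical, and each (sender, recipient) pair is assumed to appear once. Total degree is simply in-degree plus out-degree, which is consistent with the tables above (for example, 39 + 35 = 74 for Miranda).

```python
from collections import Counter

# Hypothetical directed edges: (sender, recipient), one entry per distinct pair
edges = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "alice"),
    ("dave", "alice"),
    ("carol", "bob"),
]

out_degree = Counter(sender for sender, _ in edges)       # addresses written to
in_degree = Counter(recipient for _, recipient in edges)  # addresses heard from

nodes = set(out_degree) | set(in_degree)
degree = {n: in_degree[n] + out_degree[n] for n in nodes}
```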

Another aspect of many networks relates to how users cluster into groups, based on either formal structures or informal networks. One of the best ways to track this using Gephi is through the Modularity Class statistic, which assesses the relationships between all nodes and places them in a specific class (or cluster) based on their behaviors and relationships. In a best case scenario, the clusters should be visually distinct from one another, making it easy for the viewer to detect patterns and relationships. It is often very useful to color the nodes based on modularity; we will employ this approach with the DNC network. Miranda and Paustenbach wind up in Class 1, while Brinster is in Class 2.
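To give a feel for what a modularity-based grouping rewards, here is a small sketch that computes the modularity score Q of a given partition on an undirected toy graph. This only scores a partition; it does not search for the best one, which is what a community detection routine such as Gephi's does. The graph and class labels are invented for illustration.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an unweighted, undirected graph."""
    m = len(edges)
    internal = {}    # edges with both endpoints inside a community
    degree_sum = {}  # total degree of the nodes in a community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (internal share - expected share)
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )

# Hypothetical toy graph: two triangles joined by the single edge c - d
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_classes = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_class = {n: 1 for n in two_classes}

q_split = modularity(edges, two_classes)
q_single = modularity(edges, one_class)
```

Splitting the toy graph into its two triangles scores higher than lumping everything into one class, which is why densely connected groups end up in separate modularity classes.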

Finally, we can employ two additional measures of influence in the graph, hub and authority, derived using the HITS algorithm developed by Jon Kleinberg to measure web page influence (see https://github.com/gephi/gephi/wiki/HITS). This algorithm scores (from 0 to 1) the influence of individual nodes in the network; think of each node as its own web page within the DNC network graph. Here are the scores for our trio of players, hub first, followed by authority.

Node Hub Authority
MirandaL@dnc.org .513207 .572891
PaustenbachM@dnc.org .419807 .260051
BrinsterJ@dnc.org .033412 .127633

Essentially, a node's hub value is the normalized sum of the authority values of the nodes it points to, while its authority value is the normalized sum of the hub values of the nodes pointing to it. If this seems a bit confusing, just think of both as relative measures of influence within a network. Miranda is quite high on both measures, while Paustenbach is considerably higher on hub than authority, with Brinster just the opposite. In layman's terms, Brinster sends to few high-authority addresses, which reduces his hub score. If you would like more details, see Kleinberg's paper, "Authoritative Sources in a Hyperlinked Environment."
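The mutual definition of hub and authority is easier to see in code. Below is a minimal sketch of the HITS iteration on a hypothetical directed edge list. Normalizing each score vector to sum to 1 and running a fixed 50 iterations are assumptions made for this example; Gephi's normalization and stopping rule may differ, so the absolute values will not match its output.

```python
def hits(edges, iterations=50):
    """Return (hub, authority) scores for a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {}
    for _ in range(iterations):
        # Authority: sum of hub scores of the nodes pointing at you.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub: sum of authority scores of the nodes you point at.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize so each vector sums to 1 (assumption for this sketch).
        auth_total = sum(auth.values()) or 1.0
        hub_total = sum(hub.values()) or 1.0
        auth = {n: a / auth_total for n, a in auth.items()}
        hub = {n: h / hub_total for n, h in hub.items()}
    return hub, auth

# Hypothetical edges: "h" writes to three addresses, "g" to only one
edges = [("h", "x"), ("h", "y"), ("h", "z"), ("g", "x")]
hub, auth = hits(edges)
```

A node that only sends (`h`) ends up with hub score but no authority, while a node that only receives (`x`) is the reverse, mirroring the sender-heavy and receiver-heavy profiles discussed above.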

Alright, it's time to move on to the graphs, beginning with the full network, which can be found here in interactive mode. Here's a static view of the network:

DNC Sanders network

The network nodes have been colored by modularity class, as noted earlier, to make it easy to identify some patterns. We can quickly see a large grouping of green nodes dominating the network; this would be a good starting point for any analysis. Other clusters can be identified beyond this dominant group, with some loose connections binding the various groups. It should be noted at this point that only the graph's so-called giant component is shown here, which has eliminated some distant nodes that formed their own very small groups.
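Keeping only the giant component can be sketched as follows, assuming a hypothetical directed edge list and treating edges as undirected when grouping nodes (that is, using weakly connected components). This is an illustration of the idea rather than Gephi's filter implementation.

```python
from collections import defaultdict, deque

def weakly_connected_components(edges):
    """Group nodes into components, ignoring edge direction."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    seen = set()
    components = []
    for start in list(neighbors):
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components

# Hypothetical edges: one four-node group plus an isolated pair
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("x", "y")]
components = weakly_connected_components(edges)
giant = max(components, key=len)
giant_edges = [(s, t) for s, t in edges if s in giant and t in giant]
```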

To understand the network and its components better, we can filter the graph by modularity class, thus creating subgraphs with only the members of our selected classes. Four of these classes make up the entire giant component, with two clusters accounting for more than 75% of the nodes in the entire network. Let's take the classes in order of importance, as adjudged by the number of nodes represented in a class. Modularity Class 1 is by far the most critical cluster in the network, with nearly 50% of all nodes. This is likely where the most interesting insights will be found, simply based on the sheer volume of emails to be found. Let's have a look at the interactive version here, or a static view:

DNC Sanders network Mod 1
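Filtering by modularity class amounts to keeping the nodes with a given class label and the edges among them. A minimal sketch, with made-up nodes and class assignments, also computes each class's share of the nodes, which is the measure used here to rank the classes.

```python
from collections import Counter

# Hypothetical node -> modularity class assignments and directed edges
node_class = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 0}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]

def class_shares(node_class):
    """Fraction of all nodes that falls in each class."""
    counts = Counter(node_class.values())
    total = len(node_class)
    return {cls: count / total for cls, count in counts.items()}

def subgraph_for_class(edges, node_class, cls):
    """Keep only edges whose endpoints both belong to the given class."""
    return [
        (s, t) for s, t in edges
        if node_class[s] == cls and node_class[t] == cls
    ]

shares = class_shares(node_class)
class_1_edges = subgraph_for_class(edges, node_class, 1)
```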

This class is where we find notables such as Luis Miranda, former Communications Director for the DNC, as well as his colleagues Amy Dacey and Brad Marshall, former Chief Executive and Chief Financial Officer, respectively, for the organization. Donna Brazile, the disgraced former chairwoman and CNN commentator, is also part of this cluster. Miranda is at the center of things, with nearly equal numbers of inbound and outbound connections. A secondary hub can be found for Mark Paustenbach, DNC National Press Secretary, using the PaustenbachM@dnc.org email address.

The second most influential cluster is modularity class 2, accounting for more than 25% of the email addresses in the network. DNC Research Associate Jeremy Brinster (BrinsterJ@dnc.org) is influential within this cluster, primarily through inbound email communications. Many of these emails are from the Sanders campaign via the Bernie2016press@berniesanders.com and michael@berniesanders.com addresses. These two addresses account for 43 and 50 communications, respectively, as portrayed by the very thick edges connecting them to Brinster. The interactive graph is here, with a static view shown below.

DNC Sanders network Mod 2
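The thick edges come from edge weights, where repeated emails between the same sender and recipient are collapsed into a single weighted edge. Here is a sketch of that aggregation, assuming each email has already been reduced to a sender and a list of recipients. The records and addresses below are placeholders, not data from the leak.

```python
from collections import Counter

# Hypothetical email records: (sender, [recipients])
emails = [
    ("press@campaign.example", ["staffer@org.example"]),
    ("press@campaign.example", ["staffer@org.example", "director@org.example"]),
    ("director@org.example", ["staffer@org.example"]),
    ("press@campaign.example", ["staffer@org.example"]),
]

# One weighted edge per (sender, recipient) pair; weight = number of emails
weights = Counter()
for sender, recipients in emails:
    for recipient in recipients:
        weights[(sender, recipient)] += 1

# Heaviest edges first, as they would appear thickest in the graph
heaviest = weights.most_common()
```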

Classes 0 and 3 are considerably less influential, each with roughly 10% of the graph's members. It appears that Class 0 is composed primarily of external contacts, particularly from the Politico website, who are seeking information through Jordan Kaplan, National Finance Director for the DNC. Meanwhile, Class 3 is centered around Ryan Banfill, with external contacts apparently seeking information. Here is Class 0, both interactive and static:

DNC Sanders network Mod 0

Finally, Class 3 interactive and static:

DNC Sanders network Mod 3

I hope you found this interesting and will take some time to work with the interactive versions of the graphs. Next up will be some text analysis of the actual email exchanges.
