07 Apr 2017, 17:00

State of the Union Text Analysis

The State of the Union, or its equivalent initial speech from a US president to Congress, is likely to provide ideas about where that particular president would like to steer the country on his watch. By analyzing these speeches across all presidents who delivered one (a few died before having the opportunity), we should be able to see which presidents had similar thought processes and political beliefs across a 225-year period. We should also be able to detect significant changes in how these speeches were delivered, and which topics were central to each speech. To follow through on this, I have taken the first available speech to Congress for each president and analyzed it using a mix of text extraction, text processing, and data visualization approaches. Let's see what this process reveals about the individual politicians, as well as any larger changes that might have occurred over the last two-plus centuries.

Our first attempt to analyze text patterns within these speeches comes courtesy of a set of exceptional tools: Aylien for entity extraction, Exploratory for additional text manipulation, Gephi for network graph creation and exploration, and finally, sigma.js for creating a series of interactive web displays.

Our source data for the project comes from the American Presidency Project, a site dedicated to gathering and sharing the papers and speeches of all American presidents. This provides us with a lot of rich material to analyze, including the State of the Union or similar speeches delivered by each president to Congress. I thought it would be interesting to compare and contrast the initial speech delivered in each president's tenure, although it would no doubt also be fascinating to trace the changes within the annual speeches of an individual president over his term in office.

The first step was to use Aylien's powerful entity extraction tool, which reads a web page at a provided address, analyzes the text, and returns results in JSON format, organized into categories such as keywords, locations, and organizations, among others. For the purposes of this project, I selected the aforementioned keyword, location, and organization categories. From this base, we can then push the data into a single Excel (or similar) file, where additional processing can begin.
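If you prefer to script this step rather than use a web console, a call along these lines should work. This is only a minimal sketch based on AYLIEN's public REST API of the time; the credentials and the speech URL below are placeholders, not values from this project.

```python
# A minimal sketch of the extraction step against AYLIEN's Text API
# /entities endpoint; credentials and the speech URL are placeholders.
import json
import requests

API_URL = "https://api.aylien.com/api/v1/entities"
HEADERS = {
    "X-AYLIEN-TextAPI-Application-ID": "YOUR_APP_ID",
    "X-AYLIEN-TextAPI-Application-Key": "YOUR_APP_KEY",
}

def extract_entities(speech_url):
    """Have the API read the page at speech_url and return its entities."""
    response = requests.get(API_URL, headers=HEADERS, params={"url": speech_url})
    response.raise_for_status()
    entities = response.json().get("entities", {})
    # Keep only the three categories used in this project.
    return {cat: entities.get(cat, []) for cat in ("keyword", "location", "organization")}

result = extract_entities("https://example.org/first-speech-to-congress.html")
print(json.dumps(result, indent=2))
```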

To analyze this data set as a network graph, we need two files: a node file, and an edge file that defines the connections between nodes. Now recall the previously mentioned categories extracted by Aylien; these categories and their member text terms will comprise the bulk of our nodes. There is, however, one additional set of nodes I have yet to mention: the individual speakers who delivered each of the dozens of speeches we are set to analyze. They will be at the heart of the network graph, which will be what is known as a bipartite graph, where the speakers (each president) connect to nodes from the other three categories, but not to each other. Of course, some of the speakers will have shared terminology from their respective speeches, which will form indirect, second-level connections.

Once the data is in Excel, we can manipulate it using pivot tables, with the goal of summarizing the data. For example, the raw data will have many instances of each president, with one row for each keyword, location, or organization instance. In Excel, this detail can be summarized to create a single node for each distinct president, as well as for every distinct keyword, location, and organization. Once our data has been modified, both the node and edge tabs are exported as .csv files.
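For those who would rather script this step, below is a rough pandas equivalent of the pivot-table summarization. The input layout (one row per president/category/term instance) follows the description above, while the file and column names are my own assumptions; the Id, Source, and Target columns do match what Gephi's importer expects.

```python
# A rough pandas equivalent of the Excel pivot-table step: the raw file has
# one row per (president, category, term) instance; we collapse it into a
# distinct node list and an edge list. Column names are assumptions.
import pandas as pd

raw = pd.read_csv("sotu_raw.csv")  # columns: president, category, term

# Every president and every distinct term becomes a node.
presidents = pd.DataFrame({"Id": raw["president"].unique(), "Category": "speaker"})
terms = (raw[["term", "category"]]
         .drop_duplicates(subset="term")
         .rename(columns={"term": "Id", "category": "Category"}))
nodes = pd.concat([presidents, terms], ignore_index=True)

# Each president-to-term pairing becomes one edge.
edges = (raw.rename(columns={"president": "Source", "term": "Target"})
         [["Source", "Target"]]
         .drop_duplicates())

nodes.to_csv("nodes.csv", index=False)
edges.to_csv("edges.csv", index=False)
```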

After the respective .csv files have been exported, we turn our attention to Exploratory, where we will refine the text in multiple steps before starting the visualization stages. Exploratory is a powerful new tool built to leverage R through a handsomely designed front-end GUI, making it much easier for those with limited R skills (like myself) to access many useful packages and functions. Among these are text mining and text analysis functions we will use to refine our text files. At the end of these steps, we again output a pair of .csv files (one node, one edge) that will be ingested by Gephi.

In Exploratory, we will perform the following steps on our node file (a rough Python sketch of the same pipeline follows the list):

  1. Tokenize the text as individual words, as many of the keywords identified by Aylien were in the form of phrases
  2. Filter out stopwords such as "the", "is", and many other common terms that typically connect the meaningful words
  3. Convert all tokens to lower case, so the same words can be grouped regardless of their original case (ex: "America" versus "america").
  4. Group by token, so we can get aggregate counts for how many times terms were used across all speeches. This will later help us to size the nodes in Gephi.
  5. Remove some unnecessary fields related to the individual documents
  6. Get distinct counts by token, which will remove duplicate values in the node file
  7. Export the data to a new .csv file
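Exploratory drives these steps through its GUI, so there is no script to copy from this project; purely as an illustration, here is what the same pipeline could look like in pandas, with file and column names carried over from the earlier sketch (assumptions, not the project's actual schema):

```python
# An illustrative Python analogue of the Exploratory node-file steps:
# tokenize the term phrases, drop stopwords, lowercase, count, dedupe,
# and export.
import pandas as pd

STOPWORDS = {"the", "is", "of", "and", "a", "to", "in", "that", "for"}  # abbreviated

nodes = pd.read_csv("nodes.csv")

# Steps 1-3: split multi-word phrases into lowercase tokens, drop stopwords.
# (Speaker nodes are left out here; they would be appended back unchanged.)
tokens = (nodes.loc[nodes["Category"] != "speaker", "Id"]
          .astype(str)
          .str.lower()
          .str.findall(r"[a-z']+")
          .explode())
tokens = tokens[~tokens.isin(STOPWORDS)]

# Steps 4-6: aggregate counts per token and keep one row per distinct token.
token_counts = (tokens.value_counts()
                .rename_axis("Id")
                .reset_index(name="Count"))

# Step 7: export for Gephi; Count will later drive node sizing.
token_counts.to_csv("nodes_clean.csv", index=False)
```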

Here's what some of our steps look like in Exploratory (right-click on any images to see an enlarged view):

We then follow a similar process for the edges, making sure to convert to lower case so the nodes will align in Gephi. Otherwise, our earlier "America" and "america" example will yield two different values in our edges file, while the node file will contain only the lower case version.

Here are the primary steps we take on the edge file (again with a Python sketch after the list):

  1. Tokenize the text as individual words
  2. Filter out stopwords such as "the", "is", etc.
  3. Convert all tokens to lower case
  4. Export the data to a new .csv file
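And the matching pandas sketch for the edge side, under the same file-name assumptions:

```python
# The edge-file version of the same cleanup: tokenize and lowercase the
# term side of each edge so it lines up with the cleaned node file.
import pandas as pd

STOPWORDS = {"the", "is", "of", "and", "a", "to", "in", "that", "for"}  # abbreviated

edges = pd.read_csv("edges.csv")  # columns: Source (president), Target (term)

# Steps 1-3: split phrases into lowercase tokens and drop stopwords.
edges["Target"] = edges["Target"].astype(str).str.lower().str.findall(r"[a-z']+")
edges = edges.explode("Target")
edges = edges[~edges["Target"].isin(STOPWORDS)]

# Step 4: export for Gephi.
edges.to_csv("edges_clean.csv", index=False)
```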

Here's what our steps look like:

Great! We're now ready to load data into Gephi. Gephi makes it very simple to load data from a variety of formats, including .csv files, a variety of graph file formats, and databases such as MySQL. The first step is to load the node file; loading the edge file first would auto-populate the nodes and discard any additional variables we created in Excel or Exploratory. The data import process is quite simple, as shown in this screen:

Following the nodes import, we'll do the same for the edges file:

Once both nodes and edges have been imported, Gephi has the base data necessary for graph creation. Of course, as designers and analysts we also play an instrumental role in creating, shaping, and analyzing the network. After the import, Gephi will draw an initial graph of sorts, although it will require additional steps to become useful:

As you can see, Gephi has given the nodes seemingly random positions in a bounded space, which certainly doesn't tell us much about our network and its relationships. However, this will soon change as we select and apply a layout algorithm, then run a modularity process (clustering the nodes), and finally color and size the nodes based on selected attribute values.

Let's start with the layout algorithm. In general, network graphs will be optimized and analyzed using some sort of force-directed algorithm, which alternately pulls nodes together or pushes them apart depending on their similarities or differences. Thus, a well-designed network graph should represent similar values in close proximity to one another. Dissimilar nodes should not be close to one another, as it is likely their dissimilarity is based on connections (edges) that are routed to differing groups of nodes. For our graph, we'll use the Force Atlas 2 algorithm, a popular choice for drawing complex network graphs.
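Gephi runs Force Atlas 2 interactively rather than from code, so there is nothing to copy here from the actual project. Purely to illustrate the force-directed principle, here is a networkx sketch that loads the two cleaned CSVs (nodes first, so their attributes survive, as noted above) and applies a spring layout:

```python
# Illustration only: a force-directed (spring) layout over the cleaned
# CSVs using networkx. Gephi's Force Atlas 2 is a different and more
# capable algorithm; spring_layout simply demonstrates the same principle
# of pulling connected nodes together and pushing unrelated ones apart.
import pandas as pd
import networkx as nx

nodes = pd.read_csv("nodes_clean.csv")   # columns: Id, Count
edges = pd.read_csv("edges_clean.csv")   # columns: Source, Target

G = nx.Graph()
# Add nodes first (as with the Gephi import) so attributes like Count survive.
for _, row in nodes.iterrows():
    G.add_node(row["Id"], count=row["Count"])
# Any president appearing only in the edge file is created automatically.
G.add_edges_from(edges[["Source", "Target"]].itertuples(index=False, name=None))

positions = nx.spring_layout(G, seed=42)  # {node: (x, y)} coordinates
print(list(positions.items())[:5])
```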

After running Force Atlas 2 for a while - perhaps 15-20 minutes on my laptop (8 GB RAM) - the resulting graph has spread out quite a bit, and appears to have a story to tell.

Now there are clear groups that have formed, with a lot of delineation at the perimeter of the graph, and there's a reason for this pattern. In the Force Atlas 2 menu options, we elected to check "Dissuade Hubs", which pushes hubs toward the edges of the graph. The hubs in this case are the actual speakers (the presidents); since our focus is on the text terms, it is those we wish to see at the center of the graph. Let's have a better look at the same graph after some styling choices have been made:

Colors now represent the modularity class each node falls into, and node size is dictated by the number of inbound edges (the in-degree). This makes the most frequently used words the largest on the graph, quickly drawing our attention. Had we elected to use the total number of edges, or the number of outbound connections, the network would be visually dominated by the speakers rather than their words. At this point our graph is attractive, shows some clear patterns, and uses size and color to differentiate points. However, it is still a busy graph with a bit of the so-called hairball effect. There are at least two ways to overcome this: the first is to filter on certain attributes (such as modularity) to return only a portion of the total network; the second is to add interactivity, so that the viewer may explore the graph by filtering, panning, and zooming. Let's explore each of these options.

Filtering is a very useful tool for viewing portions of a dense or large graph network. In Gephi, there are many ways we can filter our graph; we'll opt for using modularity classes for this example. Each modularity class represents a cluster of nodes that in theory should be similar to one another in some respect, and are likely to be in close proximity in the graph space. Let's begin by opening the Partition folder, and dragging the Modularity Class attribute to the filter area, where we'll select classes 62 & 63, as they should be closely positioned on the graph. After refreshing the graph window, here's the result:

Now we have a better look at the words in these clusters, their level of importance based on the number of times they were used by different presidents, and how they interrelate with other terms. Even though our graph is bipartite (i.e., the words do not directly connect with one another), their proximity tells us something about their usage. In cases where terms are very close, it is highly likely they were each spoken by the same group of presidents, or at least a very similar group. The closeness of the terms "administration" and "nations" is one such example.

Let's view one more filtered example using the partition approach, looking at classes 53 & 54.

We now see a different section of the network, featuring new terms such as "navy" and "treasury". You might also note the proximity of the words "general" and "public" and suggest that the phrase "general public" may have been spoken by one or more presidents.
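As an aside for anyone who wants a scriptable analogue of this modularity filter, the python-louvain package offers a comparable community-detection method. A minimal sketch, assuming the same edge file as before (the class numbers it assigns will not match Gephi's 53/54 or 62/63 labels):

```python
# A scriptable analogue of Gephi's modularity filter, using python-louvain
# (pip install python-louvain; it imports as "community").
import pandas as pd
import networkx as nx
import community as community_louvain

edges = pd.read_csv("edges_clean.csv")
G = nx.from_pandas_edgelist(edges, source="Source", target="Target")

partition = community_louvain.best_partition(G)  # {node: class id}

# Keep only the nodes from two chosen classes, mirroring the Gephi filter.
wanted = {0, 1}  # arbitrary example classes
subgraph = G.subgraph(n for n, c in partition.items() if c in wanted)
print(subgraph.number_of_nodes(), "nodes in the filtered view")
```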

Let's view one more filter type on the graph, this time using an ego network. Ego networks are designed to help us see the connections of an individual node. Often we may wish to see only direct connections (depth = 1), while in other cases we may wish to view second- or third-degree connections to understand the network in greater detail. Given our bipartite network structure, only the first-degree connections carry much value, so our examples will start there. We'll begin by seeing how many connections exist for the word "freedom", by dragging the ego network filter from the Topology folder down to the filter area. Astonishingly (or not), the word freedom has been used in just one of these presidential speeches:

In case you're wondering, it was used by President Eisenhower, who referenced it 3 times. Let's see if we can do better - how about the term "Peru"? Surely, a distant South American country can't be mentioned more often than a supposedly core foundation of our country. Or can it?

Apparently, Peru is 10 times more important than freedom, judging by the content of these speeches. Oh well, freedom was a nice thought.
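If you want to double-check counts like these outside Gephi, the ego filter also has a direct networkx counterpart; a minimal sketch under the same file-name assumptions as before:

```python
# Reproducing the ego-network filter in networkx: pull the depth-1
# neighborhood around a single term such as "freedom".
import pandas as pd
import networkx as nx

edges = pd.read_csv("edges_clean.csv")
G = nx.from_pandas_edgelist(edges, source="Source", target="Target")

ego = nx.ego_graph(G, "freedom", radius=1)
# In a bipartite graph, the depth-1 neighbors of a word are exactly the
# presidents who used it.
print(sorted(n for n in ego.nodes if n != "freedom"))
```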

Now on to the interactivity segment, one of my favorite aspects of network graph analysis and visualization. To aid Gephi in this venture, we will employ sigma.js, a wonderful graph drawing tool suitable for small to midsize graphs. Sigma.js provides the ability to craft a web-based interactive network complete with search and filtering capabilities, as well as pan and zoom features. We also have the ability to customize the output with CSS and JavaScript edits, which results in a finished network like this:

If you wish to interact with the graph directly, simply follow this link:

http://visualidity.com/projects/SOTU/

One final note, in case you're wondering how the presidents are positioned in the graph. Are Trump and Obama close together? What about Reagan and Carter? The two Roosevelts? Suffice it to say there is a lot of clustering based on the era in which a president served, which makes a lot of sense when you think about it: not only were the issues likely similar, so was the language of the time. Nonetheless, you are likely to find some who cross eras to wind up close to someone from decades earlier or later. I suggest you peruse the graph and make these discoveries on your own.

Thanks for reading, and I hope you found this enlightening.


