Visualizing CEPA Education Data, Part 2

A couple of months back, I wrote about investigating CEPA academic achievement data, provided through the CEPA project at Stanford University (Sean F. Reardon, Demetra Kalogrides, Andrew Ho, Ben Shear, Kenneth Shores, Erin Fahle. (2016). Stanford Education Data Archive. http://purl.stanford.edu/db586ns4974). Finally, I've got part two, wherein I use Exploratory, the powerful R-based tool for data wrangling, analysis, and visualization. Exploratory is a great fit for folks like me who love the power of R but have not been immersed in the oftentimes complex world of R coding. The Exploratory front end makes using R a pleasure, as I hope this post will help illustrate.

Exploratory takes the essential R dataframe as a starting point, and then allows you to easily manipulate your data using a multitude of R packages included with the base Exploratory install. This post will walk through some simple analysis using CEPA data within Exploratory. We'll begin by setting the stage with a screenshot of the Exploratory workspace.

On the left side are the dataframes belonging to this project, while the center is dedicated to the primary workspace where all tables and charts will be viewed. Off to the right is where any actions are stored - filters, functions, and so on that act upon the dataframe. In this case, we're viewing the Summary tab for a particular dataframe. We can also look at a tabular view of the data by clicking the Table icon:

Finally, we have the option to see our data in visual form by selecting the Viz tab icon:

Note that there are multiple chart options as well as pivot table capability. In this case, I selected a box plot with grade levels on the x axis and average grade level performance for math on the y axis. With this quick overview, the data is easy to comprehend, as we see the distribution by grade for the entire dataset. As expected, Grade 4 has higher average scores than Grade 3, and so on up through Grade 8. At a very high level, the trend is linear: typical scores rise steadily from one grade to the next. What is interesting, however, is the significant overlap in the distribution of scores. We see the bottom of the Interquartile Range (IQR) for Grade 8 actually overlapping with the Grade 4 distribution. In other words, it appears that 4th grade students in many school districts test at higher grade levels than 8th grade students in many other districts. This provides an interesting angle we could pursue further.
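For readers who want to see what the box plot is summarizing, here is a minimal sketch in plain Python. The scores are made up for illustration and are not CEPA values, and Exploratory may compute quartiles with a slightly different method than Python's `statistics.quantiles`. It computes the first and third quartiles per grade, then checks how many Grade 4 districts land above the bottom of the Grade 8 box.

```python
from statistics import quantiles

# Made-up (grade, average math score in grade-level units) pairs, one per district.
# These are illustrative values, not figures from the CEPA data.
rows = [
    (4, 2.9), (4, 3.6), (4, 4.1), (4, 4.8), (4, 6.4),
    (8, 5.2), (8, 6.9), (8, 7.8), (8, 8.6), (8, 9.9),
]

def iqr_by_grade(rows):
    """Return {grade: (q1, q3)} using statistics.quantiles with n=4."""
    scores_by_grade = {}
    for grade, score in rows:
        scores_by_grade.setdefault(grade, []).append(score)
    result = {}
    for grade, scores in scores_by_grade.items():
        q1, _median, q3 = quantiles(scores, n=4)
        result[grade] = (q1, q3)
    return result

iqr = iqr_by_grade(rows)
grade8_q1 = iqr[8][0]

# Grade 4 districts that score above the bottom of the Grade 8 box.
grade4_above = [score for grade, score in rows if grade == 4 and score > grade8_q1]

for grade in sorted(iqr):
    q1, q3 = iqr[grade]
    print(f"Grade {grade}: IQR {q1:.2f} to {q3:.2f}")
print(f"Grade 4 districts above the Grade 8 first quartile: {len(grade4_above)}")
```

A non-empty `grade4_above` list is the kind of overlap described above: some 4th grade districts outscoring the lower quarter of 8th grade districts.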

Next, let's use a filter to narrow the dataset down. In this instance, we will return data for only Michigan ("MI") and Massachusetts ("MA") for the 2013 test year. Now the same box plot takes on a slightly different appearance:

Given that our data has been reduced to just two states, let's find a better way to compare their test results. This can be easily done by changing the color setting to be based on the stateabb field in the dataset. Now we get a much clearer picture of differences between the two states:

The box plots tell us that Massachusetts test results are significantly higher on average compared to Michigan's scores. If we were social scientists or education consultants, we might want to learn why scores differ so greatly at a state level. However, for the purposes of this article, we'll continue on our path to understanding the data at a deeper level. To do this, we'll first use one of the great new features in Exploratory - small multiples. This capability allows us to create individual charts based on values within a specific variable, using the Repeat By option. In this case, we'll use stateabb again and switch our plot to a scatter chart, yielding the following charts:

These confirm our initial impression that test scores are almost uniformly higher in Massachusetts. Let's filter the data once more by selecting Grade 8, so we can do a more direct comparison. Here's what our filter settings look like:
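Outside the Exploratory UI, the filters applied so far (state, test year, and grade) boil down to a simple row predicate. The sketch below is plain Python over made-up rows; `stateabb` is the field name mentioned in this post, while `year`, `grade`, `district`, and `math` are assumed names used only for illustration.

```python
# Made-up rows for illustration; only `stateabb` is a field name taken from the post.
districts = [
    {"stateabb": "MI", "year": 2013, "grade": 8, "district": "District A", "math": 7.4},
    {"stateabb": "MA", "year": 2013, "grade": 8, "district": "District B", "math": 9.8},
    {"stateabb": "MA", "year": 2012, "grade": 8, "district": "District B", "math": 9.5},
    {"stateabb": "MI", "year": 2013, "grade": 4, "district": "District A", "math": 3.9},
    {"stateabb": "OH", "year": 2013, "grade": 8, "district": "District C", "math": 8.1},
]

def keep(row, states=("MI", "MA"), year=2013, grade=8):
    """True when a row passes all three filters: state, test year, and grade."""
    return (
        row["stateabb"] in states
        and row["year"] == year
        and row["grade"] == grade
    )

filtered = [row for row in districts if keep(row)]

for row in filtered:
    print(row["stateabb"], row["district"], row["math"])
```

Keeping the filter values as default arguments makes it easy to rerun the same comparison for other states or test years.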

We'll also add a Top N setting using the tab above the workspace, and set it to retrieve the top 40 values. This will show us the top performing districts across both states, and may provide further insight into performance gaps. With our reduced dataset dimensionality, we can now use bar charts to display the data, as seen here:

The results are rather stunning - only 3 of the top 40 school districts are in Michigan. Clearly, 8th grade students in many Massachusetts districts are far ahead of their Michigan peers in true grade level math performance. We can see a handful of cases where districts are performing at 12th grade levels, while the very best Michigan districts are closer to 10th grade level scores. The fact that Michigan has far more districts than Massachusetts makes the results even more compelling; based on sheer quantity, we might have expected more than three Michigan districts to land in the top 40. Perhaps Michigan districts will fare better in English Language Arts (ELA) scores. To find out, we simply swap the math field for the ELA field on the y axis. Now we see the top 40 districts for ELA:

Wow! Now there's only one Michigan district in the top 40. Based on our quick exploration of the data, it seems clear that Massachusetts is doing something at a statewide level that is yielding far better results than Michigan is achieving. At this point, we could go even deeper into the data to understand any other variables that may be influencing the results, but that's best left for future analysis. Suffice it to say, even when I looked to weight things based on a variety of socioeconomic variables found in the CEPA data, Michigan lagged well behind. Perhaps we need to see how it compares to other states in the same geographic region to get a better comparison.
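The Top N step and the math-to-ELA swap used above can be sketched in a few lines of plain Python. The rows here are made up and much smaller than the real data (a top 3 instead of a top 40), and the `district`, `math`, and `ela` keys are assumed names; only `stateabb` comes from the dataset as described in this post.

```python
from collections import Counter

# Made-up Grade 8 rows for illustration, not CEPA values.
rows = [
    {"stateabb": "MA", "district": "District A", "math": 11.8, "ela": 11.2},
    {"stateabb": "MA", "district": "District B", "math": 10.9, "ela": 10.7},
    {"stateabb": "MA", "district": "District C", "math": 9.1, "ela": 10.1},
    {"stateabb": "MI", "district": "District D", "math": 10.2, "ela": 9.6},
    {"stateabb": "MI", "district": "District E", "math": 8.7, "ela": 8.9},
    {"stateabb": "MI", "district": "District F", "math": 7.9, "ela": 8.1},
]

def top_n_by_state(rows, field, n):
    """Count how many of the n highest-scoring rows on `field` fall in each state."""
    top = sorted(rows, key=lambda row: row[field], reverse=True)[:n]
    return Counter(row["stateabb"] for row in top)

# Swapping subjects is just a change of the field name.
math_counts = top_n_by_state(rows, "math", 3)
ela_counts = top_n_by_state(rows, "ela", 3)

print("Top 3 by math:", dict(math_counts))
print("Top 3 by ELA:", dict(ela_counts))
```

Counting the top rows per state gives the same "how many of the top 40 are in Michigan" number that the bar charts show visually.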

That's it for now. I hope this has given you a small glimpse into the power and ease of use of Exploratory. Personally, I'm looking forward to some more data wrangling using all kinds of datasets ranging from baseball to economic data, and even some datasets where text analysis is the goal. Thanks for reading!
