Topic Modeling the Panama Papers, One Way Or Another

If there is any major news story in recent times that lends itself to an extreme number of takes, angles, and approach vectors, it's probably the Panama Papers. Although it never became the ground-breaking, firestorm-inducing bombshell that many people thought it could be when the news first broke, the sheer scale of the hack that exposed documents related to the Panamanian law firm and shell company seller Mossack Fonseca — literal terabytes of data and millions of files, according to Süddeutsche Zeitung, the news outlet that broke the story — lends itself to being written about from a lot of different viewpoints. That's more data than I have stored on most, if not all, of the computing devices I own — and unlike my hard drives, which are mostly filled with video games and applications, this is mostly e-mails, database files, and other documents. That's a lot of data to look at, and it's not surprising that you could write stories about the leak with many unique and different focuses.

This makes the story very well suited to topic modeling, as we found out in our last workshop in HST 251 (“Doing Digital History”). Topic modeling can seem like a very complicated process, at least when it's done by a computer, but I'll try to explain it as simply and quickly as I can. Topic modeling is essentially the process of determining the topics and themes that a particular text focuses on. The average person can do that pretty easily, but modern computers don't have the understanding of language and context that a person does, so they have to rely on techniques like finding groups of words that frequently appear near each other. The algorithms involve a lot more than that, and I'm definitely not the person to lecture on the details, but suffice it to say that a computer can find word groups (topics) easily enough and can analyze large bodies of text to determine which topics apply most to each specific text. If you want a much better explanation, you can read this introduction by Megan Brett for more information.
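Since the tool we used hides the mechanics, here is a minimal sketch of what a run like ours does under the hood, written with scikit-learn's LDA implementation in Python. To be clear, this is not the Topic Modeling Tool we used in class, and the folder path is a placeholder; it's just an illustration of the word-grouping idea described above.

```python
# Minimal topic modeling sketch using scikit-learn's LDA implementation.
# Assumes a folder of plain-text articles (the path is a placeholder).
import glob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

files = sorted(glob.glob("panama_articles/*.txt"))
documents = [open(f, encoding="utf-8").read() for f in files]

# Turn each article into word counts, dropping very common English words.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Ask the model for 20 topics (groups of words that tend to appear together).
lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per article, one column per topic

# Show the ten most heavily weighted words in each topic.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[::-1][:10]]
    print(f"Topic {i}: {', '.join(top)}")
```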

To understand how topic modeling works, we attempted to imitate a topic modeling algorithm for a few articles about the Panama Papers, and compared our results to the results of a topic modeling run over a larger group of articles (including the ones we topic modeled) using the Topic Modeling Tool (TMT). For each article assigned to us, we tried to pick groups of words that were used more frequently in our article than in the corpus of articles as a whole. I was assigned two articles from Wordfence, the developers of a security plugin for WordPress, which can be found here and here.
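As a side note, that hand exercise roughly amounts to comparing relative word frequencies. Here's a small Python sketch of the idea, assuming a folder of plain-text articles; the file names are placeholders, and this is a simplification rather than anything the Topic Modeling Tool actually computes.

```python
# Rough sketch of the hand exercise: find words that show up relatively more
# often in one article than in the corpus as a whole. File names are placeholders.
import glob
import re
from collections import Counter

def relative_freqs(text):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

article_text = open("panama_articles/wordfence_1.txt", encoding="utf-8").read()
corpus_text = " ".join(
    open(f, encoding="utf-8").read() for f in glob.glob("panama_articles/*.txt")
)

article = relative_freqs(article_text)
corpus = relative_freqs(corpus_text)

# Rank words by how much more frequent they are in the article than overall.
distinctive = sorted(article, key=lambda w: article[w] / corpus.get(w, 1e-9), reverse=True)
print(distinctive[:20])
```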

As you might expect considering the source of these articles, they were both focused on how, exactly, the hack took place — according to both articles, the hack was made possible by vulnerabilities in other WordPress plugins and other web technologies. As such, I found that both articles featured words related to WordPress as well as words that come up generally in regard to hacking (e.g. data, server, network, etc.). Since the words all seemed closely related, I didn't divide them into more than one topic. You can see my results in the image below.

[Image: topicmodeling]
Note that the second document is called “Doc 3” — this is because it was the third document within the total corpus of 40 articles on the Panama Papers.

When I compared my results to the results of our 20-topic run with the Topic Modeling Tool, there was some similarity, but TMT actually got a bit more sophisticated than I did. Both of my assigned articles had two topics applied to them. Those topics contained many of the same words as my single topic, but TMT broke them up into two categories: a topic containing general words that anyone might associate with hacking and computers, such as “web”, “systems”, or “hacker”, and another topic of more specific terminology for this hack, like “drupal”, “wordpress”, and “client”. To my eye, these categories break down into the terminology a more general description of the hack would use for people less aware of the technology (“basics”) and the terminology that explains the details of the hack to someone familiar with WordPress and the ins and outs of running a website (“jargon”). Although I could have easily made this distinction, I didn't recognize it, and as a result TMT provided some clarity in the data that I did not. One other thing I noticed was that in my assessment the two articles had very similar levels of technical jargon, but according to TMT the first article had a much higher percentage of the “jargon” topic than the third article. It is hard to tell whether that is meaningful or just a strange consequence of how the categories break down, but it is interesting nonetheless — you can see the differences in percentages in the slideshow below.

[Slideshow: topic percentages for the two articles from the 20-topic run]
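For what those percentages mean in model terms, here is a short continuation of the earlier scikit-learn sketch that prints each article's topic proportions. It reuses the `files` and `doc_topics` variables from that sketch, and the output labels are illustrative rather than TMT's actual results.

```python
# Continuing the earlier sketch: each row of doc_topics is one article, and
# each column is the share of that article assigned to a given topic.
for name, proportions in zip(files, doc_topics):
    top = proportions.argsort()[::-1][:2]  # the two largest topics for this article
    shares = ", ".join(f"topic {t}: {proportions[t]:.0%}" for t in top)
    print(f"{name} -> {shares}")
```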

We also examined a 40-topic run with TMT, which produced similar results. It found three topics within my assigned articles, one of which seemed to be an analogue to the “basics” topic I described above, with the others seeming to divide up the “jargon” topic into two separate topics. The distinction between these two levels of “jargon” wasn’t as stark as the divide in the first run, and as such I found 40 topics a bit less useful. However, I wouldn’t be surprised if examining the full topic word lists changed my mind in some way — after all, it’s only showing a sample of the words in the topic.

After getting some first-hand experience with topic modeling techniques, I'm curious about the difference one could make by tweaking a few variables in a project like Quantifying Kissinger or Mining the Dispatch. Considering the size of the data set each project is working with, I'm not surprised that I felt a little overwhelmed by some of the diagrams that were shown — it was too much data for me to comprehend without spending a significant amount of time examining it. Taking into account the difference that switching between 20 and 40 topics made in our experiment, I wonder if I would even recognize the connections and trends in Kissinger's memos and telephone conversations if there were five more or five fewer topics allowed. I'm also curious about how one would even begin to decide on a number of topics — I feel like I'd fiddle with the exact number of topics and worry about which one gave the best insight into the data way too much.
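For what it's worth, one rough way to approach that question programmatically (continuing the earlier scikit-learn sketch, and not how the Topic Modeling Tool or those projects actually decide) is to fit models with several different topic counts and compare a score such as perplexity, where lower is generally better:

```python
from sklearn.decomposition import LatentDirichletAllocation

# "counts" is the word-count matrix built in the earlier sketch.
# Perplexity is roughly "how surprised the model is" by the data; in practice
# people also just read the resulting topics and judge them by eye.
for n in (10, 20, 30, 40):
    lda = LatentDirichletAllocation(n_components=n, random_state=0)
    lda.fit(counts)
    print(f"{n} topics -> perplexity: {lda.perplexity(counts):.1f}")
```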

Despite those thoughts, I was still very surprised and excited to see how well the computer found and analyzed topics in a corpus of texts. TMT did a better job than I did, and even setting aside my limited time and experience, it can analyze text at a much faster rate than I ever could. Projects of this scale would clearly be very hard in a non-computerized world, and events like the Panama Papers suggest that many of the data sets worth analyzing will only get bigger as time goes on and we learn to write more things down. Topic modeling is one of the most interesting tools we've examined so far in HST 251, and I hope to apply it to digital history on my own later down the line.
