Comparing weblog text to the PhD dissertation via tagclouds

About a year ago I looked for Tools to find similarity between two texts (weblog and papers) – I wanted to find a relatively objective way to judge how much of my weblog writing ends up in the dissertation.

Among other things, I experimented with generating and comparing tagclouds from texts that were supposed to correspond to each other. I tried several tools but ended up with TagCrowd, since it allowed using both generic and custom-made lists of stop words.

As an experiment, I used the text of five dissertation chapters (draft versions as of April 17, 2008) and the text of the blog posts coded as corresponding to those chapters to generate a visualisation of the most frequent words in each case. After removing stop words (general English plus those from my own list, which I was stupid enough not to save), the 65 most frequent words are visualised.
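For anyone who wants to reproduce the counting step without TagCrowd, here is a minimal sketch of the idea: lowercase the text, drop stop words, keep the most frequent words. The stop list and the function name `top_words` are my own illustrative assumptions, not TagCrowd's actual list or API.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop list. TagCrowd's built-in English list
# and my own custom list are not reproduced here.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def top_words(text, stop_words=STOP_WORDS, n=65):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    # Split into lowercase alphabetic tokens; punctuation and digits are dropped.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


sample = "Blogging is a conversation and the weblog is the place of blogging"
print(top_words(sample, n=3))
```

A tagcloud then simply scales each word's font size by its count; that rendering step is left out here.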

For example, the two tagclouds below are from the blogposts related to the Microsoft study and from the draft chapter with its results.
[Image: Tagcrowd: blogposts related to chapter 6 (Microsoft)]
[Image: Tagcrowd: current draft chapter 6 (Microsoft)]

In total I had 5 pairs of visualisations. I then mixed them and asked five people familiar with my research (supervisors and collaborators) and eight students (taking a class with Anjo) to find matching pairs. The results are below.

| | Total pairs | Correctly matched pairs | Correctly matched pairs, % |
|---|---|---|---|
| Chapter 1. Introduction | 13 | 10 | 77% |
| Chapter 2. Methodology | 13 | 11 | 85% |
| Chapter 3. Ideas | 13 | 6 | 46% |
| Chapter 4. Conversations | 13 | 10 | 77% |
| Chapter 5. Microsoft | 13 | 9 | 69% |
| Total | 65 | 46 | 71% |
| by people familiar with the research | 25 | 20 | 80% |
| by people not familiar with the research | 40 | 26 | 65% |
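The percentages in the table can be rechecked with a few lines of Python. This is just a sketch of the arithmetic: the counts are copied from the table, while the variable and function names are made up for illustration.

```python
# Counts copied from the table: (total pairs, correctly matched pairs).
by_chapter = {
    "Chapter 1. Introduction": (13, 10),
    "Chapter 2. Methodology": (13, 11),
    "Chapter 3. Ideas": (13, 6),
    "Chapter 4. Conversations": (13, 10),
    "Chapter 5. Microsoft": (13, 9),
}
by_group = {
    "familiar with the research": (25, 20),      # 5 people x 5 pairs
    "not familiar with the research": (40, 26),  # 8 students x 5 pairs
}


def percent(total, correct):
    """Share of correctly matched pairs, rounded to a whole percent."""
    return round(100 * correct / total)


total_pairs = sum(t for t, _ in by_chapter.values())
total_correct = sum(c for _, c in by_chapter.values())

for name, (total, correct) in {**by_chapter, **by_group}.items():
    print(f"{name}: {percent(total, correct)}%")
print(f"Total: {total_correct}/{total_pairs} = {percent(total_pairs, total_correct)}%")
```

The two group rows add up to the same totals as the chapter rows (25 + 40 = 65 pairs, 20 + 26 = 46 correct), which is a quick sanity check on the table.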

Some comments:

  • I guess there is a connection between PhD chapters and blogposts 🙂
  • The high score for the methodology chapter is explained by its qualitative difference from the rest of the dissertation.
  • The low score for the Ideas chapter (Chapter 3) is explained by the fact that the coding of weblog entries in relation to chapters was done prior to writing it. As a result it included many “might be relevant” posts, while for other chapters the focus was clearer. In addition, the draft version of the chapter used to generate the visualisation was the first draft, while in other cases the drafts had been revised several times.

[Image: Tagcrowds: current state of the dissertation]

It was nice to see that although many of the visualisations looked similar (with blogging and weblog being big 😉), it was actually possible to match the pairs. But the nicest thing was simply making all those pictures, laying them on the floor and thinking that I actually had some version of 5 chapters out of the 7 🙂

3 comments
  • Jason Priem July 10, 2008, 02:51

    Entirely awesome idea! There are plenty of algorithms out there that attempt to determine the similarity between two documents (plagiarism detection is one application among many; Turnitin is an example of this). The tricky thing is that “similarity” is a pretty slippery concept. Getting raters to read both documents would be the gold standard, but it’s really slow. Machine comparison is fast, but it’s hard to get the algorithm perfect.

    It would be interesting to do a three-way test: readers who read both documents, readers who read just tag clouds, and one or more programs. I think that your tag cloud method may be a good happy medium between quick-but-dumb machine comparisons and smart-but-slow human-reading. You still get access to those oh-so-human-gestalts of the documents, but visualization saves you a load of time.

  • Lilia Efimova July 10, 2008, 15:27

    It’s interesting how a tag cloud represents something about the text in the spaces between words, isn’t it? 🙂
