July 7th 2008
Comparing weblog text to the PhD dissertation via tagclouds
About a year ago I looked for tools to find similarity between two texts (weblog and papers). I wanted a relatively objective way to judge how much of my weblog writing ends up in the dissertation.
Among other things, I experimented with generating and comparing tagclouds from texts that were supposed to correspond to each other. I tried several tools, but ended up with tagCrowd, since it allowed using both generic and custom-made lists of stop words.
As an experiment I used the text of five dissertation chapters (draft versions as of April 17, 2008) and the text of the blog posts coded as corresponding to those chapters to generate a visualisation of the most frequent words in each case. After removing stop words (general English plus those from my own list that I was stupid enough not to save), the 65 most frequent words are visualised.
For example, the two tagclouds below are from the blogposts related to the Microsoft study and from the draft chapter with its results.
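The counting step behind such a tagcloud can be sketched roughly as follows. This is only an illustration of the general idea (lowercase the text, drop stop words, keep the most frequent words), not tagCrowd's actual implementation; the function name `top_words`, the tokenising rule and the tiny stop word list are all assumptions made for the example.

```python
import re
from collections import Counter

def top_words(text, stop_words, n=65):
    """Return the n most frequent non-stop words as (word, count) pairs.

    Hypothetical helper: the tokenisation is a simple assumption
    (lowercase letters, optionally with one inner apostrophe).
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

# Illustrative stop word list; a real one would be much longer.
stop_words = {"the", "and", "of", "a", "in", "to"}

sample = "The weblog and the blogging of a weblog in the dissertation"
print(top_words(sample, stop_words, n=3))
```

Running the same function over a chapter draft and over the blog posts coded for that chapter would give the two word lists that a pair of tagclouds visualises.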


In total I had 5 pairs of visualisations. I then mixed them and asked five people familiar with my research (supervisors and collaborators) and eight students (taking a class with Anjo) to find matching pairs. The results are below.
| | Total pairs | Correctly matched pairs | Correctly matched pairs, % |
|---|---|---|---|
| Chapter 1. Introduction | 13 | 10 | 77% |
| Chapter 2. Methodology | 13 | 11 | 85% |
| Chapter 3. Ideas | 13 | 6 | 46% |
| Chapter 4. Conversations | 13 | 10 | 77% |
| Chapter 5. Microsoft | 13 | 9 | 69% |
| Total | 65 | 46 | 71% |
| by people familiar with the research | 25 | 20 | 80% |
| by people not familiar with the research | 40 | 26 | 65% |
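For anyone who wants to check the arithmetic, the percentages in the table can be recomputed from the raw counts. The sketch below only uses the numbers from the table; the variable and function names are made up for the example, and the percentages are rounded to whole numbers as in the table.

```python
# (total pairs, correctly matched pairs) per chapter, taken from the table.
chapters = {
    "Chapter 1. Introduction": (13, 10),
    "Chapter 2. Methodology": (13, 11),
    "Chapter 3. Ideas": (13, 6),
    "Chapter 4. Conversations": (13, 10),
    "Chapter 5. Microsoft": (13, 9),
}

def percent(correct, total):
    """Share of correctly matched pairs, rounded to a whole percent."""
    return round(100 * correct / total)

for name, (total, correct) in chapters.items():
    print(f"{name}: {correct}/{total} = {percent(correct, total)}%")

# Overall score: 13 participants x 5 pairs each.
total_pairs = sum(total for total, _ in chapters.values())
total_correct = sum(correct for _, correct in chapters.values())
print(f"Total: {total_correct}/{total_pairs} = {percent(total_correct, total_pairs)}%")

# Split by group: 5 people familiar with the research, 8 students.
print(f"Familiar: {percent(20, 5 * 5)}%")
print(f"Not familiar: {percent(26, 8 * 5)}%")
```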
Some comments:
- I guess there is a connection between PhD chapters and blogposts :)
- The high score for the methodology chapter is explained by its qualitative difference from the rest of the dissertation.
- The low score for Chapter 3 (Ideas) is explained by the fact that the coding of weblog entries in relation to chapters was done prior to writing it. As a result it included many “might be relevant” posts, while for other chapters the focus was clearer. In addition, the draft version of the chapter used to generate the visualisation was the first draft, while in other cases the drafts had been revised several times.
It was nice to see that although many of the visualisations looked similar (with blogging and weblog being big ;) it was actually possible to match the pairs. But the nicest thing was simply making all those pictures, laying them on the floor and thinking that I actually had some version of 5 chapters out of the 7 :)


