July 7th 2008 03:03 pm
Comparing weblog text to the PhD dissertation via tagclouds
About a year ago I looked for Tools to find similarity between two texts (weblog and papers) - I wanted to find a relatively objective way to judge how much of my weblog writing ends up in the dissertation.
Among other things, I experimented with generating and comparing tagclouds from texts that were supposed to correspond to each other. I tried several tools, but ended up with TagCrowd since it allowed using both generic and custom-made lists of stop words.
As an experiment I used the text of five dissertation chapters (draft versions as of April 17, 2008) and the text of blog posts coded as corresponding to those chapters to generate a visualisation of the most frequent words in each case. After removing stop words (general English plus those from my own list that I was stupid enough not to save), the 65 most frequent words were visualised.
For example, the two tagclouds below are from the blog posts related to the Microsoft study and from the draft chapter with its results.
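The counting step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not what TagCrowd does internally: the function name `top_words`, the tiny stop word list and the tokenising rule are all assumptions made for the example (my actual stop word list is lost).

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; the original list was not saved).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "i", "my"}

def top_words(text, stop_words=STOP_WORDS, n=65):
    """Return the n most frequent words in text as (word, count), ignoring stop words."""
    # Lowercase the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

sample = "Blogging about my research: the weblog is a place to think, and blogging helps."
print(top_words(sample, n=3))
```

Running the same function over a chapter draft and over the blog posts coded for that chapter would give the two word lists that a tagcloud then draws with font size proportional to the count.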


In total I had 5 pairs of visualisations. I then mixed them and asked five people familiar with my research (supervisors and collaborators) and eight students (taking a class with Anjo) to find matching pairs. The results are below.
| | Total pairs | Correctly matched pairs | Correctly matched pairs, % |
|---|---|---|---|
| Chapter 1. Introduction | 13 | 10 | 77% |
| Chapter 2. Methodology | 13 | 11 | 85% |
| Chapter 3. Ideas | 13 | 6 | 46% |
| Chapter 4. Conversations | 13 | 10 | 77% |
| Chapter 5. Microsoft | 13 | 9 | 69% |
| Total | 65 | 46 | 71% |
| by people familiar with the research | 25 | 20 | 80% |
| by people not familiar with the research | 40 | 26 | 65% |
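For those who want to check the arithmetic, here is a small Python sketch that recomputes the percentages from the counts in the table above. The chance baseline at the end is my own addition and rests on an assumption: that each person made a one-to-one matching of the five blog clouds to the five chapter clouds, so that random guessing can be modelled as a random permutation. The names `match_rate` and `chance_rate` are made up for this example.

```python
from itertools import permutations

# (correctly matched, total) counts copied from the table above.
results = {
    "Chapter 1. Introduction": (10, 13),
    "Chapter 2. Methodology": (11, 13),
    "Chapter 3. Ideas": (6, 13),
    "Chapter 4. Conversations": (10, 13),
    "Chapter 5. Microsoft": (9, 13),
    "Total": (46, 65),
    "Familiar with the research": (20, 25),
    "Not familiar with the research": (26, 40),
}

def match_rate(correct, total):
    """Share of correctly matched pairs, as a whole-number percentage."""
    return round(100 * correct / total)

for label, (correct, total) in results.items():
    print(f"{label}: {correct}/{total} = {match_rate(correct, total)}%")

def chance_rate(n_pairs):
    """Average share of correct matches if the pairing were a random permutation.

    Assumes a one-to-one matching task; enumerates every possible pairing.
    """
    perms = list(permutations(range(n_pairs)))
    correct = sum(1 for perm in perms for i, p in enumerate(perm) if i == p)
    return correct / (len(perms) * n_pairs)

print(f"Chance baseline for 5 pairs: {chance_rate(5):.0%}")
```

Under that assumption random matching gets one pair in five right on average, so even the weakest chapter was matched well above chance.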
Some comments:
- I guess there is a connection between PhD chapters and blogposts :)
- The high score for the methodology chapter is explained by its qualitative difference from the rest of the dissertation.
- The low score for the Ideas chapter (Chapter 3) is explained by the fact that the coding of weblog entries in relation to chapters was done prior to writing it. As a result it included many “might be relevant” posts, while for other chapters the focus was clearer. In addition, the draft version of the chapter used to generate the visualisation was the first draft, while in other cases the drafts had been revised several times.
It was nice to see that although many of the visualisations looked similar (with blogging and weblog being big ;) it was actually possible to match the pairs. But the nicest thing was simply making all those pictures, laying them on the floor and thinking that I actually had some version of 5 chapters out of the 7 :)
3 Comments »
Jason Priem on 10 Jul 2008 at 2:51 #
Entirely awesome idea! There are plenty of algorithms out there that attempt to determine the similarity between two documents (plagiarism detection is one application among many; turnitin is an example of this). The tricky thing is that “similarity” is a pretty slippery concept. Getting raters to read both documents would be the gold standard, but it’s really slow. Machine comparison is fast, but it’s hard to get the algorithm perfect.
It would be interesting to do a three-way test: readers who read both documents, readers who read just tag clouds, and one or more programs. I think that your tag cloud method may be a good happy medium between quick-but-dumb machine comparisons and smart-but-slow human-reading. You still get access to those oh-so-human-gestalts of the documents, but visualization saves you a load of time.
Lilia Efimova on 10 Jul 2008 at 3:27 #
Tag clouds on the move « Making CommunitySense on 03 Aug 2008 at 9:07 #
[...] cloud for various purposes, but to compare tag clouds. Lilia Efimova gives a nice example of how she compared the tag clouds of her blog posts and a dissertation chapter on the same topic. Another comparison is to see how different tag cloud tools process the same text. Here’s the [...]