July 7th 2008

Comparing weblog text to the PhD dissertation via tagclouds

About a year ago I looked for Tools to find similarity between two texts (weblog and papers) - I wanted to find a relatively objective way to judge how much of my weblog writing ends up in the dissertation.

Among other things, I experimented with generating and comparing tagclouds from texts that were supposed to correspond to each other. I tried several tools, but ended up with TagCrowd since it allowed using both generic and custom-made lists of stop words.

As an experiment I used the text of five dissertation chapters (draft versions as of April 17, 2008) and the text of the blog posts coded as corresponding to those chapters to generate a visualisation of the most frequent words in each case. After removing stop words (general English plus those from my own list that I was stupid enough not to save), the 65 most frequent words are visualised.
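The counting step behind such a tagcloud can be sketched in a few lines of Python. This is only an illustration of the idea, not how TagCrowd itself works: the function name `top_words`, the tiny stop list and the sample sentence are all made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); a real run would use a full
# English list plus a custom, domain-specific one.
GENERIC_STOP_WORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "for", "on", "with", "as",
}


def top_words(text, custom_stop_words=(), n=65):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    stop = GENERIC_STOP_WORDS | {w.lower() for w in custom_stop_words}
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop and len(w) > 1)
    return counts.most_common(n)


sample = "Blogging about blogging: a weblog is a place for ideas, and ideas grow in a weblog."
print(top_words(sample, custom_stop_words=["about"], n=3))
```

Running the same function over a chapter draft and over the blog posts coded for that chapter would give the two word lists that each pair of tagclouds visualises.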

For example, the two tagclouds below are from the blogposts related to the Microsoft study and from the draft chapter with its results.
[Image: Tagcrowd: blogposts related to chapter 6 (Microsoft)]
[Image: Tagcrowd: current draft chapter 6 (Microsoft)]

In total I had 5 pairs of visualisations. I then mixed them and asked five people familiar with my research (supervisors and collaborators) and eight students (taking a class with Anjo) to find matching pairs. The results are below.

                                          | Total pairs | Correctly matched pairs | Correctly matched pairs, %
Chapter 1. Introduction                   | 13          | 10                      | 77%
Chapter 2. Methodology                    | 13          | 11                      | 85%
Chapter 3. Ideas                          | 13          | 6                       | 46%
Chapter 4. Conversations                  | 13          | 10                      | 77%
Chapter 5. Microsoft                      | 13          | 9                       | 69%
Total                                     | 65          | 46                      | 71%
  by people familiar with the research    | 25          | 20                      | 80%
  by people not familiar with the research | 40          | 26                      | 65%
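For anyone who wants to check the arithmetic, the percentages follow directly from the counts: 13 participants (five familiar with the research plus eight students) each matched five pairs. A small sketch, with variable names of my own choosing:

```python
# Correctly matched pairs per chapter, out of 13 participants each.
correct_by_chapter = {
    "Introduction": 10,
    "Methodology": 11,
    "Ideas": 6,
    "Conversations": 10,
    "Microsoft": 9,
}
participants = 13  # 5 people familiar with the research + 8 students


def percent(correct, total):
    """Share of correct matches as a whole-number percentage."""
    return round(100 * correct / total)


for chapter, correct in correct_by_chapter.items():
    print(f"{chapter}: {correct}/{participants} = {percent(correct, participants)}%")

total_correct = sum(correct_by_chapter.values())
total_pairs = participants * len(correct_by_chapter)
print(f"Total: {total_correct}/{total_pairs} = {percent(total_correct, total_pairs)}%")
print(f"Familiar: 20/25 = {percent(20, 25)}%, not familiar: 26/40 = {percent(26, 40)}%")
```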

Some comments:

  • I guess there is a connection between PhD chapters and blogposts :)
  • The high score for the methodology chapter is explained by its qualitative difference from the rest of the dissertation.
  • The low score for the ideas chapter is explained by the fact that the coding of weblog entries in relation to chapters was done prior to writing it. As a result it included many “might be relevant” posts, while for other chapters the focus was clearer. In addition, the draft version of the chapter used to generate the visualisation was the first draft, while in other cases those were revised several times.

[Image: Tagcrowds: current state of the dissertation]

It was nice to see that although many of the visualisations looked similar (with blogging and weblog being big ;) it was actually possible to match the pairs. But the nicest thing was simply making all those pictures, laying them on the floor and thinking that I actually had some version of 5 chapters out of the 7 :)


November 14th 2007

Getting more by reading fewer blogs: some thoughts on ‘Cost-Effective Outbreak Detection in Networks’

Matthew Hurst on the most important blogs for efficient readers:

A group of researchers at CMU have been considering a notion of blog importance based on how likely a set of blogs is to ensure that you will be informed of topics bursting in the blogosphere. By analogy, they consider a graph of water pipelines. Their paper - Cost-Effective Outbreak Detection in Networks Leskovec, Krause, Guestrin, Faloutsos, VanBriesen, Glance - poses the problem:

Given a water distribution network, where should we place sensors to quickly detect contaminants? Or, which blogs should we read to avoid missing important stories? These seemingly different problems share common structure: Outbreak detection can be modeled as selecting nodes (sensor locations, blogs) in a network, in order to detect the spreading of a virus or information as quickly as possible.

As a result of this work, the authors have published some blog lists which answer a fundamentally important question in terms of weblog reading habits: Which weblogs should I read to be most up to date? The lists answering this question - generated by the approach described in their paper - come in a number of varieties to be found on the project’s page.

I scanned through the extended version of the paper (skipping most of the math :) and this is something I would love to see applied to niche blogging networks. For example, starting from a subset of weblogs that mention topic X or, better, those that participate in a discussion (cascade) that mentions topic X.

A few points relevant from the practical perspective - having a tool that helps a blogreader to make a selection of blogs to read (my expectations in that respect are pretty high given that Natalie Glance is working for Google now :)

1. “Costs” of reading. The authors played with optimising the number of blogs and the number of posts one reads. Assuming that reading fewer blog posts is more cost-effective, the algorithm shows that “the popular blogs might not be the most effective way to catch relevant information cascades” (p.23). Instead, it makes more sense to read “good summarizer blogs that may not be very popular, but which, by using few posts, catch most of the important stories propagating over the blogosphere” (p.15).

2. Predicting the future. From a reader’s perspective one would like to have a recommendation of blogs that will cover the most interesting stories in the future. From what I understood, the algorithm does not work that well for making those predictions. The authors optimised the performance by including only big blogs (= at least one post per day), but I wonder if there are some other alternatives.
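To make the “cost of reading” point a bit more concrete, here is a toy greedy selection in Python. It is a simplified sketch and not the algorithm from the paper (which, as far as I understand, uses a more refined lazy greedy procedure and richer objectives such as detection time). The blog names, cascade ids and costs below are invented, and costs are assumed to be positive.

```python
def pick_blogs(cascades_by_blog, cost_by_blog, budget):
    """Greedily pick blogs with the best (newly covered cascades / cost) ratio.

    cascades_by_blog: blog -> set of cascade ids the blog takes part in
    cost_by_blog:     blog -> reading cost, e.g. posts per day (must be > 0)
    budget:           total reading cost the reader is willing to spend
    """
    chosen, covered, spent = [], set(), 0
    remaining = set(cascades_by_blog)
    while remaining:
        best, best_ratio = None, 0.0
        for blog in sorted(remaining):  # sorted only to break ties predictably
            cost = cost_by_blog[blog]
            if spent + cost > budget:
                continue  # this blog no longer fits in the budget
            gain = len(cascades_by_blog[blog] - covered)
            if gain / cost > best_ratio:
                best, best_ratio = blog, gain / cost
        if best is None:
            break  # nothing affordable adds new cascades
        chosen.append(best)
        covered |= cascades_by_blog[best]
        spent += cost_by_blog[best]
        remaining.remove(best)
    return chosen, covered


# Hypothetical data: a popular but very active blog, a summarizer and a niche blog.
cascades = {"popular": {1, 2, 3, 4}, "summarizer": {1, 2, 3, 5}, "niche": {6}}
posts_per_day = {"popular": 20, "summarizer": 2, "niche": 1}
print(pick_blogs(cascades, posts_per_day, budget=5))
```

With this invented data the popular blog never fits into the budget, while the summarizer and the niche blog together catch five of the six cascades for three posts a day, which is the effect the quotes above describe.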

Anyway, I guess I should go back to my PhD writing and wait patiently till people who read the paper without skipping the math do something with it. So far I’m happy that the paper promises lots of interesting developments and that it also makes me feel less guilty about our alternative approach to vaccination by suggesting that “uniform immunization strategy corresponds to randomly placing sensors in a water network” (p.22), which is not optimal :)))

Archived version of this entry is available at http://blog.mathemagenic.com/2007/11/14.html#a1953; comments are here.


June 12th 2007

Tools to find similarity between two texts (weblog and papers)

I’m playing with an idea of comparing (parts of) my weblog with some of my published papers (and with the dissertation as a whole when I’m done). So far I’m interested in two things:

  • how much of the text is reused
  • how conceptually close two texts (weblog and a paper) are

Thought of a couple of ways to do so:

  • One way would be to use all kinds of weblog analysis tools from Anjo. One of the difficulties there would be to figure out how to find similarities between weblog text, which consists of relatively self-contained microcontent pieces, and linear “build upon previously said” academic papers.
  • Another option would be to use some plagiarism detection tools. I only wonder if you can configure those to compare a target paper with a specific weblog, rather than with “everything published”.
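As a rough starting point, the two questions above could be approximated with standard text measures: overlap of word n-grams for reuse, and cosine similarity of word counts for conceptual closeness. The sketch below is only an assumption about what such a comparison might look like (function names and sample sentences are invented); it is not what Anjo’s tools or any plagiarism detector actually do.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def shingles(text, n=3):
    """Set of overlapping word n-grams in the text."""
    words = tokens(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def reuse_share(source, target, n=3):
    """Share of the target's word n-grams that also occur in the source."""
    target_shingles = shingles(target, n)
    if not target_shingles:
        return 0.0
    return len(target_shingles & shingles(source, n)) / len(target_shingles)


def cosine_similarity(a, b):
    """Cosine similarity of the word-count vectors of two texts."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


post = "weblogs support knowledge work in many ways"
paper = "we argue that weblogs support knowledge work for researchers"
print(reuse_share(post, paper), cosine_similarity(post, paper))
```

Word counts are a crude proxy for “conceptually close”; they say nothing about microcontent versus linear argument, which is exactly the difficulty mentioned in the first bullet.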

Any ideas?

Archived version of this entry is available at http://blog.mathemagenic.com/2007/06/12.html#a1909; comments are here.


June 15th 2006

Papers of WWW2006 workshop on the weblogging ecosystem

Papers from the 3rd Annual Workshop on the Weblogging Ecosystem (see also papers from 2004 and 2005 workshops).

I seriously considered going, but it would cut a week from my honeymoon… At least now there is a nice collection for reading.

Archived version of this entry is available at http://blog.mathemagenic.com/2006/06/15.html#a1777; comments are here.


April 11th 2006

Feed your blog to tOKo and see what comes out

Anjo is moving further in developing a blog-friendly version of tOKo (related to all our earlier work on weblog communities, conversations and topics):

A little bit of progress on the open source version of tOKo (and the like), and in particular making it suitable for bloggers.

The first problem is turning a (your?) blog into a corpus. tOKo is pretty flexible as to what a corpus looks like, but the process must be automated. Jack Vinson and Ton Zijlstra provided great help by converting their blogs to a Movable Type export file and making the result available. Therefore, tOKo now contains a “Create corpus from Movable Type” function. The nice thing is that several blogging platforms provide Movable Type (MT) export. For example, in TypePad (which I use) a MT file can be generated from the web interface. Moreover, an MT file contains all information, including comments and trackbacks.
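For the curious, this is roughly what turning a Movable Type export into a corpus could involve. The sketch assumes the usual MT export layout: `KEY: value` metadata lines, sections closed by a line of five dashes, and entries closed by a line of eight dashes. It is not tOKo’s actual “Create corpus from Movable Type” code, and the function names and sample entry are made up.

```python
def _build_entry(sections):
    """Turn the sections of one entry into a dict (metadata, body, raw comments)."""
    entry = {"meta": {}, "body": "", "comments": []}
    for line in sections[0]:  # first section: KEY: value metadata lines
        key, sep, value = line.partition(":")
        if sep:
            entry["meta"][key.strip()] = value.strip()
    for lines in sections[1:]:
        if not lines:
            continue
        label, _, first = lines[0].partition(":")
        content = "\n".join([first] + lines[1:]).strip()
        if label == "BODY":
            entry["body"] = content
        elif label == "COMMENT":
            entry["comments"].append(content)  # comment kept as raw text
    return entry


def parse_mt_export(text):
    """Split a Movable Type export into a list of entries."""
    entries, sections, current = [], [], []
    for line in text.splitlines():
        if line.strip() == "-----":  # end of a section
            sections.append(current)
            current = []
        elif line.strip() == "--------":  # end of an entry
            if any(l.strip() for l in current):
                sections.append(current)
            if sections:
                entries.append(_build_entry(sections))
            sections, current = [], []
        else:
            current.append(line)
    if any(l.strip() for l in current):  # tolerate a missing final separator
        sections.append(current)
    if sections:
        entries.append(_build_entry(sections))
    return entries


sample_export = """TITLE: Feed your blog
DATE: 04/11/2006 10:00:00 AM
-----
BODY:
Anjo is moving further.
-----
COMMENT:
AUTHOR: Ton
Nice work!
-----
--------
"""
print(parse_mt_export(sample_export))
```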

I’m already anticipating the research fun - getting hold of comments next to post text would be such a great thing for the analysis :)

And, if you want to help develop the tool, you can contribute your blog archives in Movable Type format (WPexport could be handy for WordPress users). This especially makes sense if you feel you belong to the KM bloggers community (paper) - or, as Anjo puts it:

If you have linked to Jack, Ton, Lilia or myself in the past, this would be particularly interesting (also if you can only export to Movable Type). The only disadvantage of making your weblog available is that I might ask you to alpha-test tOKo :-).

My email address is: anjo science uva nl (one at, two dots).

You can get a bit more insight into this work from Ton’s impressions on the work in progress and Anjo’s visualisations (1, 2, 3, 4).

Archived version of this entry is available at http://blog.mathemagenic.com/2006/04/11.html#a1761; comments are here.


September 4th 2005

Shout if you want to be heard or Technorati blog finder

For all the unhappy ones: Technorati performance and scalability improvement progress and Technorati blog finder (via David Sifry).

Things to know:

1. You have to categorise your weblog manually:

By default, the blogs are presented in order of authority, which means highly-linked blogs appear first. So each of these Blog Finder pages is like a mini Top 100 for any topic you can imagine. You can also sort each tag by how recently the blogs have updated, or alphabetical by title.

And for all you bloggers out there, this is a great opportunity for your blog to get found. If you’re already a Technorati member with a claimed blog, all you have to do is visit your Configure Blog page to choose which tags you want to use. You can add up to 20 tags per blog.

2. It’s prepopulated based on existing tags:

We kicked off the Blog Finder by auto-classifying blogs based on the tags they use in posts most often. But you can list your blog under any tag you like, up to 20.

Of course I went to check for my weblog and didn’t find it under KM, “knowledge management” or “learning”. Not surprising since I don’t really use Technorati tags (the necessary mark-up is not produced by LiveTopics and I’m too lazy to add tags manually next to adding topics).

Clearly, those users who don’t know or don’t care about tagging specifically for Technorati are left out of the system (which reinforces “shout if you want to be heard” behavior with all its implications).

Thoughts:

  • Maybe some kind of extrapolation could work - if top blogs on a topic link to a specific blog frequently, it could be included in the topic list.
  • Wonder how tagging at post level (current auto-classification) would integrate with the blog-level tagging that is asked for.
  • If auto-classification stays (which makes sense) and continues influencing one’s inclusion in the lists - how would this influence post-level tagging (e.g. adding unnecessary tags)?
  • First thought of spammers, but then realised that it’s more or less covered by sorting based on incoming links (of course, until someone heavily linked in one domain starts adding tags for another domain that has nothing to do with the blog focus).

And, an example of overcoming being lazy and conforming to “shout if you want to be heard” practice :)

Archived version of this entry is available at http://blog.mathemagenic.com/2005/09/04.html#a1654; comments are here.


April 26th 2005

Social computing symposium: BlogTrace demo

Today I’m presenting our work on BlogTrace (Anjo Anjewierden does most of the hard work, but there are others as well - see below). Some links that could be interesting:

  • Mapping knowledge flows in weblogs (with Anjo, Rogier Brussee & Robert de Hoog)
  • Mapping weblog communities (work with Anjo and Stephanie Hendrick)
  • Things I’m not showing, related

  • Background

Archived version of this entry is available at http://blog.mathemagenic.com/2005/04/26.html#a1566; comments are here.


January 29th 2005

BlogTrace

Anjo shares details about BlogTrace, the weblog analysis tool we are working on (as you can see from Anjo’s post my main contribution is motivating the work and then going for a vacation :)))

I have too many specific comments, so at this moment just an image representing the BlogTrace architecture. Read Anjo’s post for more details.

Archived version of this entry is available at http://blog.mathemagenic.com/2005/01/29.html#a1494; comments are here.


January 28th 2005

Ontological fingerprinting: documents or people

Anjo gives a bit of insight into our internal discussions on uses of ontologies:

Andy Boyd came up with a wonderful new term: “ontological fingerprinting” and to illustrate how imaginative he is: zero hits on Google! Suppose one has an ontology (lexicon, thesaurus) and some software that can determine whether the terms in the ontology are present in a document. Applying the software, one gets a “fingerprint” of the concepts in the ontology for a given document. Comparing fingerprints for different documents, such is the assumption, provides a better metric of the similarity between these documents than comparing plain words. Ideas like this simply have to be tested in practice. Fortunately, Andy is making available a lot of real data to try it.
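A minimal reading of the idea in the quote, sketched in Python: count how often each ontology term occurs in a document and compare the resulting vectors. The ontology, the documents and the function names below are invented for illustration, and a real ontology would of course be more than a flat term list.

```python
import math
import re


def fingerprint(document, ontology):
    """Count how often each ontology term occurs in the document (whole-word match)."""
    text = document.lower()
    return {
        term: len(re.findall(r"\b" + re.escape(term.lower()) + r"\b", text))
        for term in ontology
    }


def similarity(fp_a, fp_b):
    """Cosine similarity between two fingerprints built from the same ontology."""
    dot = sum(fp_a[term] * fp_b[term] for term in fp_a)
    norm = math.sqrt(sum(v * v for v in fp_a.values())) * math.sqrt(sum(v * v for v in fp_b.values()))
    return dot / norm if norm else 0.0


# Hypothetical ontology and documents.
ontology = ["weblog", "knowledge management", "community"]
doc_a = "A weblog community shares knowledge management ideas; the weblog helps."
doc_b = "Knowledge management in a community."

fp_a, fp_b = fingerprint(doc_a, ontology), fingerprint(doc_b, ontology)
print(fp_a, fp_b, similarity(fp_a, fp_b))
```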

I like the term, but find it a bit misleading: usually documents do not have fingers :)

I’d associate the term with people - you may think of the “ontological fingerprint” of a person, which could be something like conceptualisations produced by Sigmund based on analysis of weblog posts written by someone, a set of personal categories someone uses to classify a document, or a mapping of one’s documents to a shared ontology. Then you can look for others with similar “fingerprints” (this was one of the uses I imagined for Sigmund, but I didn’t have such a nice term to talk about it :).

Maybe we should rather talk about an “ontological abstract” in the case of documents…

Archived version of this entry is available at http://blog.mathemagenic.com/2005/01/28.html#a1493; comments are here.


Welcome!

I have not been blogging for a while. Between working on the chapters of my PhD dissertation and being a happy mom there wasn't much time to fix blog bugs. Finally I managed: this is a brand new WordPress blog; old Radio archives live next to it [quotes in imported posts are broken, I'm slowly fixing that]. It will take a while to make it nice and beautiful, but at least now I have a space to write.