Updated: 6/30/2005; 11:34:13 PM.


...giving birth to learning...


  Monday, November 22, 2004

  Blog research repository?

Anjo on supporting blog research:

Blog research seems to center around the following themes:
  • Communities, or "virtual settlements"; see the recent paper by Lilia Efimova and Stephanie Hendrick.
  • Conversations. A set of posts, distributed over several weblogs, which relate to a particular topic.
  • Language analysis. Analysis of the vocabulary used in a weblog, for example to classify the favourite topics of the blogger. Sigmund is an example.

Supporting research on these themes requires different kinds of information from weblogs. Communities mainly requires link data; Conversations additionally requires shallow text analysis of particular posts; and Language analysis obviously requires (all) full posts.

The question therefore is whether it is possible to create a Blog Research Repository that accommodates the above themes. The data acquisition methods described in the paper by Lilia and Stephanie illustrate that blog research is by and large only supported by hard work and regular expressions (or tools that know what regular expressions are :-)).

Motivated by the themes, and practical considerations, the proposal is to organise the repository around the following types of data-sets:

  • Structure. This is essentially the same as an RSS feed without the content of the posts, but with all links that can be found in posts.
  • Content. Identical to full post RSS feeds.
  • Abstractions / Aggregations. Any number of data-sets that contain abstractions or aggregations on a weblog for a particular research purpose. For example, Sigmund requires a data-set that contains the relation between terms in posts.

The practical considerations regarding efficiency are that the structure can be kept in memory for a fairly large set of weblogs (say 10,000), the content can be retrieved from disk on demand, and the abstractions can be defined on-the-fly.

The repository should use public standards for representation. For the structure RDF(s) appears the obvious choice. A basic structure that includes classes like weblog, post, link (etc.) provides a starting point that can be refined. Content is represented as the de facto standard RSS 1.0. Abstractions and aggregations are represented in RDF where possible to preserve the relation to the structure and content.
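To make the "structure" data-set a bit more concrete, here is a minimal Python sketch of what one entry could look like: the post metadata is kept, the content is dropped, and every link found in the post body is retained. This is only an illustration under my own assumptions, not Anjo's design: the names `Weblog`, `Post`, `LinkCollector` and `structure_of` are hypothetical, plain Python objects stand in for the RDF representation, and the links are pulled out with the standard-library HTML parser rather than regular expressions.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a post body."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


@dataclass
class Post:
    """Structure-only view of a post: metadata and links, no content."""
    permalink: str
    title: str
    date: str
    links: list = field(default_factory=list)


@dataclass
class Weblog:
    url: str
    posts: list = field(default_factory=list)


def structure_of(permalink, title, date, html_body):
    """Build a Post that keeps the links but discards the body text."""
    collector = LinkCollector()
    collector.feed(html_body)
    collector.close()
    return Post(permalink, title, date, collector.links)


# Hypothetical example data (example.org is a placeholder domain)
body = ('<p>See <a href="http://example.org/a">this</a> and '
        '<a href="http://example.org/b">that</a>.</p>')
post = structure_of("http://example.org/blog/1", "Hello", "2004-11-22", body)
blog = Weblog("http://example.org/blog/", [post])
print(post.links)
```

Because each `Post` holds only a few short strings and a list of URLs, keeping this view in memory for many weblogs is plausible, while the full content stays on disk as the proposal suggests.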

Brian Dennis on parallel effort:

So I find this to be really odd. At the WWW 2004 blogging ecosystem workshop, Cameron Marlow (blogdex) and Maciej Ceglowski (blogcensus) propose and present a new, open, blog indexing service, called upflux. Even though the service is vapor, there is zero mention of it in the blogosphere.

I wonder if there are others working on the same thing. I hope for synergies between the different efforts: as a researcher I just want to spend a bit less time chasing and teasing data and a bit more analysing it :)

This post also appears on channel weblog research

More on: blog research 

  Weblog research challenges: 'teasing' data

The public nature of weblogs makes them an easy target for a researcher, providing a record of personal interest and engagement in the posts, as well as links that indicate influences and relations with other bloggers. Most weblogs have a simple and well-defined structure (e.g. the weblog post usually has a title, a permalink and a date/time stamp), generate web-feeds (RSS or Atom) representing weblog content in machine-readable format (XML or RDF), or notify centralised weblog tracking tools (e.g. weblogs.com) about updates.
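As a small illustration of how regular this structure is when a feed is available, here is a minimal sketch that reads the title, permalink and date of each post from a feed. It assumes an RSS 2.0 feed without namespaces (RSS 1.0 and Atom use XML namespaces and would need different element names); the sample feed and the function name `read_items` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed (example.org is a placeholder domain)
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example weblog</title>
    <link>http://example.org/blog/</link>
    <item>
      <title>Blog research repository?</title>
      <link>http://example.org/blog/2004/11/22.html#a1</link>
      <pubDate>Mon, 22 Nov 2004 10:00:00 GMT</pubDate>
      <description>Short excerpt only...</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return one dict per <item>, with empty strings for missing fields."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "permalink": item.findtext("link", default=""),
            "date": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


items = read_items(SAMPLE_FEED)
for entry in items:
    print(entry["date"], entry["title"], entry["permalink"])
```

Note that the `description` here may be only an excerpt, so a feed parsed this way gives the structure of a weblog but not necessarily the full text of its posts.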

The relatively simple structure of weblogs and widespread adoption of standards (RSS, XML-RPC, Blogger API) by weblog tool providers enable a variety of tools and services that allow tracking and analysing weblogs. For example, one can visualise a weblog neighbourhood (related weblogs) at Blogstreet, check weblog popularity ranking at Technorati, track ideas contagiously spreading in a weblog community at Blogdex or read a selected subset of weblogs online at Bloglines.

Publicly available weblog data and the large number of tools for analysing it raise expectations that this data is readily available for research purposes. The practice of weblog research is dramatically different (see e.g. Anjewierden, Brussee, & Efimova, 2004; Herring et al., 2005, for explicit indications of the challenges of obtaining weblog data). Most weblog tracking and analysis tools index only a subset of weblogs (e.g. those registered with the system); include only partial weblog data, usually representing fresh updates (e.g. links from homepages or content from the last 45 days); or index only data in machine-readable formats (e.g. RSS/Atom feeds, which are not always present or include excerpts of weblog posts instead of the full text).

Developing data collection tools for a specific study brings a variety of challenges as well. These include distinguishing a weblog from other types of web-sites and taking into account differences in the structure and layout of weblogs that result from platform-specific functionality, user-modified templates or different practices of using weblog tools. As a result, many weblog researchers have to limit themselves to convenience samples (e.g. restricting data collection to a specific weblog platform, as in Merelo-Guervos, Prieto, Rateb, & Tricas, 2004) or rely on manual work, which limits the number of weblogs and weblog characteristics that can be included in the analysis. Choices made for data collection in those cases can heavily influence the results of the analysis.



This post also appears on channel weblog research

More on: blog research 

  There will be no BlogTalk 3.0

Thomas Burg (via Martin Roell)

BTW: since many people asked about that. There will be no BlogTalk 3.0. I'm thinking of something broader and different. So I'm looking forward that someone will step forward to organize the next international weblog-conference.


I wonder:

  • whether there is a need for a specific conference on weblogs in Europe?
  • what that "broader and different" thing is that Thomas has in mind? :)

This post also appears on channel BlogTalk 

More on: blog research BlogTalk 


© Copyright 2002-2005 Lilia Efimova.

This weblog is my learning diary. Sometimes I write about things related to my work, but the views expressed here are personal and do not necessarily reflect the views of my employer.


Edublog award 2004: Best Research Based Blog.
