Anjo on supporting blog research:
Blog research seems to center around the following themes:
- Communities, or “virtual settlements”; see the recent paper by Lilia Efimova and Stephanie Hendrick.
- Conversations. A set of posts, distributed over several weblogs, which relate to a particular topic.
- Language analysis. Analysis of the vocabulary used in a weblog, for example to classify favourite topics of the blogger. Sigmund is an example.
Supporting research on these themes requires different kinds of information from weblogs: Communities mainly requires link data; Conversations additionally requires shallow text analysis of particular posts; and Language analysis obviously requires (all) full posts.
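To make the difference between these kinds of information concrete, here is a minimal sketch that pulls both from one post: the outgoing links (what a community study would need) and term counts over the full text (what language analysis would need). The names `PostScanner`, `scan_post` and `linked_hosts` are made up for illustration, and reducing links to host names is only an assumption about how weblogs might be identified, not something the proposal prescribes.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class PostScanner(HTMLParser):
    """Collects outgoing link targets and visible text from one post's HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags with an href count as link data.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Text between tags is kept for the vocabulary count.
        self.text_parts.append(data)


def scan_post(html):
    """Return (links, term_counts) for a single post body."""
    scanner = PostScanner()
    scanner.feed(html)
    scanner.close()
    text = " ".join(scanner.text_parts).lower()
    words = re.findall(r"[a-z']+", text)
    return scanner.links, Counter(words)


def linked_hosts(links):
    """Reduce links to host names (an assumed way to identify a weblog)."""
    hosts = set()
    for link in links:
        host = urlparse(link).netloc
        if host:
            hosts.add(host)
    return hosts
```

The link list alone is enough for the Communities theme, while the term counts only make sense when the whole post body is available, which is the split the data-sets below are built around.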
The question therefore is whether it is possible to create a Blog Research Repository that accommodates the above themes. The data acquisition methods described in the paper by Lilia and Stephanie illustrate that blog research is, by and large, only supported by hard work and regular expressions (or tools that know what regular expressions are :-)).
Motivated by the themes, and practical considerations, the proposal is to organise the repository around the following types of data-sets:
- Structure. This is essentially the same as an RSS feed without the content of the posts, but with all links that can be found in posts.
- Content. Identical to full post RSS feeds.
- Abstractions / Aggregations. Any number of data-sets that contain abstractions or aggregations on a weblog for a particular research purpose. For example, Sigmund requires a data-set that contains the relation between terms in posts.
The practical considerations regarding efficiency are that the structure can be kept in memory for a fairly large set of weblogs (say 10,000), the content can be retrieved from disk on demand, and the abstractions can be defined on-the-fly.
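A rough sketch of how that split could look, assuming one JSON file per post on disk and post ids that are safe to use as file names. `PostStructure`, `Repository`, `link_counts` and `demo` are hypothetical names, not part of the proposal: the structure lives in a dictionary in memory, content is read from disk only when asked for, and an abstraction is just a function applied to the repository on the fly.

```python
import json
import tempfile
from collections import Counter
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PostStructure:
    """Structure record: post metadata plus links, but no post content."""
    post_id: str
    weblog: str
    date: str
    links: list = field(default_factory=list)


class Repository:
    def __init__(self, content_dir):
        self.content_dir = Path(content_dir)
        self.structure = {}  # post_id -> PostStructure, kept in memory

    def _path(self, post_id):
        # Assumes post ids are safe to use as file names.
        return self.content_dir / f"{post_id}.json"

    def add_post(self, record, content):
        """Keep the structure in memory and write the content to disk."""
        self.structure[record.post_id] = record
        self.content_dir.mkdir(parents=True, exist_ok=True)
        payload = json.dumps({"content": content})
        self._path(record.post_id).write_text(payload, encoding="utf-8")

    def content(self, post_id):
        """Retrieve the full post from disk on demand."""
        raw = self._path(post_id).read_text(encoding="utf-8")
        return json.loads(raw)["content"]

    def abstraction(self, func):
        """Compute an abstraction or aggregation on the fly."""
        return func(self)


def link_counts(repo):
    """Example abstraction: how often each target is linked to."""
    counts = Counter()
    for record in repo.structure.values():
        counts.update(record.links)
    return counts


def demo():
    """Small usage example with two made-up posts."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Repository(tmp)
        repo.add_post(
            PostStructure("a1", "http://a.example/", "2004-11-22",
                          ["http://b.example/"]),
            "Full text A",
        )
        repo.add_post(
            PostStructure("b1", "http://b.example/", "2004-11-23",
                          ["http://b.example/", "http://c.example/"]),
            "Full text B",
        )
        return repo.content("a1"), repo.abstraction(link_counts)
```

Because the structure records carry no post text, they stay small enough that holding them for many weblogs in memory is plausible, while anything that needs the text pays the disk cost only for the posts it touches.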
The repository should use public standards for representation. For the structure, RDF(s) appears the obvious choice. A basic structure that includes classes like weblog, post, link (etc.) provides a starting point that can be refined. Content is represented in the de facto standard, RSS 1.0. Abstractions and aggregations are represented in RDF where possible to preserve the relation to the structure and content.
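As an illustration of what such a basic structure might look like, the sketch below writes the structure of one post as N-Triples using only the Python standard library. The vocabulary under `http://example.org/blog-schema#` (`Weblog`, `Post`, `partOf`, `date`, `linksTo`) is entirely hypothetical; only the `rdf:type` URI is the real one. For brevity a link is modelled here as a property between a post and its target rather than as a class of its own, which is a simplification of the weblog, post, link classes mentioned above.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BLOG = "http://example.org/blog-schema#"  # hypothetical vocabulary


def iri(value):
    """Wrap a URL in angle brackets, as N-Triples expects for IRIs."""
    return f"<{value}>"


def literal(text):
    """Quote a plain string literal, escaping characters N-Triples reserves."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def post_triples(weblog_url, post_url, date, links):
    """Describe one post: its type, its weblog, its date and its links."""
    triples = [
        (iri(weblog_url), iri(RDF_TYPE), iri(BLOG + "Weblog")),
        (iri(post_url), iri(RDF_TYPE), iri(BLOG + "Post")),
        (iri(post_url), iri(BLOG + "partOf"), iri(weblog_url)),
        (iri(post_url), iri(BLOG + "date"), literal(date)),
    ]
    for target in links:
        triples.append((iri(post_url), iri(BLOG + "linksTo"), iri(target)))
    return triples


def to_ntriples(triples):
    """Serialise triples, one 'subject predicate object .' per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
```

An abstraction data-set could reuse the same post URLs as subjects, which is what would keep it connected to the structure and content.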
Brian Dennis on parallel effort:
So I find this to be really odd. At the WWW 2004 blogging ecosystem workshop, Cameron Marlow (blogdex) and Maciej Ceglowski (blogcensus) propose and present a new, open, blog indexing service, called upflux. Even though the service is vapor, there is zero mention of it in the blogosphere.
I wonder if there are others working on the same. I hope for synergies between the different efforts: as a researcher I just want to spend a bit less time chasing and teasing out data and a bit more analysing it :)
Archived version of this entry is available at http://blog.mathemagenic.com/2004/11/22.html#a1439; comments are here.