Weblog research challenges: 'teasing' data
The public nature of weblogs makes them an easy target for a researcher, providing a record of personal interest and engagement in the posts, as well as links that indicate influences and relations with other bloggers. Most weblogs have a simple and well-defined structure (e.g. the weblog post usually has a title, a permalink and a date/time stamp), generate web-feeds (RSS or Atom) representing weblog content in machine-readable format (XML or RDF), or notify centralised weblog tracking tools (e.g. weblogs.com) about updates.
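To make the "machine-readable" point concrete, here is a minimal Python sketch that pulls the title, permalink and date/time stamp out of an RSS 2.0 feed using only the standard library. The sample feed, its URLs and the function name `parse_rss_posts` are invented for illustration. Real feeds vary (Atom and RDF-based RSS 1.0 use different element names and namespaces), so this is a sketch under those assumptions rather than a general-purpose parser.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed, invented for this example.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example weblog</title>
    <link>http://example.org/</link>
    <item>
      <title>First post</title>
      <link>http://example.org/2004/07/first-post</link>
      <pubDate>Mon, 05 Jul 2004 10:30:00 GMT</pubDate>
      <description>Excerpt or full text of the post.</description>
    </item>
  </channel>
</rss>"""


def parse_rss_posts(feed_xml):
    """Return one dict per <item> with its title, permalink and timestamp."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        posts.append({
            "title": item.findtext("title"),
            # Assumption: <link> holds the permalink; some feeds use <guid>.
            "permalink": item.findtext("link"),
            # RSS 2.0 dates follow the RFC 822 style that email.utils parses.
            "published": parsedate_to_datetime(pub_date) if pub_date else None,
        })
    return posts


posts = parse_rss_posts(SAMPLE_FEED)
for post in posts:
    print(post["title"], post["permalink"], post["published"].isoformat())
```

Note that the `<description>` element may hold only an excerpt of the post, which is one reason why feed-based collection can miss full-text content.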
The relatively simple structure of weblogs and the widespread adoption of standards (RSS, XML-RPC, Blogger API) by weblog tool providers enable a variety of tools and services for tracking and analysing weblogs. For example, one can visualise a weblog neighbourhood (related weblogs) at Blogstreet, check weblog popularity rankings at Technorati, track ideas spreading contagiously through a weblog community at Blogdex or read a selected subset of weblogs online at Bloglines.
Publicly available weblog data and a large number of tools to analyse it raise expectations that this data is readily available for research purposes. The practice of weblog research is dramatically different (see e.g. Anjewierden, Brussee, & Efimova, 2004; Herring et al., 2005, for explicit indications of the challenges of obtaining weblog data). Most weblog tracking and analysis tools index only a subset of weblogs (e.g. those that registered with the system); include only partial weblog data, usually representing fresh updates (e.g. links from homepages or content from the last 45 days); or index only data in machine-readable formats (e.g. RSS/Atom feeds, which are not always present or include excerpts of weblog posts instead of full text).
Developing data collection tools for a specific study meets a variety of challenges as well. These include distinguishing a weblog from other types of web-sites and taking into account differences in the structure and layout of weblogs that arise from platform-specific functionalities, user-modified templates or different practices of using weblog tools. As a result, many weblog researchers have to limit themselves to working with convenience samples (e.g. restricting data collection to a specific weblog platform as in Merelo-Geurvos, Prieto, Rateb, & Tricas, 2004) or rely on manual work that limits the number of weblogs and weblog characteristics that can be included in the analysis. Choices made for data collection in those cases can heavily influence the results of the analysis.
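As one illustration of how rough such choices can be, the Python sketch below uses a crude heuristic for spotting candidate weblogs: it checks whether a page advertises an RSS or Atom feed through a `<link rel="alternate">` element in its HTML. This heuristic is an assumption made for the example, not a method taken from the studies cited above. Many non-weblog sites publish feeds and some weblogs do not, so relying on it would bias a sample in exactly the way described. The class and function names and the sample pages are hypothetical.

```python
from html.parser import HTMLParser

# MIME types commonly used when a page advertises a feed.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> elements that point to feeds."""

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and attr.get("href"):
            self.feed_urls.append(attr["href"])


def advertised_feeds(html):
    """Return the feed URLs a page advertises (empty list if none)."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feed_urls


# Hypothetical pages, invented for this example.
weblog_page = """<html><head>
<title>Example weblog</title>
<link rel="alternate" type="application/rss+xml" href="http://example.org/index.xml">
</head><body>...</body></html>"""

plain_page = "<html><head><title>Company site</title></head><body>...</body></html>"

print(advertised_feeds(weblog_page))
print(advertised_feeds(plain_page))
```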
- Anjewierden, A., Brussee, R., & Efimova, L. (2004). Shared conceptualisations in weblogs. To be published in Proceedings of BlogTalk 2.0, Thomas N. Burg (ed.), Vienna, July 2004.
- Herring, S. C., Kouper, I., Paolillo, J. C., Scheidt, L. A., Tyworth, M., Welsch, P. et al. (2005). Conversations in the blogosphere: An analysis "from the bottom-up". Forthcoming in Proceedings of the 38th Hawaii International Conference on System Sciences (HICSS-38). Los Alamitos: IEEE Press.
- Merelo-Geurvos, J. J., Prieto, B., Rateb, F., & Tricas, F. (2004). Mapping weblog communities. Submitted to Computer Networks.