Archive for the 'Digital traces' Category

July 7th 2008

Comparing weblog text to the PhD dissertation via tagclouds

About a year ago I looked for Tools to find similarity between two texts (weblog and papers) - I wanted to find a relatively objective way to judge how much of my weblog writing ends up in the dissertation.

Among other things I experimented with generating and comparing tagclouds from texts that were supposed to correspond to each other. I tried several tools, but ended up with TagCrowd since it allowed using generic and custom-made lists of stop words.

As an experiment I used the text of five dissertation chapters (draft versions as of April 17, 2008) and the text of blog posts coded as corresponding to those chapters to generate a visualisation of the most frequent words in each case. After removing stop words (general English plus those from my own list that I was stupid enough not to save) the 65 most frequent words are visualised.
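A minimal sketch of the counting step behind such a tagcloud, assuming plain-text input. This is not TagCrowd's actual implementation: the function name `top_words` and the tiny stop list are made up for illustration, and the real experiment used a general English list plus a custom one.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not the list used in the experiment).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
              "that", "for", "on", "with", "as", "i"}

def top_words(text, stop_words=STOP_WORDS, limit=65):
    """Return the `limit` most frequent non-stop words with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(limit)

sample = "Blogging is a conversation. A weblog conversation is blogging in the open."
print(top_words(sample, limit=3))
```

Running the same function over a chapter draft and over the blog posts coded for it gives two word lists that can be visualised or compared side by side.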

For example, the two tagclouds below come from the blogposts related to the Microsoft study and from the draft chapter with its results.
Tagcrowd: blogposts related to chapter 6 (Microsoft)
Tagcrowd: current draft chapter 6 (Microsoft)

In total I had 5 pairs of visualisations. I then mixed them and asked five people familiar with my research (supervisors and collaborators) and eight students (taking a class with Anjo) to find matching pairs. The results are below.

                                          Total pairs   Correctly matched pairs   Correctly matched pairs, %
Chapter 1. Introduction                   13            10                        77%
Chapter 2. Methodology                    13            11                        85%
Chapter 3. Ideas                          13            6                         46%
Chapter 4. Conversations                  13            10                        77%
Chapter 5. Microsoft                      13            9                         69%
Total                                     65            46                        71%
by people familiar with the research      25            20                        80%
by people not familiar with the research  40            26                        65%

Some comments:

  • I guess there is a connection between PhD chapters and blogposts :)
  • The high score for the methodology chapter is explained by its qualitative difference from the rest of the dissertation.
  • The low score for the Ideas chapter is explained by the fact that the coding of weblog entries in relation to chapters was done prior to writing it. As a result it included many “might be relevant” posts, while for other chapters the focus was clearer. In addition, the draft version of the chapter used to generate the visualisation was the first draft, while in other cases the drafts had been revised several times.

Tagcrowds: current state of the dissertation

It was nice to see that although many of the visualisations looked similar (with blogging and weblog being big ;) it was actually possible to match the pairs. But the nicest thing was simply making all those pictures, laying them on the floor and thinking that I actually had some version of 5 chapters out of the 7 :)

June 22nd 2008

Reasons for using weblog to keep information bits

While figuring out how to summarise a 30+ page chapter on my blogging practices for a talk I’m giving tomorrow, I realised that it could make sense to share some of the bits from it here. This one is on the reasons to use a weblog to keep information bits, using the list of factors for choosing a strategy for Keeping found things found on the web.

Portability / Number of access points. Using a weblog for organising my thinking resources fits my preference for web-based applications in general, since I use multiple computers and I’m very likely to be online while working. In this respect a server-based weblog is a much better alternative for organising my ideas than any desktop application, since I can access it whenever I’m online, regardless of location.

Preservation of information in its current state / Currency of information. To a degree a weblog allows both at the same time. I usually quote the most relevant bits of external resources, so those quotes are preserved in their current state. The quotes are accompanied by a link to the original (if online), so an updated version is easily accessible. If the original disappears or is moved, I can use the quote to find it (usually it’s at an updated location easily found with any search engine; otherwise I use the Internet Archive Wayback Machine).

Context (remembering why it was saved) / Reminding. Most of my weblog posts contain a commentary that provides a context for a specific thought or reference; I also use multiple strategies to establish connections between different posts. That context is enough to recall why a certain weblog post is there and to remember to use it at a later stage (although it is not as effective as a to-do list for reminding me of an urgent task).

Ease of integration into existing structures. On the one hand, my weblog is a stand-alone tool that requires its own organisation and archiving. On the other, it is essentially a set of webpages connected by links, with permalinks, metadata and underlying standards. It is an integral part of my online presence (as is evident from searching for my name in any search engine) and references to it can easily be included in a variety of other documents or systems.

Communication and information sharing. Sharing information via a weblog is not a separate activity, but a by-product of writing. In most cases that’s an advantage; however, it limits potential uses of blogging when access to some of the weblog posts has to be restricted. A weblog is not good for goal-driven communication with a few known people, but it is a perfect instrument for non-intrusive sharing of ideas in cases where the potential audience is not well defined.

Ease of maintenance. In my case most maintenance problems are technology-related, and they are the result of choosing a weblog platform that provides a high degree of freedom and flexibility.

December 16th 2007

WikiDashboard: transparency, privacy and other consequences of measurement

Like Stowe Boyd and Jack Vinson, I’m not a big fan of wikis: while they are good for collective writing when authorship of specific contributions is not important, there are many more cases where it’s essential to know who makes what changes. Of course, the history of edits is there, but it’s just too unhuman to be used systematically.

However, given that the traces are there, getting tools to analyse them is just a matter of time. WikiDashboard (thanks to Jane McConnell) is a good example of what is possible: if you use it to browse Wikipedia, each page is enhanced with a visualisation of the top ten users who edited it.

Motivation: The idea is that if we provide social transparency and enable attribution of work to individual workers in Wikipedia, then this will eventually result in increased credibility and trust in the page content, and therefore higher levels of trust in Wikipedia.

I was curious to see how it works, so I used it to check who edits the Knowledge_Management page:

Wikidashboard: Knowledge management

And then clicked on User:Snowded:

Wikidashboard: user:snowded

The second screenshot is more interesting: it’s a user page that shows which pages he edits most. As I suspected, the user is Dave Snowden, and you can see not only which pages he edits, but also that he seems to have given up editing the KM page (or that the visualisations are not up to date, since this is not the case).

Well, on the one hand I’m happy to see tools that add transparency and give credit to individual contributors. On the other hand, I wonder what Dave thinks of it. It’s not only about privacy concerns, but also about the potential of tools like this for messing up contributor motivations and all the other consequences of measurement.

The people behind Wikidashboard are interested in the patterns that it might show, also inside companies:

We’re curious of how the Web community will use this tool to surface social dynamics and editing patterns that might otherwise be difficult to find and analyze in Wikipedia. We are also interested in applying this tool to Enterprise Wikis.

I’m interested in those patterns too, but even more in the secondary effects of having a tool like that in a corporate setting. I still remember the feedback we got on our innocent prototype that visualised some patterns in a corporate discussion forum. Back then I was surprised not so much by the “Big Brother” title for our application as by a little detail: community members didn’t want the number of messages they wrote to be visible next to their names, a feature that you often see in public forums. Funnily enough, they didn’t mind having a list of messages they wrote displayed next to their names. Numbers are easy to judge and easy to turn into targets, while it’s pretty clear that contribution is not about that.

Archived version of this entry is available at http://blog.mathemagenic.com/2007/12/16.html#a1965; comments are here.

November 14th 2007

Getting more by reading less blogs: some thoughts on ‘Cost-Effective Outbreak Detection in Networks’

Matthew Hurst on the most important blogs for efficient readers:

A group of researchers at CMU have been considering a notion of blog importance based on how likely a set of blogs is to ensure that you will be informed of topics bursting in the blogosphere. By analogy, they consider a graph of water pipelines. Their paper - Cost-Effective Outbreak Detection in Networks Leskovec, Krause, Guestrin, Faloutsos, VanBriesen, Glance - poses the problem:

Given a water distribution network, where should we place sensors to quickly detect contaminants? Or, which blogs should we read to avoid missing important stories? These seemingly different problems share common structure: Outbreak detection can be modeled as selecting nodes (sensor locations, blogs) in a network, in order to detect the spreading of a virus or information as quickly as possible.

As a result of this work, the authors have published some blog lists which answer a fundamentally important question in terms of weblog reading habits: Which weblogs should I read to be most up to date? The lists answering this question - generated by the approach described in their paper - come in a number of varieties to be found on the project’s page.

I scanned (skipped most of the math :) through the extended version of the paper and this is something I would love to see applied to niche blogging networks. For example, starting from a subset of weblogs that mention topic X or, better, those that participate in a discussion (cascade) that mentions topic X.

A few points relevant from the practical perspective - having a tool that helps a blogreader to make a selection of blogs to read (my expectations in that respect are pretty high given that Natalie Glance is working for Google now :)

1. “Costs” of reading. The authors played with optimising the number of blogs and the number of posts one reads. Assuming that reading fewer blog posts is more cost-effective, the algorithm shows that “the popular blogs might not be the most effective way to catch relevant information cascades” (p.23). Instead, it makes more sense to read “good summarizer blogs that may not be very popular, but which, by using few posts, catch most of the important stories propagating over the blogosphere” (p.15).

2. Predicting the future. From a reader’s perspective one would like to have a recommendation of blogs that will cover the most interesting stories in the future. From what I understood, the algorithm does not work that well for making those predictions. The authors optimised the performance by including only big blogs (= at least one post per day), but I wonder if there are some other alternatives.
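To make the “cost of reading” trade-off concrete, here is a toy sketch of a greedy selection that repeatedly picks the blog covering the most not-yet-seen stories per post read, within a reading budget. It is a simplified illustration, not the algorithm from the paper; the function `pick_blogs`, the blog names and all the numbers are made up, and each blog is assumed to cost at least one post.

```python
def pick_blogs(blogs, budget):
    """Greedily pick blogs that cover the most new stories per post read.

    `blogs` maps a blog name to (posts_to_read, set_of_stories_covered).
    `budget` is the maximum total number of posts the reader will read.
    """
    chosen, covered, spent = [], set(), 0
    remaining = dict(blogs)
    while remaining:
        best, best_ratio = None, 0.0
        for name, (cost, stories) in remaining.items():
            gain = len(stories - covered)  # stories not yet covered
            if spent + cost <= budget and gain / cost > best_ratio:
                best, best_ratio = name, gain / cost
        if best is None:  # nothing affordable adds new stories
            break
        cost, stories = remaining.pop(best)
        chosen.append(best)
        covered |= stories
        spent += cost
    return chosen, covered

# Hypothetical data: a popular blog with many posts, a summarizer, a niche blog.
blogs = {
    "popular": (20, {"s1", "s2", "s3", "s4"}),
    "summarizer": (3, {"s1", "s2", "s3"}),
    "niche": (2, {"s5"}),
}
chosen, covered = pick_blogs(blogs, budget=6)
print(chosen, sorted(covered))
```

With a budget of six posts the popular blog is never affordable, so the summarizer and the niche blog are picked instead, which mirrors the point about summarizer blogs in the quote above.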

Anyway, I guess I should go back to my PhD writing and wait patiently till people who read the paper without skipping the math do something with it. So far I’m happy that the paper promises lots of interesting developments and that it also makes me feel less guilty about our alternative approach to vaccination by suggesting that “uniform immunization strategy corresponds to randomly placing sensors in a water network” (p.22), which is not optimal :)))

Archived version of this entry is available at http://blog.mathemagenic.com/2007/11/14.html#a1953; comments are here.

June 12th 2007

Tools to find similarity between two texts (weblog and papers)

I’m playing with an idea of comparing (parts of) my weblog with some of my published papers (and with the dissertation as a whole when I’m done). So far I’m interested in two things:

  • how much of the text is reused
  • how conceptually close two texts (weblog and a paper) are

Thought of a couple of ways to do so:

  • One way would be to use all kinds of weblog analysis tools from Anjo. One of the difficulties there would be to figure out how to find similarities between weblog text, which consists of relatively self-contained microcontent pieces, and linear “build upon previously said” academic papers.
  • Another option would be to use some plagiarism detection tools. I only wonder if you can configure those to compare a target paper with a specific weblog, rather than with “everything published”.
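As a rough starting point, both questions can be approximated with very simple measures: the share of a paper's word n-grams that also appear in the weblog (text reuse), and the cosine similarity of word counts (a crude proxy for conceptual closeness). The sketch below assumes plain-text input; the function names and sample sentences are made up for illustration, and it is not how any particular plagiarism detection tool works.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def reuse_share(source, target, n=3):
    """Share of the target's word n-grams that also occur in the source."""
    def ngrams(words):
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    target_grams = ngrams(tokens(target))
    if not target_grams:
        return 0.0
    return len(target_grams & ngrams(tokens(source))) / len(target_grams)

def cosine_similarity(a, b):
    """Cosine similarity of raw word-count vectors (0 means no shared words)."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

post = "weblogs support personal knowledge management"
paper = "we argue that weblogs support personal knowledge management in practice"
print(reuse_share(post, paper), cosine_similarity(post, paper))
```

Here `source` would be the weblog text and `target` the paper, which restricts the comparison to one specific weblog rather than “everything published”.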

Any ideas?

Archived version of this entry is available at http://blog.mathemagenic.com/2007/06/12.html#a1909; comments are here.

November 22nd 2006

Open issues for research/thinking on communities

Had the pleasure of talking with Nancy about her work on technologies for communities. Some things are still hanging around in my head, so I guess I’ll just write them here to move on.

Open issues for research/thinking on communities (communities of practice; KM perspective).

Definitions. Ton cites Marc Smith:

… let’s shelve the word ‘community’ and use and study the term collective action instead. There are over 150 definitions of community by social scientists. If we (the social scientists) are not able to decide what it is, maybe everybody else should not be using the word either…

I agree with both that there are no good definitions and I like ‘collective action’ as a term, but I think it doesn’t work if you want to talk about specifics. It could include anything from a loosely coupled network to a community with shared language and practice or a project group with tight deliverables and deadlines. The boundaries between those are fluid, but they (at least in the extremes) are different in many respects (e.g. relational density, levels of trust, shared understanding, goal-orientedness, etc.)

Bottom-up evolution vs. top-down control in supporting communities. See the discussion at Dave Snowden’s blog.

Personal vs. social in community tools. Most of the community tools are group-focused (although Nancy is right, it’s getting more and more blurred). However, many of us are members of multiple communities and have to deal with different group tool configurations for all of them. Technology-wise I’d love to see more work on something like personal learning environments (slides with more) for networking and collaboration: a toolset that would allow me to participate in different social spaces without learning yet another interface.

Aggregation of digital traces and social effects of those. Digital traces we leave eventually get aggregated and fed back to the social spaces we participate in or to some members of those (think of a community moderator who has access to stats on your activity in a community). They change knowledge we have about each other and eventually change the dynamics of our relationships and interactions (think of gaming the ratings or effects of metrics to measure community things in a corporate context). This is going to be bigger and scarier (at least for those people like me :), so we need to know more about it.

Archived version of this entry is available at http://blog.mathemagenic.com/2006/11/22.html#a1857; comments are here.

November 8th 2006

Understanding weblog communities through digital traces: a framework, a tool and an example

Anjo Anjewierden and Lilia Efimova. Understanding weblog communities through digital traces: a framework, a tool and an example. In Proceedings International Workshop on Community Informatics (COMINF 2006), pp. 279-289, Montpellier, 2006 (November). Springer, LNCS 4277.

Abstract. Often research on online communities could be compared to archaeology (Jones, 1997): researchers look at patterns in digital traces that members leave to characterise the community they belong to. Relatively easy access to those traces and a growing number of methods and tools to collect and analyse them make such analysis increasingly attractive. However, a researcher is faced with difficult tasks of choosing which digital artefacts and which relations between them should be taken into account, and how the finding should be interpreted to say something meaningful about the community based on the traces of its members.

In this paper we present a framework that allows categorising digital traces of an online community along five dimensions (people, documents, terms, links and time) and then describe a tool that supports the analysis of community traces by combining several of them, illustrating the types of analysis possible using a dataset from a weblog community.

I should have blogged it a while ago :)

Anyway, the paper is good to get an idea of what we (Anjo, me, Rogier Brussee and Robert de Hoog) have been doing behind the scenes with respect to understanding and visualising patterns in weblog communities.

Hmm, given how many bits and pieces are already there I should write more on it…

Archived version of this entry is available at http://blog.mathemagenic.com/2006/11/08.html#a1852; comments are here.

October 19th 2006

Weblog-mediated relationship: a co-constructed narrative

It’s online as promised.

Artefacts of a weblog-mediated relationship

Efimova, L. & Ben Lassoued, A. (forthcoming) Weblog-mediated relationship: a co-constructed narrative, in S. Holland (Ed.) Remote relationships in a small world, Peter Lang Publishing.

Weblogs provide a fertile ground for finding interested others and getting into closer contact. As visible from our case, the beginning of this process can be asymmetrical and doesn’t necessarily imply a commitment to communicate from both sides, but over time blogging strangers can turn into blogging friends. Based on our own case we cannot provide definite answers why this happens, but there are a few factors that did it for us: reciprocity of potential benefits from communicating to each other, vulnerable writing and an ability to go beyond blogging in our choice of communication media.

A few notes:

  • It refers to lots of existing bits and pieces:
    • blog posts/comments that are treated as artefacts
      • most of those are linked from the text and I’ll see if I can make a visualisation with linking (since not all of those appear referenced in the text)
      • those links (for obvious reasons) will not appear in the printed version
    • meta-pieces (drafted fragments of the paper)
    • visuals on Flickr
  • I have permission to post it online, but only till the book is published (somewhere in 2007). I don’t really get the logic of it, but anyway - make sure you read it before that :)

Archived version of this entry is available at http://blog.mathemagenic.com/2006/10/19.html#a1846; comments are here.

    October 9th 2006

    NOAGGREGATE: what if I don’t want my digital bits to be connected at one place?

    Ton in Weaving Webs: How to Quickly Find Somebody’s Online Traces?:

    As I do after each conference I am currently busy finding people on-line and adding them to my ’social filter’ after BlogTalk Reloaded. Basically that means finding their on-line presences and adding them to my feedreader, and connecting to them in different environments such as Plazes, Skype, Flickr, OpenBC/Xing, LinkedIn, 43People etc. Weaving them into my social web so to speak.

    Ton is not alone in that: each f2f meeting I participate in is followed by a surge of “let’s be friends” requests over many platforms. It’s becoming a practice that will eventually be supported by some tool like the one Ton wants:

    Would there be a way to create a search agent that takes the name of a person you’ve met? Ideally you would provide such a search agent with your own account data of all the environments you are part of that you want to have searched. And then it comes back with a number of likely search results that might contain any or all of the following for instance:

    Possible blogs of that person
    Possible Flickr Feed, or 23 feed
    Possible Skypename
    Possible IM names
    Profile in OpenBc.com
    Profile in LinkedIn.com
    Profile at 43people.com
    Possible Plazes account
    Possible del.icio.us account

    I have very mixed feelings about it, similar to those in the comment by Marc Canter:

    Clearly their is a need for such a search function, but it steps right onto the issue of privacy and security on the web.

    For me, as someone who wants to ‘bookmark’ digital bits of people I met offline, having a tool like that would be great. For me, as the one ‘being searched for’, it sounds like a nightmare: I’m not happy when others connect my online dots on one page, especially if I don’t know them.

    For me leaving my bits online is a conscious choice, but leaving them disintegrated ‘all over the place’ is a conscious choice as well: if I choose to share specific things in specific contexts and not put everything on the same page, I have reasons to do so. And I’d like those reasons to be respected by whatever search tools (as they are currently supposed to respect NOINDEX and NOFOLLOW of web-pages). In the end I want to have at least some rights over my own bits (e.g. digital traces not being aggregated without explicit consent)…

    So, coming back to Ton’s problem - one of the options that I could imagine is ‘Plaze-based’ search, an advanced version of something I experienced at SHiFT:

    • there I could easily see others on the same network and add them as contacts
    • once they confirmed I could get their basic info (like IM names or web-site links)
    • I can also see ‘plazes we have in common’, which could provide some context on the specific location (often associated with an event) where we have met

    Of course, this is yet another centralised system (with all the problems of that), but at least it does a few things:

    • makes ‘search’ much easier (by showing only people who are at the same location as me at the moment)
    • provides people with a choice of how much ‘aggregated in one space’ information they want to share with me
    • provides me with some clues about the history of our relationship (if you want to get into that deeper - make sure to check danah boyd’s Master thesis for the idea of Digital Mirror, pp. 53-59)

    So, two questions regarding all this:

    • Can my Plaze-based search be generalised to any cross-platform search?
    • Are there any chances that eventually I will be able to add a NOAGGREGATE tag or something like that (‘aggregate only for my contacts’, ‘ask first’, etc.) to my digital bits to control how they are displayed? Anything practical I can do in that respect? [pinging Suw at ORG]
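To illustrate what honouring such a tag could look like, here is a small sketch of an aggregator-side check. It assumes NOAGGREGATE would be expressed the way NOINDEX and NOFOLLOW are, as a directive in a robots meta tag. The `noaggregate` directive is purely hypothetical (no search tool or standard defines it), and the class and function names are made up for this example.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower()
                                for d in content.split(",") if d.strip()}

def may_aggregate(html):
    """True unless the page opts out with the hypothetical 'noaggregate' directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noaggregate" not in parser.directives

page = '<html><head><meta name="robots" content="noindex, NOAGGREGATE"></head></html>'
print(may_aggregate(page))
```

Like NOINDEX, a tag of this kind would only work if the tools doing the aggregation chose to respect it.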

    Archived version of this entry is available at http://blog.mathemagenic.com/2006/10/09.html#a1842; comments are here.

    September 6th 2006

    Chumbies join nabaztags :)

    While Europe is conquered by nabaztags (like our own), North America seems to become mesmerised by chumbies. In case you don’t know yet…

    …chumby, a compact device that can act like a clock radio, but is way more flexible and fun. It uses the wireless internet connection you already have to fetch cool stuff from the web: music, the latest news, box scores, animations, celebrity gossip…whatever you choose. And a chumby can exchange photos and messages with your friends. Since it’s always on, you’ll never miss anything.

    First seen at danah boyd (who is “a serious alpha-geek hacker, a clever crafter or an accomplished Flash animator” as only those people can play with “precious few prototypes” :)

    Archived version of this entry is available at http://blog.mathemagenic.com/2006/09/06.html#a1828; comments are here.

    • Welcome!

      Like my house right now, this blog is a loved but neglected space: finishing my dissertation and being a happy mom doesn't leave much energy for anything else. I'm almost there, starting to look forward to "after the PhD" life, like moving to an unknown country...