
Friday, June 20, 2008

Statistical Analysis of the Social Network and Discussion Threads in Slashdot

We analyze the social network emerging from the user comment activity on the website Slashdot.
Using Kolmogorov-Smirnov statistical tests, we show that the degree distributions are better explained by log-normal than by power-law distributions.
We also study the structure of discussion threads using an intuitive radial tree representation.
Threads show strong heterogeneity and self-similarity throughout the different nesting levels of a conversation.
We use these results to propose a simple measure to evaluate the degree of controversy provoked by a post.
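The comparison between log-normal and power-law fits mentioned above can be sketched with a Kolmogorov-Smirnov distance. This is only an illustration under stated assumptions: the fitting procedure (maximum likelihood, continuous distributions, power-law cutoff `xmin` set to the sample minimum) and the synthetic data are not taken from the paper, and all function names are hypothetical.

```python
import math
import random

def lognormal_cdf(x, mu, sigma):
    # CDF of a log-normal distribution with log-mean mu and log-std sigma.
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def powerlaw_cdf(x, xmin, alpha):
    # CDF of a continuous power law p(x) ~ x^(-alpha) for x >= xmin.
    return 1.0 - (x / xmin) ** (1.0 - alpha)

def fit_lognormal(samples):
    # Maximum-likelihood estimates: mean and std of the log-values.
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def fit_powerlaw(samples):
    # Assumption: xmin is the smallest sample; alpha by maximum likelihood.
    xmin = min(samples)
    alpha = 1.0 + len(samples) / sum(math.log(x / xmin) for x in samples)
    return xmin, alpha

def ks_distance(samples, cdf):
    # Largest gap between the empirical CDF and the model CDF.
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

# Synthetic "degrees" drawn from a log-normal, purely for illustration.
random.seed(0)
data = [random.lognormvariate(1.0, 0.8) for _ in range(2000)]

mu, sigma = fit_lognormal(data)
xmin, alpha = fit_powerlaw(data)
d_lognormal = ks_distance(data, lambda x: lognormal_cdf(x, mu, sigma))
d_powerlaw = ks_distance(data, lambda x: powerlaw_cdf(x, xmin, alpha))
print(d_lognormal, d_powerlaw)
```

A smaller distance means the model follows the empirical distribution more closely. Real degree data are discrete, so a careful analysis would use discrete distributions and a proper choice of `xmin`.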

To improve the quality and the representativeness of the resulting graph, we filter some of the comments according to the following four criteria:
1. The post: under this assumption, no relations exist between the post’s author and its direct commentators, unless the author also participates later in the discussion.
2. Anonymous comments are also discarded.
3. We discard very low quality comments with score −1.
4. Finally, we filter out self-replies, often motivated by a forgotten aspect or an error fix of the original comment.
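The four criteria above can be sketched as a filter that turns comments into directed reply edges. This is a minimal sketch, not the authors' code: the `Comment` record, its field names, and the choice to also drop replies whose parent comment was discarded are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Comment:
    comment_id: int
    author: Optional[str]     # None marks an anonymous comment (assumed encoding)
    parent_id: Optional[int]  # None marks a direct reply to the post
    score: int

def build_reply_edges(comments):
    """Return (replier, replied_to) pairs that survive the four filters."""
    by_id = {c.comment_id: c for c in comments}
    edges = []
    for c in comments:
        # Criterion 1: direct replies to the post create no relation.
        if c.parent_id is None:
            continue
        parent = by_id.get(c.parent_id)
        if parent is None:
            continue
        # Criterion 2: anonymous comments are discarded.
        if c.author is None or parent.author is None:
            continue
        # Criterion 3: comments with score -1 are discarded.
        if c.score == -1 or parent.score == -1:
            continue
        # Criterion 4: self-replies are filtered out.
        if c.author == parent.author:
            continue
        edges.append((c.author, parent.author))
    return edges

example = [
    Comment(1, "alice", None, 2),  # direct reply to the post
    Comment(2, "bob", 1, 1),       # kept: bob replies to alice
    Comment(3, None, 1, 0),        # anonymous
    Comment(4, "carol", 2, -1),    # score -1
    Comment(5, "bob", 2, 1),       # self-reply
]
edges = build_reply_edges(example)
print(edges)
```

Counting how often each pair appears in `edges` would give the weights of a weighted network.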

The size of the big cluster for strongly connected components is, of course, smaller.
[...] suggesting that the small-world property is present in all of them.
The quantities are approximately one unit lower than the corresponding value for a random graph ℓrand.
The maximal distance D between two users is also very small.
Even for the undirected sparse case, it only takes a maximum of eleven steps to reach a user starting randomly from any other.
These results are also in accordance with similar studies of other traditional social networks.
To study the statistical level of cohesiveness we calculate the clustering coefficient C according to [23], and also its weighted version Cw [1].
We notice no significant differences between them.
Thus the number of messages exchanged between two users is not relevant for determining the clustering level.
The impact of having a weighted network is analyzed in more detail in Section 2.5.
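The quantities discussed above (average path length, maximal distance D, and clustering coefficient C) can be computed on a small undirected graph as follows. This is a sketch under assumptions: it uses the usual local clustering coefficient averaged over nodes, which may differ in detail from the definition in [23], it counts nodes with fewer than two neighbours as 0, and it assumes a connected graph. The function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    # Shortest-path lengths (in hops) from source to every reachable node.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_length_stats(adj):
    # Average shortest-path length and maximal distance over reachable pairs.
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return (total / pairs if pairs else 0.0), diameter

def mean_clustering(adj):
    # Fraction of a node's neighbour pairs that are themselves connected,
    # averaged over all nodes (nodes with < 2 neighbours count as 0).
    coeffs = []
    for u, nbrs in adj.items():
        nb = list(nbrs)
        k = len(nb)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nb[j] in adj[nb[i]])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

# Toy graph: a triangle a-b-c with an extra user d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
avg_len, max_dist = path_length_stats(toy)
clustering = mean_clustering(toy)
print(avg_len, max_dist, clustering)
```

Running one breadth-first search per node is fine for a toy graph; a network with tens of thousands of users would call for sampling or a dedicated graph library.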

We associate with each user a score, calculated by averaging the scores of all comments written by that user.
This quantity allows us to differentiate high-quality writers (those with high mean score) from regular-quality writers.
The initial score of a comment is generally 1 if it comes from a registered user or 0 if it is anonymous.
Moderation can modify the initial score to any integer within the range [−1, 5].
To ensure a representative subset of the network, we only consider users who wrote at least 10 comments, a total of 18,476 users, representing approximately 23%.
Note that the minimum score is 0, since we eliminate −1 comments.
The distribution shows an unexpected bimodal profile, with two peaks at mean scores 1.1 and 2.3.
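The per-user mean score described above can be sketched as follows. The input format (a list of `(author, score)` pairs, with `None` for anonymous authors) is an assumption made for illustration; the threshold of 10 comments and the exclusion of −1 comments follow the text.

```python
from collections import defaultdict

def mean_scores(comments, min_comments=10):
    """Map each sufficiently active user to the mean score of their comments."""
    per_user = defaultdict(list)
    for author, score in comments:
        # Anonymous comments and comments with score -1 are not considered.
        if author is None or score == -1:
            continue
        per_user[author].append(score)
    return {
        user: sum(scores) / len(scores)
        for user, scores in per_user.items()
        if len(scores) >= min_comments
    }

# Toy data: alice has 10 valid comments, bob only 3.
sample = [("alice", 2)] * 10 + [("alice", -1)] + [("bob", 5)] * 3
user_scores = mean_scores(sample)
print(user_scores)
```

A histogram of the resulting values is what would reveal a bimodal profile such as the one reported.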

We take a simple approach based on agglomerative clustering which takes advantage of the weighted nature of the Slashdot network [18].
We choose the dense undirected network and start our procedure with each node as an independent cluster.
Let λ denote a threshold on the number of comments, so that pairs of users (i, j) who exchange a number of comments w_ij ≥ λ are included in the network, and all other connections are discarded.
Starting from the largest value λ = λmax and progressively decreasing it, users are connected incrementally and communities can be obtained.
This simple procedure is equivalent to building a dendrogram and allows us to browse the community structure at different scales by changing the parameter λ.
The vast majority of pairs of users exchange only a small number of comments, whereas a few of them maintain really intense dialogues during the year.
This seems to be the reason why previous properties such as the clustering coefficient do not show significant differences between the weighted and the unweighted network.
The most active pair of users exchanged a total of 108 comments.
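The thresholding procedure described above can be sketched with a union-find structure: for a given λ, only pairs with weight at least λ are merged, and the connected components are read off as communities. The data layout (a dict from user pairs to comment counts) and the function names are assumptions for illustration.

```python
def components_at_threshold(weights, lam):
    """Connected components when only pairs with w_ij >= lam are linked.

    weights: dict mapping (i, j) user pairs to the number of exchanged comments.
    Returns a list of sets of users, largest component first.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), w in weights.items():
        ri, rj = find(i), find(j)  # registers both users as singletons
        if w >= lam and ri != rj:
            parent[ri] = rj

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(groups.values(), key=len, reverse=True)

def sweep(weights):
    # Decrease lam from its largest value down to 1, one step at a time.
    lam_max = max(weights.values())
    return {lam: components_at_threshold(weights, lam)
            for lam in range(lam_max, 0, -1)}

toy_weights = {("a", "b"): 5, ("b", "c"): 2, ("c", "d"): 1}
sizes = {lam: [len(c) for c in comps] for lam, comps in sweep(toy_weights).items()}
print(sizes)
```

Recomputing the components for every λ keeps the sketch short; processing edges once in decreasing weight order would build the same dendrogram incrementally.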

We can see that the biggest component grows very fast while the second biggest remains small, showing evidence of a giant cluster present at all scales.

An initial picture of the activity generated by posts can be found in previous studies [12].
Posts receive on average approximately 195 comments and there exists a clear scale in the number of comments a post can originate.
Half of them receive fewer than 160 contributions.
A small number of highly discussed ones, however, can trigger more than one thousand contributions.
The number of comments gives an idea of how the participation is distributed among the different articles, but is not enough to quantify the degree of interaction.
For instance, a post may incite many readers to comment, but if the author of a comment does not reply to the responses to that comment, there is no reciprocal communication within the thread.
In this case, although users may participate significantly, we can hardly say that the post has been highly discussed.
On the other hand, a post with a small number of contributors but with one long dialogue chain will show a high degree of reciprocal interaction (although its general interest may be limited).

For deeper nesting levels, comments can be fully shown (score 4 or above), abbreviated (score between 1 and 4) or hidden (score below 1).
We propose a natural representation of thread discussions which takes advantage of their structure.
Consider a post as a central node.
Direct replies to this post are attached at a first nesting level and subsequent comments at increasing nesting levels. The whole thread can thus be considered a circular structure which grows radially from a central root during its lifetime: a radial tree.
Figure 7 shows three snapshots of a radial tree associated to a controversial post which attracted a lot of users.
An analogous example of a less discussed post can be seen in Figure 8.
More examples of trees are shown in Figure 9.
Their profiles are highly heterogeneous.
In some examples, a huge number of contributions without replies appear in the first level only, resulting in trees with large widths but small depths.
In other examples, however, there are only discussions between two users who comment alternately, giving rise to very deep trees with small widths.
Sometimes, the intensity of the discussion is shifted to one of the branches because of a controversial comment which triggers even more reactions than the original post (e.g., the post in the center of Figure 9).
Apart from being a useful tool for browsing and examining the contents of a highly discussed post, radial trees can be used to describe statistically how information is structured in a thread.
In Figure 10a we plot the distribution of all the extracted comments per nesting level for all posts.
This gives an idea about the relation between the width versus the depth of the trees.
The first two levels contain most of the comments, and their number then decays exponentially as a function of the depth.
The maximum depth was 17.
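The distribution of comments per nesting level can be derived from parent pointers alone. The sketch below assumes each comment is stored with the id of the comment it replies to (`None` for a direct reply to the post); this layout and the function names are hypothetical.

```python
from collections import Counter

def nesting_levels(parents):
    """Nesting level of each comment; direct replies to the post are level 1.

    parents: dict mapping comment id -> parent comment id (None for the post).
    """
    depth = {}
    for cid in parents:
        # Walk up until reaching the post or a comment with a known level.
        chain = []
        cur = cid
        while cur is not None and cur not in depth:
            chain.append(cur)
            cur = parents[cur]
        level = 0 if cur is None else depth[cur]
        # Assign levels on the way back down.
        for node in reversed(chain):
            level += 1
            depth[node] = level
    return depth

def comments_per_level(parents):
    # Width of the radial tree at each nesting level.
    return Counter(nesting_levels(parents).values())

toy_thread = {1: None, 2: None, 3: 1, 4: 3, 5: 1}
widths = comments_per_level(toy_thread)
max_depth = max(widths)
print(dict(widths), max_depth)
```

Summing these counters over all posts gives a per-level distribution of the kind plotted in Figure 10a.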

It is important to note that a definition of controversial is necessarily subjective.
However, indicators such as the number of comments received or the maximum depth of the discussions are, among others, good candidate quantities to evaluate the controversy of a post, although they suffer from some drawbacks, as we explain in what follows.
We therefore seek a measure, as simple as possible, which incorporates as many of these factors as possible and is able to rank a set of posts properly.
The number of comments alone does not tell us much about the structure of the discussion.
There might be a lot of comments in the first level but very little real discussion, such as in the post of Figure 12a.
A better measure for the controversy of a post seems to be the maximum depth of the nesting.
But again that measure has some drawbacks.
Two users may become entangled in some discussion without participation of the rest of the community, increasing the depth of the thread.
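The two drawbacks can be illustrated on toy threads. The sketch below only computes the two candidate indicators named in the text plus the number of distinct participants; it does not implement the measure the authors eventually propose, and the thread encoding and all names are assumptions.

```python
def thread_indicators(thread):
    """Simple indicators for one thread.

    thread: list of (author, parent_index) in posting order, where
    parent_index is the list index of the comment replied to, or None
    for a direct reply to the post. Parents must precede their replies.
    """
    depths = []
    for _author, parent in thread:
        depths.append(1 if parent is None else depths[parent] + 1)
    return {
        "comments": len(thread),
        "max_depth": max(depths, default=0),
        "participants": len({author for author, _ in thread}),
    }

# Many first-level comments, no replies: high count, depth 1.
wide = [(f"user{i}", None) for i in range(20)]
# Two users replying to each other: few comments, large depth.
duel = [("ann" if i % 2 == 0 else "ben", None if i == 0 else i - 1)
        for i in range(8)]

wide_stats = thread_indicators(wide)
duel_stats = thread_indicators(duel)
print(wide_stats)
print(duel_stats)
```

Ranking by comment count puts the first thread on top although it contains no discussion, while ranking by depth favours the second although only two users take part.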

Unlike the BBS network [24, 8], where discussions are unrestricted, the scoring system of Slashdot guarantees a high quality and representativeness of the social interaction.
This particular feature allowed us to find a correlation between scores and the number of received replies, and to distinguish clearly between two classes of users: good writers, who on average achieve high scores for their comments, and regular writers.
The number of replies of a comment depends mostly on its quality (the score it achieved) but we find some weak evidence for user reputation influencing the connectivity in the network.
Good writers are more likely than regular ones to receive replies to occasional comments with low scores.
However, this effect is not strong enough to cause assortative mixing by score since the opposite is not true.
Regular writers can expect a similar number of replies as good writers to their comments with high scores, so there is no negative effect of a user’s reputation.

Comments:
1. I think the authors could compare the results with some of the practices in the community and combine the two; this might help us gain a better understanding of the meaning behind these phenomena.

2. Perhaps the authors could explain the findings in more detail; this would also stimulate the reader's interest in the information.