10.) For a cluster containing a single paper, the centroid is simply the reference vector of the paper itself. (Our clustering method is thus a standard centroid-based agglomerative clustering technique based on cosine similarity ref. When merging two papers or clusters, we represent the new cluster by the normalized sum of all of the individual papers' reference vectors, called the “centroid” of the cluster. To get a distance measure between papers, we simply use 1- the cosine, so the distance between papers ranges from 0 to 1. So, if two papers have no references in common, then their similarity is minimal, i.e., 0 (90° angle) two papers citing exactly the same set of papers have maximal similarity, i.e., 1 (0° angle). r q represents the inner product of r p and r q and|| r p|| represents the length of vector r p.We use the notion of natural communities to show that we can track these natural communities effectively over time, and can therefore characterize the temporal evolution of the network. We refer to such stable clusters as natural communities. Moreover, these stable clusters appear to correspond to the true underlying community structure of the network. Fortunately, we found that, when performing a series of agglomerative clustering runs, each run on slightly perturbed input data, one can identify a stable set of clusters that occur in a significant proportion of the clusterings. These random clusters obscure the real temporal changes. In the clusterings of our linked network data, we found there are a large number of relatively random clusters that do not correspond to real community structures. If the clusterings are very sensitive to small perturbations of the input data, distinguishing between “real” changes versus “accidental” changes in the higher-level structure becomes difficult, if not impossible. However, in tracking changes over time, we need to be able to find corresponding communities in clusterings taken from the data at different points in time. For obtaining a static view of the higher-level structure of the data, such instabilities may be acceptable because the resulting hierarchy often already reveals interesting structure. Clustering methods can be surprisingly sensitive to minor changes of the input data. This approach places a new burden on the underlying clustering method. By clustering the network at different points in time, we study its temporal evolution. In our approach, we use agglomerative clusterings of the linked network. We demonstrate that such natural communities allow us to identify emerging communities and track temporal changes in the underlying structure of our network data. By identifying the subset of clusters that remain stable under multiple clustering runs, we get the set of natural communities that we can track over time. However, certain subsets of papers, called natural communities, correspond to real structure in the CiteSeer database and thus appear in any clustering. One reason for this is that the order in which papers within communities are combined is somewhat arbitrary. However, small perturbations of the CiteSeer data lead to significant changes to most of the clusters. Tracking changes over time requires a clustering algorithm that produces clusters stable under small perturbations of the input data. We examine a large real-world data set: the NEC CiteSeer database, a linked network of >250,000 papers. We are interested in tracking changes in large-scale data by periodically creating an agglomerative clustering and examining the evolution of clusters (communities) over time.
0 Comments
Leave a Reply. |