Estimating Network Metrics via Random Walk Sampling

Zheng, Minhui

doi:https://doi.org/10.21985/n2-ky4v-8s06

Work

Estimating Network Metrics via Random Walk Sampling

Public

Download PDF

Download All Files (.zip)

In this thesis we present methods for estimating network metrics via random walk sampling. More specifically, we generalize the Hansen-Hurwitz estimator and the Horvitz-Thompson estimator to estimate the shortest path length distribution (SPLD), closeness centrality ranking, and clustering coefficients of a network. Those are important metrics to a network, but when a network is large measuring the exact value is computationally expensive. Therefore we adopt random walk sampling to collect information as we explore the network, and then provide estimations for these metrics. Inspired by the strong ability of random walks to uncover shortest paths in a network, we first propose estimators for the shortest path length distribution (SPLD). There are two problems associated with this estimating process: 1) pairs of nodes (dyads) are sampled with unequal probabilities by a random walk, 2) the actual shortest path length (SPL) cannot by observed from the induced subgraph. To deal with the unequal selection probabilities issue, we generalize the Hansen-Hurwitz estimator and Horvitz-Thompson estimator (and their ratio forms) and apply them to the sampled dyads. Based on theory of Markov chains we prove that the selection probability of a dyad is proportional to the product of the degrees of the two nodes. To approximate the actual SPL for a dyad, we use the observed SPL in the induced subgraph for networks with large degree variability, i.e., the standard deviation is at least two times of the mean, and we estimate the SPL using landmarks for networks with small degree variability. We find that an estimator based on a random walk with at least 20% sampling budget can achieve high estimation accuracy, and save 94% to 96% of the computational time. To the best of our knowledge, this is the first non-parametric algorithm that can estimate the SPLD of a network with- out knowledge of the degree distribution. Using the same techniques to account for unequal selection probabilities and approxi- mating SPLs between sampled nodes, we then estimate the closeness centrality of sampled nodes. But in order to estimate the closeness centrality ranking of a node, which is more of interest rather than the closeness centrality value, we introduce two more steps in the algorithm. We first apply a weighted kernel estimator to estimate the smooth popula- tion cumulative distribution function (CDF) of closeness centrality, and then compute the estimated closeness centrality rank of a node from that estimated CDF. This algorithm provides a continuous function as an estimate for the population CDF of closeness centrality and an accurate estimate for the rank of closeness centrality of each node in the network. We finally look at the clustering coefficients of a network. The clustering coefficient of a graph measures the average probability that two neighbors of a node are themselves neighbors. People have defined the global clustering coefficient (GCC) and the local clustering coefficient (LCC). The global clustering coefficient (GCC) is the fraction of paths of length two that are closed in the network, and the local clustering coefficient (LCC) for a single node is the fraction of pairs of neighbors of the node that are connected. The average LCC (ALCC) is the unweighted average LCC, while the GCC is equal to weighted average of the LCCs of the nodes. We generalize the Hansen-Hurwitz estimator to estimate the GCC and the ALCC. By simulation studies and applications to real networks, we find that if we can observe all neighbors of a sampled node and count the exact number of connections among the neighbors, the estimators for both the GCC and the ALCC will be unbiased with small variance.

Creator