Bias Correction in Clustering Coefficient Estimation
Roohollah Etemadi, Jianguo Lu
School of Computer Science, University of Windsor
Windsor, Ontario N9B 3P4. Canada


    Clustering coefficient (C) is an important structural property to understand the complex structure of a graph. Calculating C is computationally an expensive task. Thereby, sampling based methods have attracted substantial research for estimating C, and the closely related metric, the number of triangles. A widely used estimator for C is biased. We quantify the bias using Taylor expansion and find that the bias can be determined by the number of shared wedges and triangles in the sample. Based on the understanding of the bias, we give a new estimator that corrects the bias. The results are derived analytically and verified extensively in 56 networks ranging in different size and structure. The experiments reveal that the bias ranges widely from data to data. The relative bias can be as high as 4 percent or can be negative. For most of the graphs, the bias is small, although every graph does have a bias as quantified by our analytical results. Negative or small biases occur in online social networks where clustering coefficient is typically high. Positive and large biases typically occur in Web graphs, where there are nodes with high degrees but few neighboring nodes connecting with each other.


    Graph sampling; Clustering Coefficient; Estimating algorithms; Bias; Variance.