# Biblio

The study of large, “big data” networks is becoming increasingly common and relevant to our understanding of human systems. Many of the studied networks are drawn from social media and other web-based sources. In-depth analysis of these dynamic structures, e.g., in the context of cybersecurity, remains especially challenging. Because computing network measures for large networks is costly in time and resources, it is practical to approximate them whenever possible. We present approximation techniques that exploit tractable relationships between the measures and network characteristics such as size and density. We find distinct functional relationships between complex “slow” measures and “fast” measures, such as the link between betweenness centrality and network density. We also track how these relationships scale with network size. Specifically, we explore the efficacy of both linear modeling (i.e., correlations and least squares regression) and non-linear modeling in estimating the network measures of interest. We find that sparse, but not severely sparse, networks which admit sufficient entropy incur the most variance in the network statistics and, hence, more error in the estimation. We review our approaches with three prominent network topologies: random (aka Erdős-Rényi), Watts-Strogatz small-world, and scale-free networks. Finally, we assess how well the estimation approaches perform for sub-sampled networks.
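
The idea of estimating a "slow" measure from a "fast" one can be illustrated with a minimal sketch. It assumes Erdős-Rényi graphs only, uses mean unnormalized betweenness as the slow measure and density as the fast one, and fits a simple least squares line. The function names, graph sizes, and densities are illustrative choices and are not taken from the paper.

```python
import random
from collections import deque


def erdos_renyi(n, p, rng):
    """Adjacency dict of a G(n, p) random graph."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def density(adj):
    """Fast measure: fraction of possible edges that are present."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2
    return 2 * m / (n * (n - 1))


def betweenness(adj):
    """Slow measure: Brandes' algorithm, unweighted and undirected."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while order:                  # accumulate in reverse BFS order
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


def least_squares(xs, ys):
    """Slope and intercept of the ordinary least squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_mean_betweenness(n, densities, reps, seed=0):
    """Regress mean betweenness on observed density (hypothetical helper)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for p in densities:
        for _ in range(reps):
            g = erdos_renyi(n, p, rng)
            xs.append(density(g))
            ys.append(sum(betweenness(g).values()) / n)
    return least_squares(xs, ys)


if __name__ == "__main__":
    slope, intercept = fit_mean_betweenness(60, [0.1, 0.2, 0.3, 0.4, 0.5], reps=3)
    print(f"mean betweenness ~ {slope:.2f} * density + {intercept:.2f}")
```

A straight line is only the simplest candidate here; the abstract also considers non-linear models, which this sketch does not attempt.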

Social media data and other web-based network data are large and dynamic, rendering the identification of structural changes in such systems a hard problem. Typically, online data is constantly streaming and therefore incomplete, which makes it necessary to understand the robustness of network metrics on partial or sampled network data. In this paper, we examine the effects of sampling on key network centrality metrics using two empirical communication datasets. Correlations between the network metrics of nodes in the original and sampled graphs offer a measure of sampling accuracy. The relationship between sampling and accuracy is convergent and amenable to nonlinear analysis. Naturally, larger edge samples induce sampled graphs that are more representative of the original graph. However, this effect is attenuated when larger sets of nodes are recovered in the samples. We also find that graph structure plays a prominent role in sampling accuracy. Centralized graphs, in which fewer nodes enjoy higher centrality scores, offer more representative samples.
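
The accuracy measure described above can be sketched as follows. This is a hedged illustration: it assumes uniform random edge sampling, degree as the centrality metric, and a synthetic random graph in place of the paper's empirical datasets. All function names are hypothetical.

```python
import random


def random_edges(n, m, rng):
    """m distinct undirected edges on n nodes (synthetic stand-in data)."""
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    return sorted(edges)


def degrees(edges):
    """Degree of every node that appears in the edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def sampling_accuracy(edges, fraction, rng):
    """Correlate original and sampled degree over the recovered nodes."""
    k = max(1, round(fraction * len(edges)))
    sample = rng.sample(edges, k)      # uniform edge sample
    full = degrees(edges)
    part = degrees(sample)
    nodes = sorted(part)               # only nodes present in the sample
    return pearson([full[v] for v in nodes], [part[v] for v in nodes])


if __name__ == "__main__":
    rng = random.Random(0)
    edges = random_edges(200, 800, rng)
    for fraction in (0.1, 0.2, 0.4, 0.8):
        r = sampling_accuracy(edges, fraction, rng)
        print(f"fraction={fraction:.1f}  correlation={r:.3f}")
```

Repeating the sampling step many times per fraction and averaging the correlations gives a curve whose shape can then be examined; the effect of graph centralization would need graphs with a more skewed degree distribution than this synthetic one.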