Detection and analysis of large-scale Internet infrastructure outages
ABSTRACT
Our dependence on the Internet has rapidly grown much stronger than our comprehension of its underlying structure, global dynamics, operational threats, and overall network health. Wide-scale Internet service disruptions – even politically motivated interference with Internet access in order to hinder anti-government organization – are not new. But the scale, duration, coverage, and violent context of the government-mandated country-level Internet censorship episodes in 2011 inspired scientific as well as popular interest in capabilities to not only detect but quickly and thoroughly characterize the causes of reachability problems.
We propose to apply successful results in analyzing recent large-scale Internet outages to the development, testing, and deployment of an operational capability to detect, monitor, and characterize such large-scale infrastructure outages. We have developed and demonstrated a methodology that can identify not only which networks have been affected by an outage, but also which techniques have been used to effect a deliberate disruption (e.g., control plane vs. data plane intervention). We have also developed metrics to quantitatively gauge the geographic and topological extent of impact of geophysical disasters on Internet infrastructure, and techniques to thoroughly investigate the chronological dynamics of the outage and restoration. Our approach relies on: (1) the extraction of signal from a pervasive and continuous source of malware-induced background radiation in Internet traffic (IBR); and (2) combining multiple types of data (active probing, passive IBR measurement, BGP routing data, and address geolocation and registry databases) to assess the scope and progression of the outage.
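As a minimal illustration of how a signal might be extracted from IBR, the sketch below flags time bins in which the number of unique IBR source addresses observed from a region falls well below its recent level. The function name, the trailing-median baseline, the window length, and the drop threshold are all assumptions chosen for illustration; the abstract does not specify the detection rule actually used.

```python
from statistics import median


def flag_outage_bins(counts, window=6, drop_ratio=0.5):
    """Return indices of time bins whose count drops sharply.

    counts     -- unique IBR source addresses seen per time bin for one
                  region (hypothetical input; binning is an assumption)
    window     -- number of preceding bins used as the baseline (assumed)
    drop_ratio -- a bin is flagged when its count is below
                  drop_ratio * median(baseline) (assumed threshold)
    """
    flagged = []
    for i in range(window, len(counts)):
        # Baseline is the median of the preceding `window` bins.
        baseline = median(counts[i - window:i])
        if baseline > 0 and counts[i] < drop_ratio * baseline:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Synthetic counts, not measured data: a sharp dip in bins 6 and 7.
    series = [100, 102, 98, 101, 99, 100, 10, 8, 97, 100]
    print(flag_outage_bins(series))
```

A flagged bin is only a candidate event. Under the approach described above, it would then be checked against the other data types (active probing, BGP routing data, geolocation and registry databases) before being characterized as an outage.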
We propose three tasks: (1) investigate and define strategies for combining multiple data sources to establish indicators that most effectively support detection, characterization, and root cause analysis of outage events; (2) define the requirements of a monitoring platform for the automated detection and analysis of large-scale outages; (3) develop, test, and experimentally deploy this system. The first task will investigate trade-offs among accuracy, precision, computational and storage efficiency, and practical applicability of previously proposed and new metrics. Results from Task 1 will inform Task 2, including how to trigger targeted measurements. We will pursue Task 3 in parallel, iterating and refining metrics and techniques as we experiment with operational deployment.
Intellectual merit. This project will result in an experimental operational deployment to validate and extend an empirically grounded methodology for detection and analysis of large-scale Internet outages. In addition to improving our understanding of how measurements yield insights into network behavior, and strengthening our ability to model large-scale complex networks, use of such a system will also illuminate infrastructure vulnerabilities that derive from architectural, topological, or economic constraints, suggesting how to mitigate or eliminate these weaknesses in future Internet architecture and measurement research.
Broader impact. Consistent with the SaTC program goals, the primary objective of this project is to convert successful research results into a deployed platform to detect and monitor connectivity disruption and censorship events on a planetary scale. Situational awareness of the nature and causes of network outages is essential to national decision-makers who must determine the type and extent of proper response. Our results will be widely disseminated to research, commercial, and government sectors, informing communications and technology policies. The developed tools will enable transformative capabilities, providing empirical grounding to substantiate hypothesized correlations between technical and socio-political-economic events.
Award ID: 1228994