SoS Newsletter

System Recovery

System recovery following an attack is a core cybersecurity issue. Current research on undoing data manipulation and recovering lost or extruded data in distributed, cloud-based, or other large-scale complex systems is producing new approaches and methods. The articles cited here are from the first half of 2014.

  • Silei Xu; Runhui Li; Lee, P.P.C.; Yunfeng Zhu; Liping Xiang; Yinlong Xu; Lui, J.C.S., "Single Disk Failure Recovery for X-Code-Based Parallel Storage Systems," Computers, IEEE Transactions on , vol.63, no.4, pp.995,1007, April 2014. In modern parallel storage systems (e.g., cloud storage and data centers), it is important to provide data availability guarantees against disk (or storage node) failures via redundancy coding schemes. One coding scheme is X-code, which is double-fault tolerant while achieving the optimal update complexity. When a disk/node fails, recovery must be carried out to reduce the possibility of data unavailability. We propose an X-code-based optimal recovery scheme called minimum-disk-read-recovery (MDRR), which minimizes the number of disk reads for single-disk failure recovery. We make several contributions. First, we show that MDRR provides optimal single-disk failure recovery and reduces about 25 percent of disk reads compared to the conventional recovery approach. Second, we prove that any optimal recovery scheme for X-code cannot balance disk reads among different disks within a single stripe in general cases. Third, we propose an efficient logical encoding scheme that issues balanced disk read in a group of stripes for any recovery algorithm (including the MDRR scheme). Finally, we implement our proposed recovery schemes and conduct extensive testbed experiments in a networked storage system prototype. Experiments indicate that MDRR reduces around 20 percent of recovery time of the conventional approach, showing that our theoretical findings are applicable in practice.
    Keywords: disc storage; encoding; parallel memories; redundancy; reliability; storage management; system recovery; MDRR; X-code-based optimal recovery scheme; X-code-based parallel storage systems; cloud storage; data availability; data centers; double-fault tolerant coding scheme; logical encoding scheme; minimum-disk-read-recovery; networked storage system prototype; optimal single-disk failure recovery; optimal update complexity; redundancy coding schemes; single disk failure recovery algorithm; Arrays; Complexity theory; Data communication; Encoding; Load management; Peer to peer computing; Reliability; Parallel storage systems; coding theory; data availability; recovery algorithm (ID#:14-2239)
  • Malik, O.A; Senanayake, S.M.N.; Zaheer, D., "An Intelligent Recovery Progress Evaluation System for ACL Reconstructed Subjects Using Integrated 3-D Kinematics and EMG Features," Biomedical and Health Informatics, IEEE Journal of , vol.PP, no.99, pp.1,1,April 2014. An intelligent recovery evaluation system is presented for objective assessment and performance monitoring of anterior cruciate ligament reconstructed (ACL-R) subjects. The system acquires 3-D kinematics of tibiofemoral joint and electromyography (EMG) data from surrounding muscles during various ambulatory and balance testing activities through wireless body-mounted inertial and EMG sensors, respectively. An integrated feature set is generated based on different features extracted from data collected for each activity. The fuzzy clustering and adaptive neuro-fuzzy inference techniques are applied to these integrated feature sets in order to provide different recovery progress assessment indicators (e.g. current stage of recovery, percentage of recovery progress as compared to healthy group etc.) for ACL-R subjects. The system was trained and tested on data collected from a group of healthy and ACL-R subjects. For recovery stage identification, the average testing accuracy of the system was found above 95% (95-99%) for ambulatory activities and above 80% (80-84%) for balance testing activities. The overall recovery evaluation performed by the proposed system was found consistent with the assessment made by the physiotherapists using standard subjective/objective scores. The validated system can potentially be used as a decision supporting tool by physiatrists, physiotherapists and clinicians for quantitative rehabilitation analysis of ACL-R subjects in conjunction with the existing recovery monitoring systems.
    Keywords: (not provided) (ID#:14-2240)
  • Kaczmarek, J.; Wrobel, M.R., "Operating system security by integrity checking and recovery using write-protected storage," Information Security, IET , vol.8, no.2, pp.122,131, March 2014. An integrity checking and recovery (ICAR) system is presented here, which protects file system integrity and automatically restores modified files. The system enables files cryptographic hashes generation and verification, as well as configuration of security constraints. All of the crucial data, including ICAR system binaries, file backups and hashes database are stored in a physically write-protected storage to eliminate the threat of unauthorized modification. A buffering mechanism was designed and implemented in the system to increase operation performance. Additionally, the system supplies user tools for cryptographic hash generation and security database management. The system is implemented as a kernel extension, compliant with the Linux security model. Experimental evaluation of the system was performed and showed an approximate 10% performance degradation in secured file access compared to regular access.
    Keywords: Linux; database management systems; security of data; ICAR system binaries; Linux security model; buffering mechanism; cryptographic hashes generation; file backups; file system integrity; hashes database; integrity checking and recovery system; security constraints; security database management; system security; unauthorized modification; write-protected storage (ID#:14-2241)
  • Yunfeng Zhu; Lee, P.P.C.; Yinlong Xu; Yuchong Hu; Liping Xiang, "On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems," Parallel and Distributed Systems, IEEE Transactions on , vol.25, no.7, pp.1830,1840, July 2014. Modern storage systems stripe redundant data across multiple nodes to provide availability guarantees against node failures. One form of data redundancy is based on XOR-based erasure codes, which use only XOR operations for encoding and decoding. In addition to tolerating failures, a storage system must also provide fast failure recovery to reduce the window of vulnerability. This work addresses the problem of speeding up the recovery of a single-node failure for general XOR-based erasure codes. We propose a replace recovery algorithm, which uses a hill-climbing technique to search for a fast recovery solution, such that the solution search can be completed within a short time period. We further extend the algorithm to adapt to the scenario where nodes have heterogeneous capabilities (e.g., processing power and transmission bandwidth). We implement our replace recovery algorithm atop a parallelized architecture to demonstrate its feasibility. We conduct experiments on a networked storage system testbed, and show that our replace recovery algorithm uses less recovery time than the conventional recovery approach.
    Keywords: fault tolerant computing; storage management; XOR operations; XOR-based erasure codes; availability guarantees; data redundancy; fast recovery solution; hill-climbing technique; large-scale erasure-coded storage systems; networked storage system testbed; node failures; parallelized architecture; replace recovery algorithm; single-node failure recovery; vulnerability window; Algorithm design and analysis; Distributed databases; Encoding; Equations; Generators; Mathematical model; Strips; XOR-coded storage system; recovery algorithm; single-node failure (ID#:14-2242)
  • Yuankai Chen; Xuan Zeng; Hai Zhou, "Recovery-based resilient latency-insensitive systems," Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014 , vol., no., pp.1,6, 24-28 March 2014. As the interconnect delay is becoming a larger fraction of the clock cycle time, the conventional global stalling mechanism, which is used to correct error in general synchronous circuits, would be no longer feasible because of the expensive timing cost for the stalling signal to travel across the circuit. In this paper, we propose recovery-based resilient latency-insensitive systems (RLISs) that efficiently integrate error-recovery techniques with latency-insensitive design to replace the global stalling. We first demonstrate a baseline RLIS as the motivation of our work that uses additional output buffer which guarantees that only correct data can enter the output channel. However this baseline RLIS suffers from performance degradations even when errors do not occur. We propose a novel improved RLIS that allows erroneous data to propagate in the system. Equipped with improved queues that prevent accumulation of erroneous data, the improved RLIS retains the system performance. We provide theoretical study that analyzes the impact of errors on system performance and the queue sizing problem. We also theoretically prove that the improved RLIS performs no worse than the global stalling mechanism. Experimental results show that the improved RLIS has 40.3% and even 3.1% throughput improvements compared to the baseline RLIS and the infeasible global stalling mechanism respectively, with less than 10% hardware overhead.
    Keywords: clocks; integrated circuit interconnections; logic circuits; RLIS; clock cycle time; error impact; error-recovery; expensive timing cost; global stalling mechanism; improved queues; interconnect delay; queue sizing problem; recovery-based resilient latency-insensitive systems; stalling signal; synchronous circuits; Clocks; Degradation; Integrated circuit interconnections; Relays; Synchronization; System performance; Throughput (ID#:14-2243)
  • Hong, Bi; Choi, Wan, "Asymptotic Analysis Of Failed Recovery Probability In A Distributed Wireless Storage System With Limited Sum Storage Capacity," Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on , vol., no., pp.6459,6463, 4-9 May 2014. In distributed wireless storage systems, failed recovery probability depends on not only wireless channel conditions but also storage size of each distributed storage node. For efficient utilization of limited storage capacity, we asymptotically analyze the failed recovery probability of a distributed wireless storage system with a sum storage capacity constraint when signal-to-noise ratio goes to infinity, and find the optimal storage allocation strategy across distributed storage nodes in terms of the asymptotic failed recovery probability. It is also shown that when the number of storage nodes is sufficiently large the storage size required at each node is not so large for high exponential order of the failed recovery probability.
    Keywords: Distributed storage system; failed recovery; maximum distance separable coding; wireless storage (ID#:14-2244)
  • Sun, J.; Liao, H.; Upadhyaya, B.R., "A Robust Functional-Data-Analysis Method for Data Recovery in Multichannel Sensor Systems," Cybernetics, IEEE Transactions on , vol.44, no.8, pp.1420,1431, Aug. 2014. Multichannel sensor systems are widely used in condition monitoring for effective failure prevention of critical equipment or processes. However, loss of sensor readings due to malfunctions of sensors and/or communication has long been a hurdle to reliable operations of such integrated systems. Moreover, asynchronous data sampling and/or limited data transmission are usually seen in multiple sensor channels. To reliably perform fault diagnosis and prognosis in such operating environments, a data recovery method based on functional principal component analysis (FPCA) can be utilized. However, traditional FPCA methods are not robust to outliers and their capabilities are limited in recovering signals with strongly skewed distributions (i.e., lack of symmetry). This paper provides a robust data-recovery method based on functional data analysis to enhance the reliability of multichannel sensor systems. The method not only considers the possibly skewed distribution of each channel of signal trajectories, but is also capable of recovering missing data for both individual and correlated sensor channels with asynchronous data that may be sparse as well. In particular, grand median functions, rather than classical grand mean functions, are utilized for robust smoothing of sensor signals. Furthermore, the relationship between the functional scores of two correlated signals is modeled using multivariate functional regression to enhance the overall data-recovery capability. An experimental flow-control loop that mimics the operation of coolant-flow loop in a multimodular integral pressurized water reactor is used to demonstrate the effectiveness and adaptability of the proposed data-recovery method. 
The computational results illustrate that the proposed method is robust to outliers and more capable than the existing FPCA-based method in terms of the accuracy in recovering strongly skewed signals. In addition, turbofan engine data are also analyzed to verify the capability of the proposed method in recovering non-skewed signals.
    Keywords: Bandwidth; Data models; Eigenvalues and eigenfunctions; Predictive models; Robustness; Sensor systems; Sun; Asynchronous data; condition monitoring; data recovery; robust functional principal component analysis (ID#:14-2245)
  • Nower, N.; Yasuo Tan; Lim, AO., "Efficient Temporal and Spatial Data Recovery Scheme for Stochastic and Incomplete Feedback Data of Cyber-physical Systems," Service Oriented System Engineering (SOSE), 2014 IEEE 8th International Symposium on , vol., no., pp.192,197, 7-11 April 2014. Feedback loss can severely degrade the overall system performance, in addition, it can affect the control and computation of the Cyber-physical Systems (CPS). CPS hold enormous potential for a wide range of emerging applications including stochastic and time-critical traffic patterns. Stochastic data has a randomness in its nature which make a great challenge to maintain the real-time control whenever the data is lost. In this paper, we propose a data recovery scheme, called the Efficient Temporal and Spatial Data Recovery (ETSDR) scheme for stochastic incomplete feedback of CPS. In this scheme, we identify the temporal model based on the traffic patterns and consider the spatial effect of the nearest neighbor. Numerical results reveal that the proposed ETSDR outperforms both the weighted prediction (WP) and the exponentially weighted moving average (EWMA) algorithm regardless of the increment percentage of missing data in terms of the root mean square error, the mean absolute error, and the integral of absolute error.
    Keywords: data handling; mean square error methods; stochastic processes; CPS; ETSDR scheme; cyber-physical systems; efficient temporal and spatial data recovery; feedback loss; incomplete feedback data; integral of absolute error; mean absolute error; nearest neighbor; real-time control; root mean square error; spatial data recovery scheme; stochastic feedback data; stochastic incomplete feedback; stochastic traffic patterns; system performance; temporal data recovery scheme; temporal model identification; time-critical traffic patterns; Computational modeling; Correlation; Data models; Mathematical model; Measurement uncertainty; Spatial databases; Stochastic processes; auto regressive integrated moving average; cyber-physical system; data recovery scheme; spatial correlation; stochastic data; temporal correlation (ID#:14-2246)
  • Kyoungwoo Heo, "An Accumulated Loss Recovery Algorithm on Overlay Multicast System Using Fountain Codes," Information Science and Applications (ICISA), 2014 International Conference on , vol., no., pp.1,3, 6-9 May 2014. In this paper, we propose an accumulated loss recovery algorithm on overlay multicast system using Fountain codes. Fountain code successfully decodes the packet loss, but it is weak in accumulated losses on multicast tree. The proposed algorithm overcomes an accumulated loss and significantly reduces delay on overlay multicast tree.
    Keywords: error correction codes; multicast communication; overlay networks; packet radio networks; trees (mathematics); Fountain codes; accumulate loss recovery algorithm; delay reduction; overlay multicast system; overlay multicast tree; packet loss decoding; Decoding; Delays; Encoding; Overlay networks; Packet loss; Simulation (ID#:14-2247)
  • Beraud-Sudreau, Q.; Begueret, J.-B.; Mazouffre, O.; Pignol, M.; Baguena, L.; Neveu, C.; Deval, Y.; Taris, T., "SiGe Clock and Data Recovery System Based on Injection-Locked Oscillator for 100 Gbit/s Serial Data Link," Solid-State Circuits, IEEE Journal of, vol. PP, no.99, pp.1,10, 30 April 2014. Clock and data recovery (CDR) systems are the first logic blocks in serial data receivers and the latter's performance depends on the CDR. In this paper, a 100 Gbit/s CDR designed in 130 nm BiCMOS SiGe is presented. The CDR uses an injection locked oscillator (ILO) which delivers the 100 GHz clock. The inherent phase shift between the recovered clock and the incoming data is compensated by a feedback loop which performs phase and frequency tracking. Furthermore, a windowed phase comparator has been used, first to lower the classical number of gates, in order to prevent any delay skews between the different phase detector blocks, then to decrease the phase comparator operating frequency, and furthermore to extend the ability to track zero bit patterns. The measurement results demonstrate a 100 GHz clock signal extracted from 50 Gb/s input data, with a phase noise as low as -98 dBc/Hz at 100 kHz offset from the carrier frequency. The rms jitter of the 25 GHz recovered data is only 1.2 ps. The power consumption is 1.4 W under 2.3 V power supply.
    Keywords: 100 Gb/s; BiCMOS SiGe; clock and data recovery (CDR); injection-locked oscillator (ILO); millimeter-wave data communication; phase comparator; phase-locked loop (PLL) (ID#:14-2248)
  • Xinhai Zhang; Persson, M.; Nyberg, M.; Mokhtari, B.; Einarson, A; Linder, H.; Westman, J.; DeJiu Chen; Torngren, M., "Experience on Applying Software Architecture Recovery To Automotive Embedded Systems," Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), 2014 Software Evolution Week - IEEE Conference on , vol., no., pp.379,382, 3-6 Feb. 2014. The importance and potential advantages with a comprehensive product architecture description are well described in the literature. However, developing such a description takes additional resources, and it is difficult to maintain consistency with evolving implementations. This paper presents an approach and industrial experience which is based on architecture recovery from source code at truck manufacturer Scania CV AB. The extracted representation of the architecture is presented in several views and verified on CAN signal level. Lessons learned are discussed.
    Keywords: automobile industry; embedded systems; software architecture; source code (software); CAN signal level; Scania CV AB; automotive embedded systems; comprehensive product architecture description; extracted representation; software architecture recovery; source code; truck manufacturer; Automotive engineering; Browsers; Computer architecture; Databases; Embedded systems; Software architecture; architecture recovery; automotive industry; distributed embedded systems; software engineering (ID#:14-2249)
  • Xiang Zhou, "Efficient Clock and Carrier Recovery Algorithms for Single-Carrier Coherent Optical Systems: A systematic review on challenges and recent progress," Signal Processing Magazine, IEEE , vol.31, no.2, pp.35,45, March 2014. This article presents a systematic review on the challenges and recent progress of timing and carrier synchronization techniques for high-speed optical transmission systems using single-carrier-based coherent optical modulation formats.
    Keywords: optical communication; optical modulation; synchronization; carrier recovery algorithm; carrier synchronization technique; clock recovery algorithm; high-speed optical transmission system; single-carrier-based coherent optical modulation format; timing synchronization technique; Clocks; Digital signal processing; High-speed optical techniques; Optical distortion; Optical receivers; Optical signal processing; Signal processing algorithms; Timing (ID#:14-2250)
  • Stephens, B.; Cox, AL.; Singla, A; Carter, J.; Dixon, C.; Felter, W., "Practical DCB For Improved Data Center Networks," INFOCOM, 2014 Proceedings IEEE, vol., no., pp.1824,1832, April 27 2014-May 2 2014. Storage area networking is driving commodity data center switches to support lossless Ethernet (DCB). Unfortunately, to enable DCB for all traffic on arbitrary network topologies, we must address several problems that can arise in lossless networks, e.g., large buffering delays, unfairness, head of line blocking, and deadlock. We propose TCP-Bolt, a TCP variant that not only addresses the first three problems but reduces flow completion times by as much as 70%. We also introduce a simple, practical deadlock-free routing scheme that eliminates deadlock while achieving aggregate network throughput within 15% of ECMP routing. This small compromise in potential routing capacity is well worth the gains in flow completion time. We note that our results on deadlock-free routing are also of independent interest to the storage area networking community. Further, as our hardware testbed illustrates, these gains are achievable today, without hardware changes to switches or NICs.
    Keywords: computer centers; local area networks; routing protocols; switching networks; telecommunication network topology; telecommunication traffic; transport protocols; DCB; ECMP routing; NIC; TCP-bolt; arbitrary network topology traffic; buffering delay; commodity data center switch; data center bridging; deadlock-free routing scheme; improved data center network; line blocking head; lossless Ethernet; storage area networking; Hardware; Ports (Computers); Routing; System recovery; Throughput; Topology; Vegetation (ID#:14-2251)
  • Chieh-Hao Chang; Jung-Chun Kao; Fu-Wen Chen; Shih Hsun Cheng, "Many-to-All Priority-Based Network-Coding Broadcast In Wireless Multihop Networks," Wireless Telecommunications Symposium (WTS), 2014 , vol., no., pp.1,6, 9-11 April 2014. This paper addresses the minimum transmission broadcast (MTB) problem for the many-to-all scenario in wireless multihop networks and presents a network-coding broadcast protocol with priority-based deadlock prevention. Our main contributions are as follows: First, we relate the many-to-all-with-network-coding MTB problem to a maximum out-degree problem. The solution of the latter can serve as a lower bound for the number of transmissions. Second, we propose a distributed network-coding broadcast protocol, which constructs efficient broadcast trees and dictates nodes to transmit packets in a network coding manner. Besides, we present the priority-based deadlock prevention mechanism to avoid deadlocks. Simulation results confirm that compared with existing protocols in the literature and the performance bound we present, our proposed network-coding broadcast protocol performs very well in terms of the number of transmissions.
    Keywords: network coding; protocols; radio networks; telecommunication network topology; trees (mathematics); broadcast trees; distributed many-to-all priority-based network-coding broadcast protocol; energy efficiency; many-to-all-with-network-coding MTB problem; maximum out-degree problem; minimum transmission broadcast problem; packet transmission; priority-based deadlock prevention; wireless multihop networks; Encoding; Network coding; Protocols; System recovery; Topology; Vectors; Wireless communication; broadcast; energy efficiency; network coding; wireless networks (ID#:14-2252)
  • Verbeek, F.; Schmaltz, J., "A Decision Procedure for Deadlock-Free Routing in Wormhole Networks," Parallel and Distributed Systems, IEEE Transactions on , vol.25, no.8, pp.1935,1944, Aug. 2014. Deadlock freedom is a key challenge in the design of communication networks. Wormhole switching is a popular switching technique, which is also prone to deadlocks. Deadlock analysis of routing functions is a manual and complex task. We propose an algorithm that automatically proves routing functions deadlock-free or outputs a minimal counter-example explaining the source of the deadlock. Our algorithm is the first to automatically check a necessary and sufficient condition for deadlock-free routing. We illustrate its efficiency in a complex adaptive routing function for torus topologies. Results are encouraging. Deciding deadlock freedom is co-NP-Complete for wormhole networks. Nevertheless, our tool proves a 13 x 13 torus deadlock-free within seconds. Finding minimal deadlocks is more difficult. Our tool needs four minutes to find a minimal deadlock in a 11 x 11 torus while it needs nine hours for a 12 x 12 network.
    Keywords: computational complexity; computer networks; integer programming; linear programming; telecommunication network routing; telecommunication network topology; adaptive routing function; co-NP-complete problem; communication network design; deadlock freedom; deadlock-free routing; decision procedure; necessary condition; routing functions; sufficient condition; torus topologies; wormhole networks; wormhole switching technique; Design methodology; Grippers; Network topology; Routing; Switches; System recovery; Topology; Communication networks; automatic verification; deadlocks; formal methods; routing protocols (ID#:14-2253)
  • Hardy, T.L., "Resilience: A holistic safety approach," Reliability and Maintainability Symposium (RAMS), 2014 Annual , vol., no., pp.1,6, 27-30 Jan. 2014. Decreasing the potential for catastrophic consequences poses a significant challenge for high-risk industries. Organizations are under many different pressures, and they are continuously trying to adapt to changing conditions and recover from disturbances and stresses that can arise from both normal operations and unexpected events. Reducing risks in complex systems therefore requires that organizations develop and enhance traits that increase resilience. Resilience provides a holistic approach to safety, emphasizing the creation of organizations and systems that are proactive, interactive, reactive, and adaptive. This approach relies on disciplines such as system safety and emergency management, but also requires that organizations develop indicators and ways of knowing when an emergency is imminent. A resilient organization must be adaptive, using hands-on activities and lessons learned efforts to better prepare it to respond to future disruptions. It is evident from the discussions of each of the traits of resilience, including their limitations, that there are no easy answers to reducing safety risks in complex systems. However, efforts to strengthen resilience may help organizations better address the challenges associated with the ever-increasing complexities of their systems.
    Keywords: emergency management; large-scale systems; reliability; risk management; safety; system recovery; complex systems; emergency management; high-risk industries; holistic safety approach; resilience; system recovery; system risk reduction; system safety; Accidents; Hazards; Organizations; Personnel; Resilience; Systematics; emergency management; resilience; system safety (ID#:14-2254)
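
The integrity checking and recovery idea summarized in the Kaczmarek and Wrobel entry (ID#:14-2241) can be illustrated with a minimal user-space sketch. This is not the ICAR implementation, which the abstract describes as a Linux kernel extension backed by physically write-protected storage. Here the backup location is an ordinary directory, and the function names (sha256_of, make_baseline, check_and_restore, demo) are hypothetical names chosen for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def make_baseline(files, backup_dir: Path) -> dict:
    """Copy each file into backup_dir and record its trusted hash.

    Assumption: backup_dir stands in for write-protected storage, and
    file names are unique, since backups are keyed by base name.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    baseline = {}
    for f in files:
        shutil.copy2(f, backup_dir / f.name)
        baseline[f] = sha256_of(f)
    return baseline


def check_and_restore(baseline: dict, backup_dir: Path) -> list:
    """Restore every file that is missing or whose hash differs from the baseline."""
    restored = []
    for f, expected in baseline.items():
        if not f.exists() or sha256_of(f) != expected:
            shutil.copy2(backup_dir / f.name, f)
            restored.append(f)
    return restored


def demo() -> list:
    """Tamper with a protected file, then verify that it is restored."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        target = root / "config.txt"
        target.write_text("trusted contents\n")
        baseline = make_baseline([target], root / "backup")
        target.write_text("tampered contents\n")  # simulated unauthorized change
        restored = check_and_restore(baseline, root / "backup")
        assert target.read_text() == "trusted contents\n"
        return [p.name for p in restored]
```

Calling demo() returns the names of the files that had to be restored. A real deployment would also have to protect the hash database and the checker itself, which is the role the abstract assigns to write-protected storage.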

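Several entries above (ID#:14-2239 and ID#:14-2242) concern single-failure recovery in XOR-based erasure-coded storage. The sketch below shows only the basic principle, using a single XOR parity block per stripe. It is a simpler stand-in and does not implement X-code, MDRR, or the replace recovery algorithm from those papers. The function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode_stripe(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover_block(stripe, failed_index):
    """Rebuild the block at failed_index by XOR-ing all surviving blocks.

    This reads every surviving block in the stripe, which is the
    "conventional" style of recovery; the cited papers study how to
    reduce such reads for codes with more than one parity.
    """
    survivors = [b for i, b in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


def demo():
    """Check that any single lost block (data or parity) can be rebuilt."""
    data = [b"disk0blk", b"disk1blk", b"disk2blk"]
    stripe = encode_stripe(data)
    return all(recover_block(stripe, i) == stripe[i] for i in range(len(stripe)))
```

With one parity block, exactly one lost block per stripe can be rebuilt. Double-fault-tolerant codes such as X-code add a second, independent parity set.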

Articles listed on these pages have been found on publicly available internet pages and are cited with links to those pages. Some of the information included herein has been reprinted with permission from the authors or data repositories. Direct any requests via Email to SoS.Project (at) for removal of the links or modifications to specific citations. Please include the ID# of the specific citation in your correspondence.