Visible to the public Biblio

Filters: Keyword is multi-threading  [Clear All Filters]
2014
Chi Sing Chum, Changha Jun, Xiaowen Zhang.  2014.  Implementation of randomize-then-combine constructed hash function. Wireless and Optical Communication Conference (WOCC), 2014 23rd. :1-6.

Hash functions, such as SHA (secure hash algorithm) and MD (message digest) families that are built upon Merkle-Damgard construction, suffer many attacks due to the iterative nature of block-by-block message processing. Chum and Zhang [4] proposed a new hash function construction that takes advantage of the randomize-then-combine technique, which was used in the incremental hash functions, to the iterative hash function. In this paper, we implement such hash construction in three ways distinguished by their corresponding padding methods. We conduct the experiment in parallel multi-threaded programming settings. The results show that the speed of proposed hash function is no worse than SHA1.
 

Gimenez, A., Gamblin, T., Rountree, B., Bhatele, A., Jusufi, I., Bremer, P.-T., Hamann, B..  2014.  Dissecting On-Node Memory Access Performance: A Semantic Approach. High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for. :166-176.

Optimizing memory access is critical for performance and power efficiency. CPU manufacturers have developed sampling-based performance measurement units (PMUs) that report precise costs of memory accesses at specific addresses. However, this data is too low-level to be meaningfully interpreted and contains an excessive amount of irrelevant or uninteresting information. We have developed a method to gather fine-grained memory access performance data for specific data objects and regions of code with low overhead and attribute semantic information to the sampled memory accesses. This information provides the context necessary to more effectively interpret the data. We have developed a tool that performs this sampling and attribution and used the tool to discover and diagnose performance problems in real-world applications. Our techniques provide useful insight into the memory behaviour of applications and allow programmers to understand the performance ramifications of key design decisions: domain decomposition, multi-threading, and data motion within distributed memory systems.
 

2017
Rolinger, T. B., Simon, T. A., Krieger, C. D..  2017.  Performance challenges for heterogeneous distributed tensor decompositions. 2017 IEEE High Performance Extreme Computing Conference (HPEC). :1–7.

Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DeFacTo, an existing distributed memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we provide a discussion of the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo when using MKL is up to 6.8× faster than ReFacTo when using cuSPARSE.

2018
Huang, Bo-Yuan, Ray, Sayak, Gupta, Aarti, Fung, Jason M., Malik, Sharad.  2018.  Formal Security Verification of Concurrent Firmware in SoCs Using Instruction-Level Abstraction for Hardware*. 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). :1-6.

Formal security verification of firmware interacting with hardware in modern Systems-on-Chip (SoCs) is a critical research problem. This faces the following challenges: (1) design complexity and heterogeneity, (2) semantics gaps between software and hardware, (3) concurrency between firmware/hardware and between Intellectual Property Blocks (IPs), and (4) expensive bit-precise reasoning. In this paper, we present a co-verification methodology to address these challenges. We model hardware using the Instruction-Level Abstraction (ILA), capturing firmware-visible behavior at the architecture level. This enables integrating hardware behavior with firmware in each IP into a single thread. The co-verification with multiple firmware across IPs is formulated as a multi-threaded program verification problem, for which we leverage software verification techniques. We also propose an optimization using abstraction to prevent expensive bit-precise reasoning. The evaluation of our methodology on an industry SoC Secure Boot design demonstrates its applicability in SoC security verification.

Huang, Jeff.  2018.  UFO: Predictive Concurrency Use-After-Free Detection. 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). :609-619.

Use-After-Free (UAF) vulnerabilities are caused by the program operating on a dangling pointer and can be exploited to compromise critical software systems. While there have been many tools to mitigate UAF vulnerabilities, UAF remains one of the most common attack vectors. UAF is particularly di cult to detect in concurrent programs, in which a UAF may only occur with rare thread schedules. In this paper, we present a novel technique, UFO, that can precisely predict UAFs based on a single observed execution trace with a provably higher detection capability than existing techniques with no false positives. The key technical advancement of UFO is an extended maximal thread causality model that captures the largest possible set of feasible traces that can be inferred from a given multithreaded execution trace. By formulating UAF detection as a constraint solving problem atop this model, we can explore a much larger thread scheduling space than classical happens-before based techniques. We have evaluated UFO on several real-world large complex C/C++ programs including Chromium and FireFox. UFO scales to real-world systems with hundreds of millions of events in their execution and has detected a large number of real concurrency UAFs.

2019
Chondamrongkul, Nacha, Sun, Jing, Wei, Bingyang, Warren, Ian.  2019.  Parallel Verification of Software Architecture Design. 2019 IEEE 19th International Symposium on High Assurance Systems Engineering (HASE). :50–57.
In the component-based software system, certain behaviours of components and their composition may affect system reliability at runtime. This problem can be early detected through the automated verification of software architecture design, by which model checking is one of the techniques to achieve this. However, its practicality and performance issue remain challenges. This paper presents a scalable approach for the software architecture verification. The modelling is proposed to manifest the behaviours in the software component, in order to detect problematic behaviours, such as circular dependency and performance bottleneck. The outcome of the verification identifies the problem and the scenarios that cause it. In order to mitigate the verification performance issue, the parallelism is applied to the verification process so that multiple decomposed models can be simultaneously verified on a multi-threaded environment. As some software systems are designed as the monolithic architecture, we present a method that helps to automatically decompose a large monolithic model into a set of smaller sub-models. Our approach was evaluated and proved to enhance the performance of the verification process for the large-scale complex software systems.
Tang, Deyou, Zhang, Yazhuo, Zeng, Qingmiao.  2019.  Optimization of Hardware-oblivious and Hardware-conscious Hash-join Algorithms on KNL. 2019 4th International Conference on Cloud Computing and Internet of Things (CCIOT). :24–28.
Investigation of hash join algorithm on multi-core and many-core platforms showed that carefully tuned hash join implementations could outperform simple hash joins on most multi-core servers. However, hardware-oblivious hash join has shown competitive performance on many-core platforms. Knights Landing (KNL) has received attention in the field of parallel computing for its massively data-parallel nature and high memory bandwidth, but both hardware-oblivious and hardware-conscious hash join algorithms have not been systematically discussed and evaluated for KNL's characteristics (high bandwidth, cluster mode, etc.). In this paper, we present the design and implementation of the state-of-the-art hardware-oblivious and hardware-conscious hash joins that are tuned to exploit various KNL hardware characteristics. Using a thorough evaluation, we show that:1) Memory allocation strategies based on KNL's architecture are effective for both hardware-oblivious and hardware-conscious hash join algorithms; 2) In order to improve the efficiency of the hash join algorithms, hardware architecture features are still non-negligible factors.
Meng, Ruijie, Zhu, Biyun, Yun, Hao, Li, Haicheng, Cai, Yan, Yang, Zijiang.  2019.  CONVUL: An Effective Tool for Detecting Concurrency Vulnerabilities. 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). :1154—1157.

Concurrency vulnerabilities are extremely harmful and can be frequently exploited to launch severe attacks. Due to the non-determinism of multithreaded executions, it is very difficult to detect them. Recently, data race detectors and techniques based on maximal casual model have been applied to detect concurrency vulnerabilities. However, the former are ineffective and the latter report many false negatives. In this paper, we present CONVUL, an effective tool for concurrency vulnerability detection. CONVUL is based on exchangeable events, and adopts novel algorithms to detect three major kinds of concurrency vulnerabilities. In our experiments, CONVUL detected 9 of 10 known vulnerabilities, while other tools only detected at most 2 out of these 10 vulnerabilities. The 10 vulnerabilities are available at https://github.com/mryancai/ConVul.

2020
Li, R., Ishimaki, Y., Yamana, H..  2020.  Privacy Preserving Calculation in Cloud using Fully Homomorphic Encryption with Table Lookup. 2020 5th IEEE International Conference on Big Data Analytics (ICBDA). :315–322.
To protect data in cloud servers, fully homomorphic encryption (FHE) is an effective solution. In addition to encrypting data, FHE allows a third party to evaluate arithmetic circuits (i.e., computations) over encrypted data without decrypting it, guaranteeing protection even during the calculation. However, FHE supports only addition and multiplication. Functions that cannot be directly represented by additions or multiplications cannot be evaluated with FHE. A naïve implementation of such arithmetic operations with FHE is a bit-wise operation that encrypts numerical data as a binary string. This incurs huge computation time and storage costs, however. To overcome this limitation, we propose an efficient protocol to evaluate multi-input functions with FHE using a lookup table. We extend our previous work, which evaluates a single-integer input function, such as f(x). Our extended protocol can handle multi-input functions, such as f(x,y). Thus, we propose a new method of constructing lookup tables that can evaluate multi-input functions to handle general functions. We adopt integer encoding rather than bit-wise encoding to speed up the evaluations. By adopting both permutation operations and a private information retrieval scheme, we guarantee that no information from the underlying plaintext is leaked between two parties: a cloud computation server and a decryptor. Our experimental results show that the runtime of our protocol for a two-input function is approximately 13 minutes, when there are 8,192 input elements in the lookup table. By adopting a multi-threading technique, the runtime can be further reduced to approximately three minutes with eight threads. Our work is more practical than a previously proposed bit-wise implementation, which requires 60 minutes to evaluate a single-input function.