Biblio

Filters: Keyword is speech processing
Tsaknakis, Ioannis, Hong, Mingyi, Liu, Sijia.  2020.  Decentralized Min-Max Optimization: Formulations, Algorithms and Applications in Network Poisoning Attack. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :5755–5759.
This paper discusses formulations and algorithms which allow a number of agents to collectively solve problems involving both (non-convex) minimization and (concave) maximization operations. These problems have a number of interesting applications in information processing and machine learning, and in particular can be used to model an adversary learning problem called network data poisoning. We develop a number of algorithms to efficiently solve these non-convex min-max optimization problems, by combining techniques such as gradient tracking in the decentralized optimization literature and gradient descent-ascent schemes in the min-max optimization literature. Also, we establish convergence to a first order stationary point under certain conditions. Finally, we perform experiments to demonstrate that the proposed algorithms are effective in the data poisoning attack.
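As a rough illustration of the gradient descent-ascent building block mentioned in the abstract, the following sketch runs plain single-agent descent-ascent on a toy convex-concave objective. It is not the paper's decentralized algorithm and does not include gradient tracking; the objective, step size, and function names are assumptions chosen for illustration.

```python
def gradient_descent_ascent(grad_x, grad_y, x, y, step=0.1, iters=200):
    """Simultaneous gradient descent on x and gradient ascent on y."""
    for _ in range(iters):
        # Evaluate both partial gradients at the current point before updating.
        gx, gy = grad_x(x, y), grad_y(x, y)
        x, y = x - step * gx, y + step * gy
    return x, y

# Toy objective (assumption): f(x, y) = 0.5*x**2 + x*y - 0.5*y**2,
# which has its saddle point at (0, 0).
x_star, y_star = gradient_descent_ascent(
    grad_x=lambda x, y: x + y,   # df/dx
    grad_y=lambda x, y: x - y,   # df/dy
    x=1.0,
    y=-2.0,
)
```

On this toy problem the iterates spiral in toward the saddle point; non-convex problems such as those in the paper generally need the additional conditions the authors mention.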
Li, Xu, Zhong, Jinghua, Wu, Xixin, Yu, Jianwei, Liu, Xunying, Meng, Helen.  2020.  Adversarial Attacks on GMM I-Vector Based Speaker Verification Systems. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :6579–6583.
This work investigates the vulnerability of Gaussian Mixture Model (GMM) i-vector based speaker verification systems to adversarial attacks, and the transferability of adversarial samples crafted from GMM i-vector based systems to x-vector based systems. In detail, we formulate the GMM i-vector system as a scoring function of enrollment and testing utterance pairs. Then we leverage the fast gradient sign method (FGSM) to optimize testing utterances for adversarial samples generation. These adversarial samples are used to attack both GMM i-vector and x-vector systems. We measure the system vulnerability by the degradation of equal error rate and false acceptance rate. Experiment results show that GMM i-vector systems are seriously vulnerable to adversarial attacks, and the crafted adversarial samples are proved to be transferable and pose threats to neural network speaker embedding based systems (e.g. x-vector systems).
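The FGSM step named in the abstract can be sketched in a few lines. The linear `score` function below is a hypothetical stand-in for a verification scoring function, not the GMM i-vector system from the paper; the weights, the test signal, and `epsilon` are all assumptions.

```python
import numpy as np

def fgsm_perturb(x, grad, epsilon):
    """One FGSM step: shift every sample by epsilon in the direction of the gradient sign."""
    return x + epsilon * np.sign(grad)

rng = np.random.default_rng(0)
w = rng.standard_normal(16)   # toy "model" weights (assumption)
x = rng.standard_normal(16)   # toy test utterance (assumption)

def score(signal):
    """Hypothetical verification score: a linear function of the test signal."""
    return float(w @ signal)

grad = w                      # gradient of the linear toy score with respect to the signal
x_adv = fgsm_perturb(x, grad, epsilon=0.01)
```

Each sample moves by at most `epsilon`, yet the score rises because every sample moves in the direction that increases it.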
Elvira, Clément, Herzet, Cédric.  2020.  Short and Squeezed: Accelerating the Computation of Antisparse Representations with Safe Squeezing. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :5615–5619.
Antisparse coding aims at spreading the information uniformly over representation coefficients and can be expressed as the solution of an ℓ∞-norm regularized problem. In this paper, we propose a new methodology, coined "safe squeezing", accelerating the computation of antisparse representations. The idea consists in identifying saturated entries of the solution via simple tests and compacting their contribution to achieve some form of dimensionality reduction. Numerical experiments show that the proposed approach leads to significant computational gain.
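For orientation, an ℓ∞-norm regularized problem of the kind the abstract refers to is commonly written as below. The symbols (observation y, dictionary A, regularization weight λ) are assumptions; the abstract does not fix a notation.

```latex
\min_{x \in \mathbb{R}^n} \; \tfrac{1}{2}\,\lVert y - A x \rVert_2^2 + \lambda\,\lVert x \rVert_\infty,
\qquad
\text{entry } i \text{ is saturated when } \lvert x_i^\star \rvert = \lVert x^\star \rVert_\infty .
```

Under this reading, saturated entries all share the same magnitude, which is presumably what allows their contributions to be compacted.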
Zhang, S., Ma, X..  2020.  A General Difficulty Control Algorithm for Proof-of-Work Based Blockchains. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :3077–3081.
Designing an efficient difficulty control algorithm is an essential problem in Proof-of-Work (PoW) based blockchains because the network hash rate is randomly changing. This paper proposes a general difficulty control algorithm and provides insights for difficulty adjustment rules for PoW based blockchains. The proposed algorithm consists of a two-layer neural network. It has low memory cost while satisfying the fast-updating and low-volatility requirements for difficulty adjustment. Real data from Ethereum are used in the simulations to prove that the proposed algorithm has better performance for the control of the block difficulty.
Zarazaga, P. P., Bäckström, T., Sigg, S..  2020.  Acoustic Fingerprints for Access Management in Ad-Hoc Sensor Networks. IEEE Access. 8:166083–166094.

Voice user interfaces can offer intuitive interaction with our devices, but the usability and audio quality could be further improved if multiple devices could collaborate to provide a distributed voice user interface. To ensure that users' voices are not shared with unauthorized devices, it is however necessary to design an access management system that adapts to the users' needs. Prior work has demonstrated that a combination of audio fingerprinting and fuzzy cryptography yields a robust pairing of devices without sharing the information that they record. However, the robustness of these systems is partially based on the extensive duration of the recordings that are required to obtain the fingerprint. This paper analyzes methods for robust generation of acoustic fingerprints in short periods of time to enable the responsive pairing of devices according to changes in the acoustic scenery and can be integrated into other typical speech processing tools.

Huang, Y., Wang, S., Wang, Y., Li, H..  2020.  A New Four-Dimensional Chaotic System and Its Application in Speech Encryption. 2020 Information Communication Technologies Conference (ICTC). :171–175.
Traditional encryption algorithms are not suitable for modern mass speech scenarios. Some low-dimensional chaotic encryption algorithms are simple and easy to implement, but their key space is often small, leading to poor security, so there is still a lot of room for improvement. To address these problems, this paper proposes a new type of four-dimensional chaotic system and applies it to speech encryption. Simulation results show that the proposed encryption scheme has a larger key space and higher security, and can achieve the goal of speech encryption.
Huang, Y., Wang, Y..  2019.  Multi-format speech perception hashing based on time-frequency parameter fusion of energy zero ratio and frequency band variance. 2019 3rd International Conference on Electronic Information Technology and Computer Engineering (EITCE). :243–251.

To address the problems of existing speech content authentication algorithms, such as support for only a single format, lack of generality, low security, and low accuracy of small-scale tamper detection and localization, a multi-format speech perception hashing scheme based on time-frequency parameter fusion of energy zero ratio and frequency band variance is proposed. First, the algorithm preprocesses the input speech signal and calculates the short-time logarithmic energy, zero-crossing rate and frequency band variance of each speech fragment. It then calculates the energy to zero ratio of each frame, fuses the time-frequency features by mean filtering, and constructs the hash from the fused time-frequency parameters with a difference hashing method. Finally, the hash sequence is scrambled with equal length by a logistic chaotic map, so as to improve the security of the hash sequence during transmission. Experiments show that the proposed algorithm offers robustness, discrimination and key dependence.
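The pipeline described above (frame features, difference hashing, logistic-map scrambling) can be sketched roughly as follows. This is a simplified illustration, not the authors' algorithm: the frequency band variance and the mean-filtering fusion step are omitted, the way the energy and zero-crossing rate are combined into a ratio is an assumption, and the frame length and logistic-map parameters are hypothetical.

```python
import numpy as np

def frame_features(signal, frame_len=256):
    """Short-time log energy and zero-crossing rate for non-overlapping frames."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-12)
    # Fraction of adjacent sample pairs whose signs differ.
    zcr = np.mean(np.signbit(frames[:, 1:]) != np.signbit(frames[:, :-1]), axis=1)
    return log_energy, zcr

def difference_hash(values):
    """Bit i is 1 when the feature rises from frame i to frame i + 1."""
    return (np.diff(values) > 0).astype(np.uint8)

def logistic_permutation(n, x0=0.3, r=3.99):
    """Permutation obtained by sorting a logistic-map sequence; x0 and r act as the key."""
    seq = np.empty(n)
    x = x0
    for i in range(n):
        x = r * x * (1.0 - x)
        seq[i] = x
    return np.argsort(seq)

rng = np.random.default_rng(1)
speech = rng.standard_normal(8192)        # placeholder for a real speech signal
log_energy, zcr = frame_features(speech)
ratio = log_energy / (zcr + 1e-6)         # one possible energy-to-zero ratio (assumption)
bits = difference_hash(ratio)
perm = logistic_permutation(len(bits))
scrambled = bits[perm]                    # same length as the unscrambled hash
```

A receiver holding the same `x0` and `r` can regenerate `perm` and undo the scrambling, which is the sense in which the hash is key dependent in this sketch.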

Su, Jinsong, Zeng, Jiali, Xiong, Deyi, Liu, Yang, Wang, Mingxuan, Xie, Jun.  2018.  A Hierarchy-to-Sequence Attentional Neural Machine Translation Model. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 26:623–632.

Although sequence-to-sequence attentional neural machine translation (NMT) has achieved great progress recently, it is confronted with two challenges: learning optimal model parameters for long parallel sentences and effectively exploiting different scopes of contexts. In this paper, partially inspired by the idea of segmenting a long sentence into short clauses, each of which can be easily translated by NMT, we propose a hierarchy-to-sequence attentional NMT model to handle these two challenges. Our encoder takes the segmented clause sequence as input and explores a hierarchical neural network structure to model words, clauses, and sentences at different levels, particularly with two layers of recurrent neural networks modeling semantic compositionality at the word and clause level. Correspondingly, the decoder sequentially translates segmented clauses and simultaneously applies two types of attention models to capture interclause and intraclause contexts for translation prediction. In this way, we can not only improve parameter learning, but also better explore different scopes of contexts for translation. Experimental results on Chinese-English and English-German translation demonstrate the superiority of the proposed model over the conventional NMT model.

Kassim, Sarah, Megherbi, Ouerdia, Hamiche, Hamid, Djennoune, Saïd, Bettayeb, Maamar.  2019.  Speech encryption based on the synchronization of fractional-order chaotic maps. 2019 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT). :1–6.
This work presents a new method of encrypting and decrypting speech based on a chaotic key generator. The proposed scheme takes advantage of the best features of chaotic systems. In the proposed method, the input speech signal is converted into an image which is ciphered by an encryption function using a chaotic key matrix generated from a fractional-order chaotic map. Based on a deadbeat observer, the exact synchronization of the system used is established, and the decryption is performed. Several analyses are applied to assess the effectiveness of the encryption system. The obtained results confirm that the proposed system offers a higher level of security against various attacks and holds a strong key generation mechanism for satisfactory speech communication.
Khomytska, Iryna, Teslyuk, Vasyl.  2019.  Mathematical Methods Applied for Authorship Attribution on the Phonological Level. 2019 IEEE 14th International Conference on Computer Sciences and Information Technologies (CSIT). 3:7–11.

The proposed combination of statistical methods has proved efficient for authorship attribution. The complex analysis method based on the proposed combination of statistical methods has made it possible to minimize the number of phoneme groups by which the authorial differentiation of texts has been done.

Dai, Haipeng, Liu, Alex X., Li, Zeshui, Wang, Wei, Zhang, Fengmin, Dong, Chao.  2019.  Recognizing Driver Talking Direction in Running Vehicles with a Smartphone. 2019 IEEE 16th International Conference on Mobile Ad Hoc and Sensor Systems (MASS). :10–18.
This paper addresses the fundamental problem of identifying driver talking directions using a single smartphone, which can help drivers by warning them when conversations with passengers become a distraction, and thereby enhance safety. The basic idea of our system is to perform talking status and direction identification using two microphones on a smartphone. We first use the sound recorded by the two microphones to identify whether the driver is talking or not. If yes, we then extract the so-called channel fingerprint from the speech signal and classify it into one of three typical driver talking directions, namely, front, right and back, using a trained model obtained in advance. The key novelty of our scheme is the proposition of channel fingerprint which leverages the heavy multipath effects in the harsh in-vehicle environment and cancels the variability of human voice, both of which combine to invalidate traditional TDoA, DoA and fingerprint based sound source localization approaches. We conducted extensive experiments using two kinds of phones and two vehicles for four phone placements in three representative scenarios, and collected 23 hours of voice data from 20 participants. The results show that our system can achieve 95.0% classification accuracy on average.
Chai, Yadeng, Liu, Yong.  2019.  Natural Spoken Instructions Understanding for Robot with Dependency Parsing. 2019 IEEE 9th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER). :866–871.
This paper presents a method based on syntactic information, which can be used for intent determination and slot filling tasks in a spoken language understanding system, including the spoken instruction understanding module for a robot. Some studies in recent years attempt to solve the problem of spoken language understanding via syntactic information. This research is a further extension of these approaches which is based on dependency parsing. In this model, the inputs to the neural network are vectors generated from a dependency parse tree, which we call window vectors. These vectors contain dependency features that improve the performance of the syntactic-based model. The model has been evaluated on the benchmark ATIS task, and the results show that it outperforms many other syntactic-based approaches; in slot filling in particular, its performance is on par with some state-of-the-art deep learning algorithms of recent years. The model has also been evaluated on FBM3, a dataset of the RoCKIn@Home competition. The overall rate of correctly understanding robot instructions is quite good but still not acceptable for practical use, which is attributed to the small scale of FBM3.
Karve, Shreya, Nagmal, Arati, Papalkar, Sahil, Deshpande, S. A..  2018.  Context Sensitive Conversational Agent Using DNN. 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA). :475–478.
We investigate a method of building a closed domain intelligent conversational agent using deep neural networks. A conversational agent is a dialog system intended to converse with a human, with a coherent structure. Our conversational agent uses a retrieval based model that identifies the intent of the input user query and maps it to a knowledge base to return appropriate results. Human conversations are based on context, but existing conversational agents are context insensitive. To overcome this limitation, our system uses a simple stack based context identification and storage system. The conversational agent generates responses according to the current context of conversation, allowing more human-like conversations.
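A stack-based context store of the kind described can be sketched as follows. This is a hypothetical illustration only: intent identification is reduced to keyword matching rather than a deep neural network, and the class, function, and knowledge-base contents are invented for the example.

```python
class ContextStack:
    """Keeps the most recent conversation topic on top."""

    def __init__(self):
        self._stack = []

    def push(self, topic):
        self._stack.append(topic)

    def current(self):
        return self._stack[-1] if self._stack else None


def respond(query, contexts, knowledge_base):
    """Answer a query, falling back to the current context when no topic is named."""
    text = query.lower()
    # A query that names a known topic opens a new context.
    for topic in knowledge_base:
        if topic in text:
            contexts.push(topic)
            break
    topic = contexts.current()
    if topic is None:
        return "Could you tell me what this is about?"
    # Look for a keyword that belongs to the active topic.
    for keyword, answer in knowledge_base[topic].items():
        if keyword in text:
            return answer
    return f"What would you like to know about {topic}?"


# Hypothetical closed-domain knowledge base.
kb = {
    "admissions": {"deadline": "Applications close in June."},
    "library": {"hours": "The library is open from 9 to 5."},
}
contexts = ContextStack()
first = respond("Tell me about admissions", contexts, kb)
second = respond("What is the deadline?", contexts, kb)
```

The second query never mentions admissions, yet it is answered from that topic because the stack remembers the context set by the first query.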
Jawad, Ameer K., Abdullah, Hikmat N., Hreshee, Saad S..  2018.  Secure speech communication system based on scrambling and masking by chaotic maps. 2018 International Conference on Advance of Sustainable Engineering and its Application (ICASEA). :7–12.
As interest grows in communication systems that use public channels for transmitting information, many channel problems arise. Among these problems, the most important one to address is information security. This paper presents a proposed communication system with high security that uses two encryption levels based on chaotic systems. The first level is chaotic scrambling, while the second one is chaotic masking. This configuration increases the information security since the key space becomes very large. The MATLAB simulation results showed that the Segmental Spectral Signal to Noise Ratio (SSSNR) of the first level (chaotic scrambling) is reduced by -5.195 dB compared to time domain scrambling. Furthermore, in the second level (chaotic masking), the SSSNR is reduced by -20.679 dB. It is also shown that when the two levels are combined, the overall reduction obtained is -21.755 dB.
Yao, Y., Xiao, B., Wu, G., Liu, X., Yu, Z., Zhang, K., Zhou, X..  2017.  Voiceprint: A Novel Sybil Attack Detection Method Based on RSSI for VANETs. 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). :591–602.

Vehicular Ad Hoc Networks (VANETs) enable vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications that bring many benefits and conveniences to improve road safety and driving comfort in future transportation systems. Sybil attack is considered one of the most risky threats in VANETs since a Sybil attacker can generate multiple fake identities with false messages to severely impair the normal functions of safety-related applications. In this paper, we propose a novel Sybil attack detection method based on Received Signal Strength Indicator (RSSI), Voiceprint, to conduct a widely applicable, lightweight and fully distributed detection for VANETs. To avoid the inaccurate position estimation according to predefined radio propagation models in previous RSSI-based detection methods, Voiceprint adopts the RSSI time series as the vehicular speech and compares the similarity among all received time series. Voiceprint does not rely on any predefined radio propagation model, and conducts independent detection without the support of the centralized infrastructure. It achieves a more accurate detection rate in different dynamic environments. Extensive simulations and real-world experiments demonstrate that the proposed Voiceprint is an effective method considering the cost, complexity and performance.

Badii, A., Faulkner, R., Raval, R., Glackin, C., Chollet, G..  2017.  Accelerated Encryption Algorithms for Secure Storage and Processing in the Cloud. 2017 International Conference on Advanced Technologies for Signal and Image Processing (ATSIP). :1–6.

The objective of this paper is to outline the design specification, implementation and evaluation of a proposed accelerated encryption framework which deploys both homomorphic and symmetric-key encryptions to serve the privacy preserving processing; in particular, as a sub-system within the Privacy Preserving Speech Processing framework architecture as part of the PPSP-in-Cloud Platform. Following a preliminary study of GPU efficiency gains optimisations benchmarked for AES implementation we have addressed and resolved the Big Integer processing challenges in parallel implementation of bilinear pairing thus enabling the creation of partially homomorphic encryption schemes which facilitates applications such as speech processing in the encrypted domain on the cloud. This novel implementation has been validated in laboratory tests using a standard speech corpus and can be used for other application domains to support secure computation and privacy preserving big data storage/processing in the cloud.

Shimauchi, S., Ohmuro, H..  2014.  Accurate adaptive filtering in square-root Hann windowed short-time fourier transform domain. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. :1305-1309.

A novel short-time Fourier transform (STFT) domain adaptive filtering scheme is proposed that can be easily combined with nonlinear post filters such as residual echo or noise reduction in acoustic echo cancellation. Unlike normal STFT subband adaptive filters, which suffer from aliasing artifacts due to their poor prototype filter, our scheme achieves good accuracy by exploiting the relationship between the linear convolution and the poor prototype filter, i.e., the STFT window function. The effectiveness of our scheme was confirmed through the results of simulations conducted to compare it with conventional methods.

Rafii, Z., Coover, B., Jinyu Han.  2014.  An audio fingerprinting system for live version identification using image processing techniques. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. :644-648.

Suppose that you are at a music festival checking on an artist, and you would like to quickly know about the song that is being played (e.g., title, lyrics, album, etc.). If you have a smartphone, you could record a sample of the live performance and compare it against a database of existing recordings from the artist. Services such as Shazam or SoundHound will not work here, as this is not the typical framework for audio fingerprinting or query-by-humming systems: a live performance is neither identical to its studio version (e.g., variations in instrumentation, key, tempo, etc.) nor is it a hummed or sung melody. We propose an audio fingerprinting system that can deal with live version identification by using image processing techniques. Compact fingerprints are derived using a log-frequency spectrogram and an adaptive thresholding method, and template matching is performed using the Hamming similarity and the Hough Transform.
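The thresholding and Hamming-similarity matching steps can be illustrated with a small sketch. It is a simplification under stated assumptions: the adaptive threshold is reduced to a per-row median, the Hough Transform stage is left out, the "spectrogram" is random placeholder data, and the noisy query is simulated by flipping bits of the reference fingerprint.

```python
import numpy as np

def binarize(spectrogram):
    """Set a bin to 1 when it exceeds the median of its frequency row (simplified threshold)."""
    thresholds = np.median(spectrogram, axis=1, keepdims=True)
    return (spectrogram > thresholds).astype(np.uint8)

def hamming_similarity(a, b):
    """Fraction of positions where two equally sized bit arrays agree."""
    return float(np.mean(a == b))

def best_offset(query_bits, reference_bits):
    """Slide the query along the time axis of the reference and keep the best match."""
    width = query_bits.shape[1]
    n_offsets = reference_bits.shape[1] - width + 1
    scores = [
        hamming_similarity(query_bits, reference_bits[:, t:t + width])
        for t in range(n_offsets)
    ]
    best = int(np.argmax(scores))
    return best, scores[best]

rng = np.random.default_rng(2)
reference_bits = binarize(rng.random((20, 100)))   # placeholder spectrogram (assumption)
query_bits = reference_bits[:, 30:50].copy()       # excerpt starting at time frame 30
flip = rng.random(query_bits.shape) < 0.05         # simulate a noisy query
query_bits[flip] ^= 1
offset, similarity = best_offset(query_bits, reference_bits)
```

Because unrelated binary patches agree on roughly half of their bits, the true alignment stands out clearly even with a few flipped bits.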

Andrade Esquef, P.A., Apolinario, J.A., Biscainho, L.W.P..  2014.  Edit Detection in Speech Recordings via Instantaneous Electric Network Frequency Variations. Information Forensics and Security, IEEE Transactions on. 9:2314-2326.

In this paper, an edit detection method for forensic audio analysis is proposed. It develops and improves a previous method through changes in the signal processing chain and a novel detection criterion. As with the original method, electrical network frequency (ENF) analysis is central to the novel edit detector, for it allows monitoring anomalous variations of the ENF related to audio edit events. Working in an unsupervised manner, the edit detector compares the extent of ENF variations, centered at its nominal frequency, with a variable threshold that defines the upper limit for normal variations observed in unedited signals. The ENF variations caused by edits in the signal are likely to exceed the threshold, providing a mechanism for their detection. The proposed method is evaluated in both qualitative and quantitative terms via two distinct annotated databases. Results are reported for originally noisy database signals as well as versions of them further degraded under controlled conditions. A comparative performance evaluation, in terms of equal error rate (EER) detection, reveals that, for one of the tested databases, an improvement from 7% to 4% EER is achieved, respectively, from the original to the new edit detection method. When the signals are amplitude clipped or corrupted by broadband background noise, the performance figures of the novel method follow the same profile of those of the original method.