Biblio

Found 369 results

Filters: Keyword is Big Data
Liu, Yizhong, Xia, Yu, Liu, Jianwei, Hei, Yiming.  2021.  A Secure and Decentralized Reconfiguration Protocol For Sharding Blockchains. 2021 7th IEEE Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS). :111–116.
Most present reconfiguration methods in sharding blockchains rely on a secure randomness, whose generation might be complicated. Besides, a reference committee is usually in charge of the reconfiguration, making the process not decentralized. To address the above issues, this paper proposes a secure and decentralized shard reconfiguration protocol, which allows each shard to complete the selection and confirmation of its own shard members in turn. The PoW mining puzzle is calculated using the public key hash value in the member list confirmed by the last shard. Through the mining and shard member list commitment process, each shard can update its members safely and efficiently once in a while. Furthermore, it is proved that our protocol satisfies the safety, consistency, liveness, and decentralization properties. The honest member proportion in each confirmed shard member list is guaranteed to exceed a certain safety threshold, and all honest nodes have an identical view on the list. The reconfiguration is ensured to make progress, and each node has the same right to participate in the process. Our secure and decentralized shard reconfiguration protocol could be applied to all committee-based sharding blockchains.
Zhang, Xiangyu, Yang, Jianfeng, Li, Xiumei, Liu, Minghao, Kang, Ruichun, Wang, Runmin.  2021.  Deeply Multi-channel guided Fusion Mechanism for Natural Scene Text Detection. 2021 7th International Conference on Big Data and Information Analytics (BigDIA). :149–156.
Scene text detection methods have developed greatly in the past few years. However, owing to the diversity of text backgrounds in natural scenes, previous methods often fail when detecting more complicated text instances (e.g., super-long text and arbitrarily shaped text). In this paper, a text detection method based on multi-channel bounding box fusion is designed to address the problem. Firstly, a convolutional neural network is used as the basic network for feature extraction, producing a shallow text feature map and a deep semantic text feature map. Secondly, the whole convolutional network is used to upsample the feature maps and fuse them at each layer, so as to obtain pixel-level text and non-text classification results. Then, two independent text detection box channels are designed: a bounding box regression channel, and a channel that obtains the bounding box directly from the score map. Finally, the result is obtained by applying the multi-channel bounding box fusion mechanism to the detection boxes of the two channels. Experiments on ICDAR2013 and ICDAR2015 demonstrate that the proposed method achieves competitive results in scene text detection.
Zhang, Cheng, Yamana, Hayato.  2021.  Improving Text Classification Using Knowledge in Labels. 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA). :193–197.
Various algorithms and models have been proposed to address text classification tasks; however, they rarely consider incorporating the additional knowledge hidden in class labels. We argue that hidden information in class labels leads to better classification accuracy. In this study, instead of encoding the labels into numerical values, we incorporated the knowledge in the labels into the original model without changing the model architecture. We combined the output of an original classification model with the relatedness calculated based on the embeddings of a sequence and a keyword set. A keyword set is a word set to represent knowledge in the labels. Usually, it is generated from the classes while it could also be customized by the users. The experimental results show that our proposed method achieved statistically significant improvements in text classification tasks. The source code and experimental details of this study can be found on GitHub.
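As a rough illustration of combining a classifier's output with label relatedness, the sketch below mixes per-class probabilities with the cosine similarity between a sequence embedding and a keyword-set embedding. The weighted sum, the `weight` parameter, the function names, and the toy two-dimensional embeddings are all assumptions made for illustration; the abstract does not state how the two signals are combined.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combine(model_probs, sequence_embedding, keyword_embeddings, weight=0.5):
    """Blend classifier probabilities with label relatedness.

    The linear blend is a hypothetical choice, not the paper's formula.
    """
    scores = {}
    for label, prob in model_probs.items():
        relatedness = cosine(sequence_embedding, keyword_embeddings[label])
        scores[label] = (1 - weight) * prob + weight * relatedness
    best = max(scores, key=scores.get)
    return best, scores


# Toy data: the classifier is undecided, the keyword sets break the tie.
model_probs = {"sports": 0.5, "finance": 0.5}
keyword_embeddings = {"sports": [1.0, 0.0], "finance": [0.0, 1.0]}
best_label, combined_scores = combine(model_probs, [1.0, 0.0], keyword_embeddings)
print(best_label, combined_scores)
```

In this toy case the classifier alone cannot separate the two classes, and the relatedness term decides the outcome.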
Kuilboer, Jean-Pierre, Stull, Tristan.  2021.  Text Analytics and Big Data in the Financial domain. 2021 16th Iberian Conference on Information Systems and Technologies (CISTI). :1–4.
This research attempts to provide some insights on the application of text mining and Natural Language Processing (NLP). The application domain is consumer complaints about financial institutions in the USA. As an advanced analytics discipline embedded within the Big Data paradigm, the practice of text analytics contains elements of emergent knowledge processes. Since our experiment should be able to scale up we make use of a pipeline based on Spark-NLP. The usage scenario is adapting the model to a specific industrial context and using the dataset offered by the "Consumer Financial Protection Bureau" to illustrate the application.
Ndichu, Samuel, Ban, Tao, Takahashi, Takeshi, Inoue, Daisuke.  2021.  A Machine Learning Approach to Detection of Critical Alerts from Imbalanced Multi-Appliance Threat Alert Logs. 2021 IEEE International Conference on Big Data (Big Data). :2119–2127.
The extraordinary number of alerts generated by network intrusion detection systems (NIDS) can desensitize security analysts tasked with incident response. Security information and event management systems (SIEMs) perform some rudimentary automation but cannot replicate the decision-making process of a skilled analyst. Machine learning and artificial intelligence (AI) can detect patterns in data with appropriate training. In practice, the majority of the alert data comprises false alerts, and true alerts form only a small proportion. Consequently, a naive engine that classifies all security alerts into the majority class can yield a superficial high accuracy close to 100%. Without any correction for the class imbalance, the false alerts will dominate algorithmic predictions resulting in poor generalization performance. We propose a machine-learning approach to address the class imbalance problem in multi-appliance security alert data and automate the security alert analysis process performed in security operations centers (SOCs). We first used the neighborhood cleaning rule (NCR) to identify and remove ambiguous, noisy, and redundant false alerts. Then, we applied the support vector machine synthetic minority oversampling technique (SVMSMOTE) to generate synthetic training true alerts. Finally, we fit and evaluated the decision tree and random forest classifiers. In the experiments, using alert data from eight security appliances, we demonstrated that the proposed method can significantly reduce the need for manual auditing, decreasing the number of uninspected alerts and achieving a performance of 99.524% in recall.
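The point about superficially high accuracy can be checked with a few lines of arithmetic. The sketch below uses a made-up alert log (995 false alerts, 5 true alerts; these counts are not from the paper) and a naive engine that labels everything as the majority class.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall for the positive class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    false_neg = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return accuracy, recall


# Hypothetical log: 0 = false alert (majority), 1 = true alert (minority).
labels = [0] * 995 + [1] * 5
majority_predictions = [0] * len(labels)  # naive engine: always "false alert"

accuracy, recall = accuracy_and_recall(labels, majority_predictions)
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

The naive engine scores 99.5% accuracy while finding none of the true alerts, which is why the abstract reports recall and rebalances the training data before fitting the classifiers.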
Ye, YuGuang.  2021.  Research on the Security Defense Strategy of Smart City's Substitution Computer Network in Big Data. 2021 5th International Conference on Electronics, Communication and Aerospace Technology (ICECA). :1428–1431.
With the rapid development of the information technology era, the era of big data has also arrived. While computer networks are promoting the prosperity and development of society, their applications have become more extensive and in-depth. Smart city video surveillance systems have entered an era of networked surveillance and business integration. The problems are also endless. This article discusses computer network security in the era of big data, hoping to help strengthen the security of computer networks in our country. This paper studies the computer network security prevention strategies of smart cities in the era of big data.
Ahakonye, Love Allen Chijioke, Amaizu, Gabriel Chukwunonso, Nwakanma, Cosmas Ifeanyi, Lee, Jae Min, Kim, Dong-Seong.  2021.  Enhanced Vulnerability Detection in SCADA Systems using Hyper-Parameter-Tuned Ensemble Learning. 2021 International Conference on Information and Communication Technology Convergence (ICTC). :458–461.
The growing interdependency and intricacy of Supervisory Control and Data Acquisition (SCADA) systems in industrial operations increases their vulnerability to malicious threats, and machine learning approaches have been used extensively in research on vulnerability detection. To improve security further, an enhanced vulnerability detection scheme using hyper-parameter-tuned machine learning is proposed for early detection, classification, and mitigation of attacks on SCADA communication and transmission networks by classifying DNS traffic as benign or malicious. The proposed scheme, an ensemble optimizer (GentleBoost) with hyper-parameter tuning, achieved comparatively strong results. In simulation, the proposed scheme performed well within a short time, with an accuracy of 99.49%, a precision of 99.23%, and a recall of 99.75%. The model was also compared with other contemporary algorithms and outperformed all of them, suggesting that it is a viable approach for keeping abreast of SCADA network vulnerabilities and attacks.
Tao, Yunting, Kong, Fanyu, Yu, Jia, Xu, Qiuliang.  2021.  Modification and Performance Improvement of Paillier Homomorphic Cryptosystem. 2021 IEEE 19th International Conference on Embedded and Ubiquitous Computing (EUC). :131–136.
Data security and privacy have become an important problem while big data systems are growing dramatically fast in various application fields. Paillier additive homomorphic cryptosystem is widely used in information security fields such as big data security, communication security, cloud computing security, and artificial intelligence security. However, how to improve its computational performance is one of the most critical problems in practice. In this paper, we propose two modifications to improve the performance of the Paillier cryptosystem. Firstly, we introduce a key generation method to generate the private key with low Hamming weight, and this can be used to accelerate the decryption computation of the Paillier cryptosystem. Secondly, we propose an acceleration method based on Hensel lifting in the Paillier cryptosystem. This method can obtain a faster and improved decryption process by showing the mathematical analysis of the decryption algorithm.
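For readers unfamiliar with the baseline scheme, here is a minimal sketch of the textbook Paillier cryptosystem with generator g = n + 1, showing the additive homomorphism. It is not the modified scheme from the paper: the low-Hamming-weight key generation and the Hensel-lifting acceleration are not implemented, and the tiny primes are for illustration only and offer no security. It assumes Python 3.8 or later for `pow(x, -1, n)`.

```python
import math
import random


def keygen(p, q):
    """Textbook Paillier key generation with g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse of lambda mod n
    return n, (lam, mu)


def encrypt(n, message):
    """Encrypt 0 <= message < n as (1+n)^m * r^n mod n^2."""
    n_sq = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n, private_key, ciphertext):
    """Decrypt via L(c^lambda mod n^2) * mu mod n, where L(u) = (u-1)/n."""
    lam, mu = private_key
    u = pow(ciphertext, lam, n * n)  # the exponentiation that dominates cost
    return ((u - 1) // n * mu) % n


n, private_key = keygen(1009, 1013)  # toy primes, insecure
c1 = encrypt(n, 1234)
c2 = encrypt(n, 4321)
# Multiplying ciphertexts adds the underlying plaintexts.
total = decrypt(n, private_key, (c1 * c2) % (n * n))
print(total)
```

The modular exponentiation in `decrypt` is the expensive step. With square-and-multiply exponentiation, its cost grows with the number of one-bits in the exponent, which is consistent with the abstract's motivation for a private key of low Hamming weight.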
Zobaed, Sakib M, Salehi, Mohsen Amini, Buyya, Rajkumar.  2021.  SAED: Edge-Based Intelligence for Privacy-Preserving Enterprise Search on the Cloud. 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid). :366–375.
Cloud-based enterprise search services (e.g., AWS Kendra) have been entrancing big data owners by offering convenient and real-time search solutions to them. However, the problem is that individuals and organizations possessing confidential big data are hesitant to embrace such services due to valid data privacy concerns. In addition, to offer an intelligent search, these services access the user’s search history, which further jeopardizes the user’s privacy. To overcome the privacy problem, the main idea of this research is to separate the intelligence aspect of the search from its pattern matching aspect. According to this idea, the search intelligence is provided by an on-premises edge tier and the shared cloud tier only serves as an exhaustive pattern matching search utility. We propose the Smartness at Edge (SAED) mechanism, which offers intelligence in the form of semantic and personalized search at the edge tier while maintaining privacy of the search on the cloud tier. At the edge tier, SAED uses a knowledge-based lexical database to expand the query and cover its semantics. SAED personalizes the search via an RNN model that can learn the user’s interest. A word embedding model is used to retrieve documents based on their semantic relevance to the search query. SAED is generic and can be plugged into existing enterprise search systems and enable them to offer intelligent and privacy-preserving search without enforcing any change on them. Evaluation results on two enterprise search systems under real settings and verified by human users demonstrate that SAED can improve the relevancy of the retrieved results by on average ≈24% for plain-text and ≈75% for encrypted generic datasets.
Bhagavan, Srini, Gharibi, Mohamed, Rao, Praveen.  2021.  FedSmarteum: Secure Federated Matrix Factorization Using Smart Contracts for Multi-Cloud Supply Chain. 2021 IEEE International Conference on Big Data (Big Data). :4054–4063.
With increased awareness comes unprecedented expectations. We live in a digital, cloud era wherein the underlying information architectures are expected to be elastic, secure, resilient, and handle petabyte scaling. The expectation of epic proportions from the next generation of the data frameworks is to not only do all of the above but also build it on a foundation of trust and explainability across multi-organization business networks. From cloud providers to automobile industries or even vaccine manufacturers, components are often sourced by a complex, not fully digitized thread of disjoint suppliers. Building Machine Learning and AI-based order fulfillment and predictive models, and remediating issues, is a challenge for multi-organization supply chain automation. We posit that Federated Learning in conjunction with blockchain and smart contracts are technologies primed to tackle data privacy and centralization challenges. In this paper, motivated by challenges in the industry, we propose a decentralized distributed system in conjunction with a recommendation system model (Matrix Factorization) that is trained using Federated Learning on an Ethereum blockchain network. We leverage smart contracts that allow decentralized serverless aggregation to update localized item vectors. Furthermore, we utilize Homomorphic Encryption (HE) to allow sharing the encrypted gradients over the network while maintaining their privacy. Based on our results, we argue that training a model over a serverless Blockchain network using smart contracts will provide the same accuracy as in a centralized model while maintaining our serverless model privacy and reducing the overhead communication to a central server. Finally, we assert that such a system provides transparency, audit-readiness, and deep insights into supply chain operations for enterprise cloud customers, resulting in cost savings and higher Quality of Service (QoS).
Nayak, Lipsa, Jayalakshmi, V..  2021.  A Study of Securing Healthcare Big Data using DNA Encoding based ECC. 2021 6th International Conference on Inventive Computation Technologies (ICICT). :348–352.
The IT world is migrating towards utilizing cloud computing as an essential platform for storing and exchanging data. With the advancement of technology, a colossal amount of data is being generated over time. Cloud computing provides an enormous data storage capacity, with the flexibility of accessing it through virtualized resources without time and place restrictions. Healthcare industries generate large amounts of data from various medical instruments and patients' digital records. To access data remotely from any geographical location, the healthcare industry is moving towards cloud computing. EHR and PHR are patients' digital records, which include sensitive patient information. Despite the capable services provided by cloud computing, security is a primary concern for various organizations. To address the security issue, several cryptographic techniques have been implemented by researchers worldwide. In this paper, a robust cryptographic method is discussed, which is implemented by combining DNA cryptography and Elliptic Curve Cryptography to protect sensitive data in the cloud.
Salman, Zainab, Hammad, Mustafa, Al-Omary, Alauddin Yousif.  2021.  A Homomorphic Cloud Framework for Big Data Analytics Based on Elliptic Curve Cryptography. 2021 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT). :7–11.
Homomorphic Encryption (HE) is a sophisticated and powerful cryptographic system that can preserve the privacy of data in all cases, whether the data is at rest or being processed. All the computations needed by the user or the provider can be done on the encrypted data without any need to decrypt it. However, HE has overheads such as large key sizes and long ciphertexts, and as a result long execution times. This paper proposes a novel solution for big data analytics based on clustering and Elliptic Curve Cryptography (ECC). The Extremely Distributed Clustering technique (EDC) has been used to divide big data into several subsets across cloud computing nodes. Different clustering techniques were investigated, and it was found that using hybrid techniques can improve the performance and efficiency of big data analytics while the data is protected and privacy is preserved using ECC.
Zhai, Hongqun, Zhang, Juan.  2021.  Research on Application of Radio Frequency Identification Technology in Intelligent Maritime Supervision. 2021 IEEE International Conference on Data Science and Computer Application (ICDSCA). :433–436.

The increasing volume of domestic and foreign trade brings new challenges to the efficiency and safety supervision of transportation. With the rapid development of Internet technology, it has opened up a new era of intelligent Internet of Things and the modern marine Internet of Vessels. Radio Frequency Identification technology strengthens the intelligent navigation and management of ships through the unique identification function of “label is object, object is label”. Intelligent Internet of Vessels can achieve the function of “limited electronic monitoring and unlimited electronic deterrence” combined with marine big data and Cyber Physical Systems, and further improve the level of modern maritime supervision and service.

Dijk, Allard.  2021.  Detection of Advanced Persistent Threats using Artificial Intelligence for Deep Packet Inspection. 2021 IEEE International Conference on Big Data (Big Data). :2092–2097.

Advanced persistent threats (APTs) are stealthy threat actors with the skills to gain covert control of a computer network for an extended period of time. They are the highest cyber attack risk factor for large companies and states. A successful attack by an APT can cost millions of dollars, can disrupt civil life, and can cause physical damage. APT groups are typically state-sponsored and are considered the most effective and skilled cyber attackers. APT attacks are executed in several stages, as described in the Lockheed Martin cyber kill chain (CKC). Each of these APT stages can potentially be identified as patterns in network traffic. Using the "APT-2020" dataset, which compiles the characteristics and stages of an APT, we carried out experiments on the detection of anomalous traffic for all APT stages. We compare several artificial intelligence models, such as a stacked autoencoder, a recurrent neural network, and a one-class support vector machine, and show significant improvements in detection in the data exfiltration stage. This dataset is the first to include a data exfiltration stage to experiment on. According to the authors of APT-2020, current models face their biggest challenge in this stage. We introduce a method to successfully detect data exfiltration by analyzing the payload of the network traffic flow. This flow-based deep packet inspection approach improves detection compared to other state-of-the-art methods.

Shi, Pinyi, Song, Yongwook, Fei, Zongming, Griffioen, James.  2021.  Checking Network Security Policy Violations via Natural Language Questions. 2021 International Conference on Computer Communications and Networks (ICCCN). :1–9.
Network security policies provide high-level directives regarding acceptable and unacceptable use of the network. Organizations specify these high-level directives in policy documents written using human-readable natural language. The challenge is to convert these natural language policies to the network configurations/specifications needed to enforce the policy. Network administrators, who are responsible for enforcing the policies, typically translate the policies manually, which is a challenging and error-prone process. As a result, network operators (as well as the policy authors) often want to verify that network policies are being correctly enforced. In this paper, we propose Network Policy Conversation Engine (NPCE), a system designed to help network operators (or policy writers) interact with the network using natural language (similar to the language used in the network policy statements themselves) to understand whether policies are being correctly enforced. The system leverages emerging big data collection and analysis techniques to record flow and packet level activity throughout the network that can be used to answer users' policy questions. The system also takes advantage of recent advances in Natural Language Processing (NLP) to translate natural language policy questions into the corresponding network queries. To evaluate our system, we demonstrate a wide range of policy questions – inspired by actual network policies posted on university websites – that can be asked of the system to determine if a policy violation has occurred.
Vijayalakshmi, K., Jayalakshmi, V..  2021.  Identifying Considerable Anomalies and Conflicts in ABAC Security Policies. 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS). :1273–1280.
Nowadays the security of shared resources and big data is an important and critical issue. With the growth of information technology and social networks, data and resources are shared in distributed environments such as cloud and fog computing. Various access control models protect the shared resources from unauthorized users or malicious intruders. Although the attribute-based access control (ABAC) model meets the complex security requirements of today's new computing technologies, considerable anomalies and conflicts in ABAC policies affect the efficiency of the security system. One important and difficult task is policy validation, which aims to detect and eliminate anomalies and conflicts in policies. Although previous research identified some anomalies, it failed to detect and analyze all considerable anomalies, which leaves systems vulnerable to hacks and attacks. The primary objective of this paper is to study and analyze the possible anomalies and conflicts in ABAC security policies. We discuss and analyze considerable conflicts in policies based on previous research. This paper provides a detailed review of anomalies and conflicts in security policies.
Chen, Hao, Chen, Lin, Kuang, Xiaoyun, Xu, Aidong, Yang, Yiwei.  2021.  Support Forward Secure Smart Grid Data Deduplication and Deletion Mechanism. 2021 2nd Asia Symposium on Signal Processing (ASSP). :67–76.
With the vigorous development of the Internet and the widespread popularity of smart devices, the amount of data generated has increased exponentially, which has promoted the emergence and development of cloud computing and big data. Given cloud computing and big data technology, cloud storage has become a good solution for people to store and manage data at this stage. However, when cloud storage manages and regulates massive amounts of data, its security issues have become increasingly prominent. Aiming at a series of security problems caused by a malicious user's illegal operation of cloud storage and the loss of all data, this paper proposes a threshold signature scheme in which the signing private key is jointly held by multiple users. Under this method, key operations on cloud storage require multiple people to sign, which effectively prevents a small number of malicious users from performing unauthorized data operations. At the same time, the threshold signature method in this paper uses a double update factor algorithm. Even if an attacker obtains the key information for the current period, they cannot calculate the complete key information for earlier or later time periods, which provides two-way security and greatly improves the security of data in cloud storage.
Solanke, Abiodun A., Chen, Xihui, Ramírez-Cruz, Yunior.  2021.  Pattern Recognition and Reconstruction: Detecting Malicious Deletions in Textual Communications. 2021 IEEE International Conference on Big Data (Big Data). :2574–2582.
Digital forensic artifacts aim to provide evidence from digital sources for attributing blame to suspects, assessing their intents, corroborating their statements or alibis, etc. Textual data is a significant source of artifacts, which can take various forms, for instance in the form of communications. E-mails, memos, tweets, and text messages are all examples of textual communications. Complex statistical, linguistic and other scientific procedures can be manually applied to this data to uncover significant clues that point the way to factual information. While expert investigators can undertake this task, there is a possibility that critical information is missed or overlooked. The primary objective of this work is to aid investigators by partially automating the detection of suspicious e-mail deletions. Our approach consists in building a dynamic graph to represent the temporal evolution of communications, and then using a Variational Graph Autoencoder to detect possible e-mail deletions in this graph. Our model uses multiple types of features for representing node and edge attributes, some of which are based on metadata of the messages and the rest are extracted from the contents using natural language processing and text mining techniques. We use the autoencoder to detect missing edges, which we interpret as potential deletions; and to reconstruct their features, from which we emit hypotheses about the topics of deleted messages. We conducted an empirical evaluation of our model on the Enron e-mail dataset, which shows that our model is able to accurately detect a significant proportion of missing communications and to reconstruct the corresponding topic vectors.
Guo, Yifan, Wang, Qianlong, Ji, Tianxi, Wang, Xufei, Li, Pan.  2021.  Resisting Distributed Backdoor Attacks in Federated Learning: A Dynamic Norm Clipping Approach. 2021 IEEE International Conference on Big Data (Big Data). :1172–1182.
With the advance in artificial intelligence and high-dimensional data analysis, federated learning (FL) has emerged to allow distributed data providers to collaboratively learn without direct access to local sensitive data. However, limiting access to individual providers' data inevitably incurs security issues. For instance, backdoor attacks, one of the most popular data poisoning attacks in FL, severely threaten the integrity and utility of the FL system. In particular, backdoor attacks launched by multiple collusive attackers, i.e., distributed backdoor attacks, can achieve high attack success rates and are hard to detect. Existing defensive approaches, like model inspection or model sanitization, often require access to a portion of local training data, which renders them inapplicable to FL scenarios. Recently, the norm clipping approach was developed to defend effectively against distributed backdoor attacks in FL without relying on local training data. However, we discover that adversaries can still bypass this defense scheme through robust training because its norm clipping threshold never changes. In this paper, we propose a novel defense scheme to resist distributed backdoor attacks in FL. Particularly, we first identify that the main reason for the failure of the norm clipping scheme is its fixed threshold in the training process, which cannot capture the dynamic nature of benign local updates during the global model’s convergence. Motivated by this, we devise a novel defense mechanism to dynamically adjust the norm clipping threshold of local updates. Moreover, we provide the convergence analysis of our defense scheme. By evaluating it on four non-IID public datasets, we observe that our defense scheme can effectively resist distributed backdoor attacks and ensure the global model’s convergence. Notably, our scheme reduces the attack success rates by 84.23% on average compared with existing defense schemes.
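To make the idea of a per-round clipping threshold concrete, the sketch below clips each local update to an L2-norm bound that is recomputed every round. Using the median of the current round's update norms as the bound is a hypothetical stand-in chosen for illustration; the abstract does not say how the paper adjusts the threshold, and the function names are likewise assumptions.

```python
import math
import statistics


def l2_norm(vector):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def clip(update, threshold):
    """Scale the update down so its L2 norm is at most the threshold."""
    norm = l2_norm(update)
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def aggregate_round(updates):
    """Clip with a per-round threshold, then average the clipped updates.

    The median-of-norms rule is an assumption, not the paper's rule.
    """
    threshold = statistics.median(l2_norm(u) for u in updates)
    clipped = [clip(u, threshold) for u in updates]
    dim = len(updates[0])
    averaged = [sum(u[i] for u in clipped) / len(clipped) for i in range(dim)]
    return averaged, threshold


# Toy round: the third update has an inflated norm, as a scaled attack might.
round_updates = [[3.0, 4.0], [0.3, 0.4], [30.0, 40.0]]
global_step, round_threshold = aggregate_round(round_updates)
print(round_threshold, global_step)
```

Because the bound is derived from the updates seen in the current round, it shrinks as benign updates shrink during convergence, which a fixed threshold cannot do.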
Shams, Montasir, Pavia, Sophie, Khan, Rituparna, Pyayt, Anna, Gubanov, Michael.  2021.  Towards Unveiling Dark Web Structured Data. 2021 IEEE International Conference on Big Data (Big Data). :5275–5282.
Anecdotal evidence suggests that Web-search engines, together with Knowledge Graphs and Bases such as YAGO [46], DBPedia [13], Freebase [16], and Google Knowledge Graph [52], provide rapid access to most structured information on the Web. However, a closer look reveals a so-called "knowledge gap" [18] that is largely in the dark. For example, a person searching for a relevant job opening has to spend at least 3 hours per week for several months [2] just searching job postings on numerous online job-search engines and employer websites. This seemingly simple task cannot be completed by typing a few keyword queries into a search engine and getting all relevant results in seconds instead of hours, because access to structured data on the Web is still rudimentary. While searching for a job we have many parameters in mind: not just the job title but usually also location, salary range, remote work options (given the recent shift to hybrid workplaces), and many others. Ideally, we would like to write a SQL-style query selecting all job postings that satisfy our requirements, but this is currently impossible, because job postings (and all other Web tables) are structured in many different ways and scattered all over the Web. There is neither a Web-scale generalizable algorithm nor a system to locate and normalize all relevant tables in a category of interest from millions of sources. Here we describe, and evaluate on a corpus of hundreds of millions of Web tables [39], a new scalable iterative training data generation algorithm that produces the high-quality training data required to train Deep- and Machine-learning models capable of generalizing to Web scale. Models trained on such enriched training data deal efficiently with Web-scale heterogeneity, in contrast to the poor generalization performance of models trained without enrichment [20], [25], [38]. Such models are instrumental in bridging the knowledge gap for structured data on the Web.
Nair, Viswajit Vinod, van Staalduinen, Mark, Oosterman, Dion T..  2021.  Template Clustering for the Foundational Analysis of the Dark Web. 2021 IEEE International Conference on Big Data (Big Data). :2542–2549.
The rapid rise of the Dark Web and supportive technologies has served as the backbone facilitating online illegal activity worldwide. Supported by anonymisation technologies such as Tor, these illegal activities have become increasingly elusive to law enforcement agencies. Despite several successful law enforcement operations, illegal activity on the Dark Web is still growing. There are approaches to monitor, mine, and research the Dark Web, all with varying degrees of success. Given the complexity and dynamics of the services offered, we recognize the need for in-depth analysis of the Dark Web with regard to its infrastructures, actors, types of abuse and their relationships. This involves the challenging task of information extraction from the very heterogeneous collection of web pages that make up the Dark Web. Most providers develop and deploy their services on top of standard frameworks such as WordPress, Simple Machine Forum, phpBB and several others. As a result, these service providers publish a significant number of pages based on similar structural and stylistic templates. We propose an efficient, scalable, repeatable and accurate approach to cluster Dark Web pages based on those structural and stylistic features. Extracting relevant information from those clusters should make it feasible to conduct in-depth Dark Web analysis. This paper presents our clustering algorithm to accelerate information extraction, and as a result improve attribution of digital traces to infrastructures or individuals in the fight against cyber crime.
Dinh, Phuc Trinh, Park, Minho.  2021.  BDF-SDN: A Big Data Framework for DDoS Attack Detection in Large-Scale SDN-Based Cloud. 2021 IEEE Conference on Dependable and Secure Computing (DSC). :1–8.
Software-defined networking (SDN), which is now used extensively in a variety of practical settings, provides a new way to manage networks by separating the data plane from the control plane. However, SDN is particularly vulnerable to Distributed Denial of Service (DDoS) attacks because of its centralized control logic. Many studies have proposed machine-learning-based schemes to tackle DDoS attacks in an SDN design; however, these feature-based detection schemes are highly resource-intensive and cannot perform reliably in a large-scale SDN network where a massive amount of traffic data is generated from both the control and data planes. This can deplete computing resources, degrade network performance, or even shut down the network systems owing to resource exhaustion. To address the above challenges, this paper proposes a big data framework to overcome traditional data processing limitations and to exploit distributed resources effectively for the most compute-intensive tasks, such as DDoS attack detection using machine learning techniques. We demonstrate the robustness, scalability, and effectiveness of our framework through practical experiments.
Li, Xin, Yi, Peng, Jiang, Yiming, Lu, Xiangyu.  2021.  Traffic Anomaly Detection Algorithm Based on Improved Salp Swarm Optimal Density Peak Clustering. 2021 4th International Conference on Artificial Intelligence and Big Data (ICAIBD). :187–191.

To address the low accuracy and poor performance caused by the lack of data labels in most real network traffic, an optimized density peak clustering method based on an improved salp swarm algorithm is proposed for traffic anomaly detection. Through cosine decline and chaos strategies, the improved salp swarm algorithm both accelerates convergence and enhances search ability. Moreover, we use the improved salp swarm algorithm to adaptively search for the best truncation distance of density peak clustering, which avoids the subjectivity and uncertainty of selecting the parameter manually. Experimental results on the NSL-KDD dataset show that the improved salp swarm algorithm achieves faster convergence and higher precision, increases the average anomaly detection accuracy by 4.74% and the detection rate by 6.14%, and reduces the average false positive rate by 7.38%.
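For readers unfamiliar with density peak clustering, the role of the truncation distance can be illustrated with a minimal sketch. It is not the paper's method: it assumes the standard cutoff-kernel formulation (local density as the number of neighbours within the truncation distance, and separation as the distance to the nearest denser point), uses a hand-picked `d_c` where the paper searches for it with the improved salp swarm algorithm, and the function name and toy points are hypothetical.

```python
import math


def density_peak_scores(points, d_c):
    """Return (rho, delta) for each point under the cutoff-kernel definition.

    rho[i]   = number of other points closer than the truncation distance d_c
    delta[i] = distance to the nearest point with strictly higher density;
               for points with no denser neighbour, the largest distance
               to any other point (a common convention, assumed here).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(denser) if denser else max(dist[i]))
    return rho, delta


# Toy data: three tightly grouped points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
rho, delta = density_peak_scores(points, d_c=0.5)

# A point with low density that lies far from any denser point is an
# anomaly candidate under this formulation.
for p, r, d in zip(points, rho, delta):
    print(p, "rho =", r, "delta =", round(d, 3))
```

The result depends directly on `d_c`: a value that is too small gives every point a density of zero, and one that is too large merges everything into a single dense region, which is why the abstract stresses choosing it adaptively.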

Pölöskei, István.  2021.  Continuous natural language processing pipeline strategy. 2021 IEEE 15th International Symposium on Applied Computational Intelligence and Informatics (SACI). :000221–000224.
Natural language processing (NLP) is a branch of artificial intelligence. The quality of a constructed model depends entirely on the quality of its training dataset. A data streaming pipeline is a glue application that provides a managed connection from data sources to machine learning methods. The recommended NLP pipeline composition has well-defined procedures. The implemented message broker design is a common mechanism for delivering events. It makes it possible to construct a robust training dataset for machine learning use cases and to serve the model's input. The reconstructed dataset is a valid input for the machine learning processes. Based on the data pipeline's output, model re-creation and redeployment can be scheduled automatically.
Yang, Cuicui, Liu, Pinjie.  2021.  Big Data Nearest Neighbor Similar Data Retrieval Algorithm based on Improved Random Forest. 2021 International Conference on Big Data Analysis and Computer Science (BDACS). :175–178.
In big data nearest neighbor similar data retrieval, retrieval accuracy is low because of the way data features are extracted. Therefore, this paper proposes a big data nearest neighbor similar data retrieval algorithm based on an improved random forest. By improving the random forest model and constructing random decision trees, the characteristics of the current nearest neighbor big data are clarified. A hash code is then generated from the improved random forest. Finally, combined with the Hamming distance calculation method, nearest neighbor similar data retrieval of big data is realized. The experimental results show that in the multi-label environment, the retrieval accuracy is improved by 9% and 10%. In the single-label environment, the similar data retrieval accuracy of the algorithm is improved by 12% and 28% respectively.
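The final retrieval step described above, ranking stored items by Hamming distance between hash codes, can be sketched as follows. This is only an illustration under stated assumptions: the hash codes are hand-written 8-bit integers rather than codes generated by the paper's improved random forest, and the function names and item labels are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer-encoded hash codes differ."""
    return bin(a ^ b).count("1")


def nearest_neighbors(query: int, codes: dict[str, int], k: int = 2) -> list[tuple[str, int]]:
    """Return the k stored items whose hash codes are closest to the query.

    Ties in distance are broken by item name so the ordering is deterministic.
    """
    ranked = sorted(
        ((name, hamming_distance(query, code)) for name, code in codes.items()),
        key=lambda pair: (pair[1], pair[0]),
    )
    return ranked[:k]


# Hypothetical 8-bit hash codes standing in for forest-generated codes.
codes = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110011,
    "doc_c": 0b01001101,
    "doc_d": 0b10100010,
}
query = 0b10110110

result = nearest_neighbors(query, codes, k=2)
print(result)
```

Because the comparison reduces to an XOR and a bit count, retrieval over compact binary codes is cheap compared with distance computations on the original feature vectors, which is the usual motivation for hashing-based nearest neighbor search.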