Biblio

Filters: Keyword is supervised learning
2021-06-24
Saletta, Martina, Ferretti, Claudio.  2020.  A Neural Embedding for Source Code: Security Analysis and CWE Lists. 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech). :523–530.
In this paper, we design a technique for mapping the source code into a vector space and we show its application in the recognition of security weaknesses. By applying ideas commonly used in Natural Language Processing, we train a model for producing an embedding of programs starting from their Abstract Syntax Trees. We then show how such embedding is able to infer clusters roughly separating different classes of software weaknesses. Even if the training of the embedding is unsupervised and made on a generic Java dataset, we show that the model can be used for supervised learning of specific classes of vulnerabilities, helping to capture some features distinguishing them in code. Finally, we discuss how our model performs over the different types of vulnerabilities categorized by the CWE initiative.
2021-05-18
Zeng, Jingxiang, Nie, Xiaofan, Chen, Liwei, Li, Jinfeng, Du, Gewangzi, Shi, Gang.  2020.  An Efficient Vulnerability Extrapolation Using Similarity of Graph Kernel of PDGs. 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). :1664–1671.
Discovering the potential vulnerabilities in software plays a crucial role in ensuring the security of computer systems. This paper proposes a method that can assist security auditors with the analysis of source code. When security auditors identify new vulnerabilities, our method can be adopted to give the auditors a list of recommended candidates that may contain the same vulnerabilities. Our method relies on graph representation to automatically extract the mode of the PDG (program dependence graph, a structure composed of control dependence and data dependence). Besides, it can be applied to the vulnerability extrapolation scenario, thus reducing the amount of code to audit. We worked on an open-source vulnerability test set called Juliet. According to the evaluation results, the clustering effect produced is satisfactory, so the feature vectors extracted by the Graph2Vec model are applied to labeling, and supervised learning indicators are adopted to assess the model's ability to extract features. On a total of 12,000 small data sets, the training score of the model can reach up to 99.2%, and the test score can reach a maximum of 85.2%. Finally, the recommendation effect of our work is verified as satisfactory.
2021-05-13
Wenhui, Sun, Kejin, Wang, Aichun, Zhu.  2020.  The Development of Artificial Intelligence Technology And Its Application in Communication Security. 2020 International Conference on Computer Engineering and Application (ICCEA). :752–756.
Artificial intelligence has been widely used in industries such as smart manufacturing, medical care and home furnishings. Among them, the value of the application in communication security is very important. This paper makes a further exploration of the artificial intelligence technology and its application, and gives a detailed analysis of its development, standardization and the application.
2021-05-05
Hallaji, Ehsan, Razavi-Far, Roozbeh, Saif, Mehrdad.  2020.  Detection of Malicious SCADA Communications via Multi-Subspace Feature Selection. 2020 International Joint Conference on Neural Networks (IJCNN). :1–8.
Security maintenance of Supervisory Control and Data Acquisition (SCADA) systems has been a point of interest during recent years. Numerous research works have been dedicated to the design of intrusion detection systems for securing SCADA communications. Nevertheless, these data-driven techniques are usually dependent on the quality of the monitored data. In this work, we propose a novel feature selection approach, called MSFS, to tackle undesirable quality of data caused by feature redundancy. In contrast to most feature selection techniques, the proposed method models each class in a different subspace, where it is optimally discriminated. This has been accomplished by resorting to ensemble learning, which enables the usage of multiple feature sets in the same feature space. The proposed method is then utilized to perform intrusion detection in smaller subspaces, which brings about efficiency and accuracy. Moreover, a comparative study is performed on a number of advanced feature selection algorithms. Furthermore, a dataset obtained from the SCADA system of a gas pipeline is employed to enable a realistic simulation. The results indicate the proposed approach extensively improves the detection performance in terms of classification accuracy and standard deviation.
2021-03-04
Wang, Y., Wang, Z., Xie, Z., Zhao, N., Chen, J., Zhang, W., Sui, K., Pei, D..  2020.  Practical and White-Box Anomaly Detection through Unsupervised and Active Learning. 2020 29th International Conference on Computer Communications and Networks (ICCCN). :1–9.

To ensure quality of service and user experience, large Internet companies often monitor various Key Performance Indicators (KPIs) of their systems so that they can detect anomalies and identify failure in real time. However, due to a large number of various KPIs and the lack of high-quality labels, existing KPI anomaly detection approaches either perform well only on certain types of KPIs or consume excessive resources. Therefore, to realize generic and practical KPI anomaly detection in the real world, we propose a KPI anomaly detection framework named iRRCF-Active, which contains an unsupervised and white-box anomaly detector based on Robust Random Cut Forest (RRCF), and an active learning component. Specifically, we propose a novel improved RRCF (iRRCF) algorithm to overcome the drawbacks of applying the original RRCF in KPI anomaly detection. Besides, we also incorporate the idea of active learning to make our model benefit from high-quality labels given by experienced operators. We conduct extensive experiments on a large-scale public dataset and a private dataset collected from a large commercial bank. The experimental results demonstrate that iRRCF-Active performs better than existing traditional statistical methods, unsupervised learning methods and supervised learning methods. Besides, each component in iRRCF-Active has also been demonstrated to be effective and indispensable.

2021-02-23
Liao, D., Huang, S., Tan, Y., Bai, G..  2020.  Network Intrusion Detection Method Based on GAN Model. 2020 International Conference on Computer Communication and Network Security (CCNS). :153–156.

Existing network intrusion detection methods have few labeled samples in the training process, and the detection accuracy is not high. In order to solve this problem, this paper designs a network intrusion detection method based on the GAN model by using the adversarial idea contained in the GAN. The model enhances the original training set by continuously generating samples, thereby expanding the labeled sample set. In order to realize the multi-classification of samples, this paper transforms the previous binary classification model of the generative adversarial network into a supervised learning multi-classification model. The loss function of training is redefined, so that the corresponding training method and parameter setting are obtained. Under the same experimental conditions, several performance indicators are used to compare the detection ability of the proposed method, the original classification model and other models. The experimental results show that the method proposed in this paper is more stable, robust, and accurate in its detection rate, has good generalization ability, and can effectively realize network intrusion detection.

2021-02-22
Haile, J., Havens, S..  2020.  Identifying Ubiquitious Third-Party Libraries in Compiled Executables Using Annotated and Translated Disassembled Code with Supervised Machine Learning. 2020 IEEE Security and Privacy Workshops (SPW). :157–162.
The size and complexity of the software ecosystem is a major challenge for vendors, asset owners and cybersecurity professionals who need to understand the security posture of these systems. Annotated and Translated Disassembled Code is a graph-based datastore designed to organize firmware and software analysis data across builds, packages and systems, providing a highly scalable platform enabling automated binary software analysis tasks including corpora construction and storage for machine learning. This paper describes an approach for the identification of ubiquitous third-party libraries in firmware and software using Annotated and Translated Disassembled Code and supervised machine learning. Annotated and Translated Disassembled Code provides matched libraries, function names and addresses of previously unidentified code in software as it is being automatically analyzed. This data can be ingested by other software analysis tools to improve accuracy and save time. Defenders can add the identified libraries to their vulnerability searches and add effective detection and mitigation into their operating environment.
2020-12-11
Fan, M., Luo, X., Liu, J., Wang, M., Nong, C., Zheng, Q., Liu, T..  2019.  Graph Embedding Based Familial Analysis of Android Malware using Unsupervised Learning. 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). :771–782.

The rapid growth of Android malware has posed severe security threats to smartphone users. On the basis of the familial trait of Android malware observed by previous work, familial analysis is a promising way to help analysts better focus on the commonalities of malware samples within the same families, thus reducing the analytical workload and accelerating malware analysis. The majority of existing approaches rely on supervised learning and face three main challenges, i.e., low accuracy, low efficiency, and the lack of labeled datasets. To address these challenges, we first construct a fine-grained behavior model by abstracting the program semantics into a set of subgraphs. Then, we propose SRA, a novel feature that depicts the similarity relationships between the Structural Roles of sensitive API call nodes in subgraphs. An SRA is obtained based on graph embedding techniques and represented as a vector, so we can effectively reduce the high complexity of graph matching. After that, instead of training a classifier with labeled samples, we construct a malware link network based on SRAs and apply community detection algorithms to it to cluster the unlabeled samples into groups. We implement these ideas in a system called GefDroid that performs Graph embedding based familial analysis of AnDroid malware using unsupervised learning. Moreover, we conduct extensive experiments to evaluate GefDroid on three datasets with ground truth. The results show that GefDroid can achieve high agreement (0.707-0.883 in terms of NMI) between the clustering results and the ground truth. Furthermore, GefDroid requires only linear run-time overhead and takes around 8.6s to analyze a sample on average, which is considerably faster than the previous work.
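The agreement figure quoted in this abstract is normalized mutual information (NMI) between cluster assignments and ground-truth family labels. As a rough illustration of how such a score can be computed, here is a minimal sketch in plain Python. The abstract does not say which normalization the authors used; dividing by the arithmetic mean of the two entropies is an assumption, as are the function name `nmi` and the convention of returning 1.0 when both labelings are a single cluster.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings.

    Assumption: normalized by the arithmetic mean of the two entropies.
    Other normalizations (min, max, geometric mean) give different values.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("labelings must be non-empty and of equal length")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        # Shannon entropy (natural log) of the empirical label distribution.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information from the joint and marginal counts.
    mi = 0.0
    for (t, p), c in count_joint.items():
        mi += (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))

    denom = (entropy(count_true) + entropy(count_pred)) / 2
    # Convention (assumption): two trivial single-cluster labelings agree fully.
    return 1.0 if denom == 0 else mi / denom
```

A score of 1 means the clustering reproduces the ground-truth partition up to renaming of clusters, and a score of 0 means the two partitions are statistically independent.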

2020-11-02
Pan, C., Huang, J., Gong, J., Yuan, X..  2019.  Few-Shot Transfer Learning for Text Classification With Lightweight Word Embedding Based Models. IEEE Access. 7:53296–53304.
Many deep learning architectures have been employed to model the semantic compositionality for text sequences, requiring a huge amount of supervised data for parameters training, making it unfeasible in situations where numerous annotated samples are not available or even do not exist. Different from data-hungry deep models, lightweight word embedding-based models could represent text sequences in a plug-and-play way due to their parameter-free property. In this paper, a modified hierarchical pooling strategy over pre-trained word embeddings is proposed for text classification in a few-shot transfer learning way. The model leverages and transfers knowledge obtained from some source domains to recognize and classify the unseen text sequences with just a handful of support examples in the target problem domain. The extensive experiments on five datasets including both English and Chinese text demonstrate that the simple word embedding-based models (SWEMs) with parameter-free pooling operations are able to abstract and represent the semantic text. The proposed modified hierarchical pooling method exhibits significant classification performance in the few-shot transfer learning tasks compared with other alternative methods.
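The abstract does not describe how the authors modified hierarchical pooling, so the sketch below shows only the plain hierarchical pooling commonly used with simple word embedding-based models: average-pool each sliding window of word vectors, then max-pool across the windows. The function name `hierarchical_pool`, the default window size, and the toy vectors are illustrative assumptions, not details from the paper.

```python
def hierarchical_pool(embeddings, window=3):
    """Average-pool each sliding window of word vectors, then max-pool across windows.

    `embeddings` is a list of equal-length lists of floats, one per word.
    There are no trainable parameters: the output depends only on the
    pre-trained word vectors and the window size.
    """
    if not embeddings:
        raise ValueError("need at least one word vector")
    dim = len(embeddings[0])
    # A sequence shorter than the window is treated as a single window.
    window = min(window, len(embeddings))

    window_means = []
    for start in range(len(embeddings) - window + 1):
        chunk = embeddings[start:start + window]
        window_means.append(
            [sum(vec[d] for vec in chunk) / window for d in range(dim)]
        )

    # Element-wise maximum over all window averages.
    return [max(mean[d] for mean in window_means) for d in range(dim)]
```

For example, `hierarchical_pool([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]], window=2)` averages the two windows to `[2.0, 1.0]` and `[4.0, 3.0]` and returns their element-wise maximum, `[4.0, 3.0]`. Averaging within a window keeps some local word-order information that a single global average would discard.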
Sharma, Sachin, Ghanshala, Kamal Kumar, Mohan, Seshadri.  2018.  A Security System Using Deep Learning Approach for Internet of Vehicles (IoV). 2018 9th IEEE Annual Ubiquitous Computing, Electronics Mobile Communication Conference (UEMCON). :1–5.

The Internet of Vehicles (IoV) will connect not only mobile devices with vehicles, but it will also connect vehicles with each other, and with smart offices, buildings, homes, theaters, shopping malls, and cities. The IoV facilitates optimal and reliable communication services to connected vehicles in smart cities. The backbone of connected vehicle communication is the deployment of critical V2X infrastructure. The spectrum utilization depends on the demand by the end users and the development of infrastructure that includes efficient automation techniques together with the Internet of Things (IoT). The infrastructure enables us to build smart environments for spectrum utilization, which we refer to as Smart Spectrum Utilization (SSU). This paper presents an integrated system consisting of SSU with IoV. However, the tasks of securing IoV and protecting it from cyber attacks present considerable challenges. This paper introduces an IoV security system using a deep learning approach to develop secure applications and reliable services. Deep learning, composed of unsupervised learning and supervised learning, could optimize the IoV security system. The deep learning methodology is applied to monitor security threats. Results from simulations show that the monitoring accuracy of the proposed security system is superior to that of the traditional system.

2020-09-04
Khan, Aasher, Rehman, Suriya, Khan, Muhammad U.S, Ali, Mazhar.  2019.  Synonym-based Attack to Confuse Machine Learning Classifiers Using Black-box Setting. 2019 4th International Conference on Emerging Trends in Engineering, Sciences and Technology (ICEEST). :1–7.
Twitter, being the most popular content sharing platform, is giving rise to automated accounts called “bots”. The majority of the users on Twitter are bots. Various machine learning (ML) algorithms are designed to detect bots while avoiding the vulnerability constraints of ML-based models. This paper exploits vulnerabilities of machine learning (ML) algorithms through a black-box attack. An adversarial text sequence causes deep learning (DL) classifiers for bot detection to misclassify. The literature shows that ML models are vulnerable to attacks. The aim of this paper is to compromise the accuracy of ML-based bot detection algorithms by replacing original words in tweets with their synonyms. Our results show a 7.2% decrease in the accuracy for bot tweets, with bot tweets therefore being classified as legitimate tweets.
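To make the attack idea concrete, here is a minimal sketch of synonym substitution on tweet text. It is not the authors' procedure: the abstract does not say how synonyms are chosen or which words are replaced, so the `SYNONYMS` table, the function name, and the choice to leave @mentions and #hashtags untouched are all hypothetical.

```python
import re

# Hypothetical synonym table; a real attack would draw on a thesaurus or lexical database.
SYNONYMS = {"free": "complimentary", "win": "earn", "click": "tap"}

# Purely alphabetic tokens that are not part of an @mention, #hashtag,
# or a longer alphanumeric token.
WORD = re.compile(r"(?<![#@\w])[A-Za-z]+(?!\w)")


def substitute_synonyms(text, synonyms=SYNONYMS):
    """Replace each word that has an entry in `synonyms`, preserving initial capitalization."""

    def swap(match):
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            return word  # no synonym known: keep the original word
        return replacement.capitalize() if word[0].isupper() else replacement

    return WORD.sub(swap, text)
```

Called on `"Click here to win a free phone"`, this returns `"Tap here to earn a complimentary phone"`. In a black-box setting the attacker would submit such rewritten text to the classifier and observe only its predicted label, without access to the model's parameters.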
2020-07-16
Ayub, Md. Ahsan, Smith, Steven, Siraj, Ambareen.  2019.  A Protocol Independent Approach in Network Covert Channel Detection. 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC). :165–170.

Network covert channels are used in various cyberattacks, including disclosure of sensitive information and enabling stealth tunnels for botnet commands. With time and technology, covert channels are becoming more prevalent, complex, and difficult to detect. The current methods for detection are protocol and pattern specific. This requires the investment of significant time and resources into application of various techniques to catch the different types of covert channels. This paper reviews several patterns of network storage covert channels, describes generation of network traffic dataset with covert channels, and proposes a generic, protocol-independent approach for the detection of network storage covert channels using a supervised machine learning technique. The implementation of the proposed generic detection model can lead to a reduction of necessary techniques to prevent covert channel communication in network traffic. The datasets we have generated for experimentation represent storage covert channels in the IP, TCP, and DNS protocols and are available upon request for future research in this area.

2020-07-13
Agrawal, Shriyansh, Sanagavarapu, Lalit Mohan, Reddy, YR.  2019.  FACT - Fine grained Assessment of web page CredibiliTy. TENCON 2019 - 2019 IEEE Region 10 Conference (TENCON). :1088–1097.
With more than a trillion web pages, there is a plethora of content available for consumption. Search engine queries invariably lead to overwhelming information, parts of it relevant and others irrelevant. Often the information provided can be conflicting, ambiguous, and inconsistent, contributing to the loss of credibility of the content. In the past, researchers have proposed approaches for credibility assessment and enumerated factors influencing the credibility of web pages. In this work, we detail the WEBCred framework for automated genre-aware credibility assessment of web pages. We developed a tool based on the proposed framework to extract web page feature instances and identify the genre a web page belongs to while assessing its Genre Credibility Score (GCS). We validated our approach on an 'Information Security' dataset of 8,550 URLs with 171 features across 7 genres. The supervised learning algorithm, Gradient Boosted Decision Tree, classified genres with 88.75% testing accuracy over 10-fold cross-validation, an improvement over the current benchmark. We also examined our approach on 'Health' domain web pages and had comparable results. The calculated GCS correlated 69% with the crowdsourced Web Of Trust (WOT) score and 13% with the algorithm-based Alexa ranking across 5 Information security groups. This variance in correlation indicates that our GCS approach aligns with the human way (WOT) rather than the algorithmic way (Alexa) of web assessment in both experiments.
2020-07-03
Yan, Haonan, Li, Hui, Xiao, Mingchi, Dai, Rui, Zheng, Xianchun, Zhao, Xingwen, Li, Fenghua.  2019.  PGSM-DPI: Precisely Guided Signature Matching of Deep Packet Inspection for Traffic Analysis. 2019 IEEE Global Communications Conference (GLOBECOM). :1–6.

In the field of network traffic analysis, Deep Packet Inspection (DPI) technology is widely used at present. However, the increase in network traffic has brought tremendous processing pressure on the DPI. Consequently, detection speed has become the bottleneck of the entire application. In order to speed up the traffic detection of DPI, a lot of research has been devoted to improving signature matching algorithms, which are the most influential factor in DPI performance. In this paper, we present a novel method from a different angle called Precisely Guided Signature Matching (PGSM). Instead of matching packets with signatures directly, we use supervised learning to automatically derive the rules of a specific protocol in PGSM. By testing a packet against the rules, it can be decided when and with which signatures the target packet should be matched. Thus, the PGSM method reduces the number of aimless matches, which are useless and numerous. After proposing PGSM, we build a framework called PGSM-DPI to verify the effectiveness of guidance rules. The PGSM-DPI framework consists of the PGSM method and an open source DPI library. The framework runs on a distributed platform with better throughput and computational performance. Finally, the experimental results demonstrate that our PGSM-DPI can reduce the original DPI time by 59.23% and increase throughput by 21.31%. Besides, all source code and experimental results can be accessed on our GitHub.

2020-06-12
Min, Congwen, Li, Yi, Fang, Li, Chen, Ping.  2019.  Conditional Generative Adversarial Network on Semi-supervised Learning Task. 2019 IEEE 5th International Conference on Computer and Communications (ICCC). :1448–1452.

Semi-supervised learning has recently gained increasing attention because it can combine abundant unlabeled data with carefully labeled data to train deep neural networks. However, common semi-supervised methods rely heavily on the quality of pseudo labels. In this paper, we propose a new semi-supervised learning method based on the Generative Adversarial Network (GAN), using the discriminator to learn the features of both labeled and unlabeled data instead of generating pseudo labels that cannot all be correct. Our approach, semi-supervised conditional GAN (SCGAN), builds upon the conditional GAN model, extending it to semi-supervised learning by changing the discriminator's output to a classification output and a real or false output. We evaluate our approach against a basic semi-supervised model on the MNIST dataset. The results show that our approach achieves a classification accuracy of 84.15%, outperforming the basic semi-supervised model at 72.94%, when labeled data are 1/600 of all data.

2020-05-18
Nambiar, Sindhya K, Leons, Antony, Jose, Soniya, Arunsree.  2019.  Natural Language Processing Based Part of Speech Tagger using Hidden Markov Model. 2019 Third International conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC). :782–785.
In various natural language processing applications, part-of-speech (POS) tagging is performed as a preprocessing step. Various techniques have been explored to make POS tagging accurate, but not much work has been done for Indian languages. This paper describes methods to build a part-of-speech tagger using a hidden Markov model. A supervised learning approach is implemented in which already tagged sentences in Malayalam are used to build the hidden Markov model.
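A supervised hidden Markov model tagger of the kind described here is typically built in two steps: count tag transitions and word emissions in a hand-tagged corpus, then decode new sentences with the Viterbi algorithm. The sketch below illustrates that general recipe only. The abstract gives no implementation details, so the function names, the `"<s>"` start symbol, and the add-one smoothing are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from hand-tagged sentences."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # assumed sentence-start symbol
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (the smoothing scheme is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for `words` under the counted model."""
    if not words:
        return []
    tags = list(emissions)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserves mass for unseen words

    # best[i][tag] = (log score of the best path ending in `tag` at position i, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions["<s>"], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], tag, n_tags) + emit, p)
                for p in tags
            )

    # Backtrack from the best final tag to recover the full path.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]
```

Training only requires tagged sentences as lists of `(word, tag)` pairs, so the same code applies to any language for which such a corpus exists.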
2020-05-15
Jeyasudha, J., Usha, G..  2018.  Detection of Spammers in the Reconnaissance Phase by machine learning techniques. 2018 3rd International Conference on Inventive Computation Technologies (ICICT). :216–220.

The reconnaissance phase is where attackers identify their targets and collect information from professional social networks, which can be used to select and exploit targeted employees to penetrate an organization. Here, a framework is proposed for the early detection of attackers in the reconnaissance phase, highlighting the common characteristic behavior among attackers in professional social networks. The framework also creates artificial honeypot profiles within the organizational social network, which can be used to detect a potential incoming threat. By analyzing a dataset of social network profiles in combination with machine learning techniques, a DspamRPfast model is proposed for the creation of a classifier system that predicts the probability of a profile being fake or malicious and filters such profiles out using XGBoost, for faster classification and a greater accuracy of 84.8%.

2020-02-10
Dan, Kenya, Kitagawa, Naoya, Sakuraba, Shuji, Yamai, Nariyoshi.  2019.  Spam Domain Detection Method Using Active DNS Data and E-Mail Reception Log. 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC). 1:896–899.

E-mail is widespread and an essential communication technology in modern times. Since e-mail has problems with spam mails and spoofed e-mails, countermeasures are required. Although SPF, DKIM and DMARC have been proposed as sender domain authentication, these mechanisms cannot detect non-spoofing spam mails. To overcome this issue, this paper proposes a method to detect spam domains by supervised learning with features extracted from e-mail reception log and active DNS data, such as the result of Sender Authentication, the Sender IP address, the number of each DNS record, and so on. As a result of the experiment, our method can detect spam domains with 88.09% accuracy and 97.11% precision. We confirmed that our method can detect spam domains with detection accuracy 19.40% higher than the previous study by utilizing not only active DNS data but also e-mail reception log in combination.

2020-01-28
Bernardi, Mario Luca, Cimitile, Marta, Martinelli, Fabio, Mercaldo, Francesco.  2019.  Keystroke Analysis for User Identification Using Deep Neural Networks. 2019 International Joint Conference on Neural Networks (IJCNN). :1–8.

The current authentication systems based on password and PIN code are not enough to guard against attacks from malicious users. For this reason, in recent years, several studies have been proposed with the aim of identifying users based on their typing dynamics. In this paper, we propose a deep neural network architecture aimed at discriminating between different users using a set of keystroke features. The idea behind the proposed method is to identify the users silently and continuously during their typing on a monitored system. To perform such user identification effectively, we propose a feature model able to capture the typing style that is specific to each given user. The proposed approach is evaluated on a large dataset derived by integrating two real-world datasets from existing studies. The merged dataset contains a total of 1530 different users, each writing a set of different typing samples. Several deep neural networks, with an increasing number of hidden layers and two different sets of features, are tested with the aim of finding the best configuration. The final best classifier scores a precision equal to 0.997, a recall equal to 0.99 and an accuracy equal to 99% using an MLP deep neural network with 9 hidden layers. Finally, the performance obtained by using the deep learning approach is also compared with the performance of a traditional decision-tree machine learning algorithm, attesting to the effectiveness of the deep learning-based classifiers in the domain of keystroke analysis.

2020-01-27
Farag, Nadine, El-Seoud, Samir Abou, McKee, Gerard, Hassan, Ghada.  2019.  Bullying Hurts: A Survey on Non-Supervised Techniques for Cyber-Bullying Detection. Proceedings of the 2019 8th International Conference on Software and Information Engineering. :85–90.
The contemporary period is scarred by the predominant place of social media in everyday life. Despite social media being a useful tool for communication and social gathering, it also offers opportunities for harmful criminal activities. One of these activities is cyber-bullying, enabled through the abuse and mistreatment of the internet as a means of bullying others virtually. As a way of minimising this occurrence, computer-based research is carried out by the scientific research community to detect cyber-bullying. An extensive literature search shows that supervised learning techniques are the most commonly used methods for cyber-bullying detection. However, some non-supervised techniques and other approaches have proven to be effective for cyber-bullying detection. This paper, therefore, surveys recent research on non-supervised techniques and offers some suggestions for future research in textual-based cyber-bullying detection, including detecting roles, detecting emotional state, automated annotation and stylometric methods.
2019-12-30
Heydari, Mohammad, Mylonas, Alexios, Katos, Vasilios, Balaguer-Ballester, Emili, Tafreshi, Vahid Heydari Fami, Benkhelifa, Elhadj.  2019.  Uncertainty-Aware Authentication Model for Fog Computing in IoT. 2019 Fourth International Conference on Fog and Mobile Edge Computing (FMEC). :52–59.

Since the term “Fog Computing” was coined by Cisco Systems in 2012, security and privacy issues of this promising paradigm have remained open challenges. Among various security challenges, Access Control is a crucial concern for all cloud computing-like systems (e.g. Fog computing, Mobile edge computing) in the IoT era. Assigning the precise level of access in such an inherently scalable, heterogeneous and dynamic environment is not easy to perform. This work defines the uncertainty challenge for the authentication phase of access control in fog computing because, on one hand, fog has a number of characteristics that amplify uncertainty in authentication and, on the other hand, applying traditional access control models does not result in a flexible and resilient solution. Therefore, we have proposed a novel prediction model based on an extension of the Attribute Based Access Control (ABAC) model. Our data-driven model is able to handle uncertainty in authentication. It is also able to consider the mobility of mobile edge devices in order to handle authentication. In doing so, we have built our model using and comparing four supervised classification algorithms, namely Decision Tree, Naïve Bayes, Logistic Regression and Support Vector Machine. Our model can achieve authentication performance with 88.14% accuracy using Logistic Regression.

2019-11-26
Hassanpour, Reza, Dogdu, Erdogan, Choupani, Roya, Goker, Onur, Nazli, Nazli.  2018.  Phishing E-Mail Detection by Using Deep Learning Algorithms. Proceedings of the ACMSE 2018 Conference. :45:1-45:1.

Phishing e-mails are considered spam e-mails that aim to collect sensitive personal information about users via the network. Since the main purpose of this behavior is mostly to harm users financially, it is vital to detect these phishing or spam e-mails immediately to prevent unauthorized access to users' vital information. To detect phishing e-mails, using a quick and robust classification method is important. Considering the billions of e-mails on the Internet, this classification process is supposed to be done in a limited time to analyze the results. In this work, we present some of the early results on the classification of spam email using deep learning and machine learning methods. We utilize word2vec to represent emails instead of using the popular keyword or other rule-based methods. Vector representations are then fed into a neural network to create a learning model. We have tested our method on an open dataset and found over 96% accuracy levels with the deep learning classification methods in comparison to the standard machine learning algorithms.

2018-11-19
Zhao, Zhi-Lin, Wang, Chang-Dong, Lin, Kun-Yu, Lai, Jian-Huang.  2017.  Missing Value Learning. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. :2427–2430.

Missing value is common in many machine learning problems and much effort has been made to handle missing data to improve the performance of the learned model. Sometimes, our task is not to train a model using those unlabeled/labeled data with missing value but process examples according to the values of some specified features. So, there is an urgent need of developing a method to predict those missing values. In this paper, we focus on learning from the known values to learn missing value as close as possible to the true one. It's difficult for us to predict missing value because we do not know the structure of the data matrix and some missing values may relate to some other missing values. We solve the problem by recovering the complete data matrix under the three reasonable constraints: feature relationship, upper recovery error bound and class relationship. The proposed algorithm can deal with both unlabeled and labeled data and generative adversarial idea will be used in labeled data to transfer knowledge. Extensive experiments have been conducted to show the effectiveness of the proposed algorithms.

2018-08-23
Halawa, Hassan, Ripeanu, Matei, Beznosov, Konstantin, Coskun, Baris, Liu, Meizhu.  2017.  An Early Warning System for Suspicious Accounts. Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. :51–52.
In the face of large-scale automated cyber-attacks on large online services, fast detection and remediation of compromised accounts are crucial to limit the spread of new attacks and to mitigate the overall damage to users, companies, and the public at large. We advocate a fully automated approach based on machine learning to enable large-scale online service providers to quickly identify potentially compromised accounts. We develop an early warning system for the detection of suspicious account activity with the goal of quick identification and remediation of compromised accounts. We demonstrate the feasibility and applicability of our proposed system in a four-month experiment at a large-scale online service provider using real-world production data encompassing hundreds of millions of users. We show that, even using only login data, features with low computational cost, and a basic model selection approach, around one out of five accounts later flagged as suspicious are correctly predicted a month in advance based on one week's worth of their login activity.
2018-06-07
Uwagbole, S. O., Buchanan, W. J., Fan, L..  2017.  An applied pattern-driven corpus to predictive analytics in mitigating SQL injection attack. 2017 Seventh International Conference on Emerging Security Technologies (EST). :12–17.

Emerging computing relies heavily on secure backend storage for the massive size of big data originating from Internet of Things (IoT) smart devices to Cloud-hosted web applications. Structured Query Language (SQL) Injection Attack (SQLIA) remains an intruder's exploit of choice to pilfer confidential data from the back-end database, with damaging ramifications. The existing approaches all predate the new emerging computing in the context of Internet big data mining and as such lack the ability to cope with new signatures concealed in a large volume of web requests over time. Also, these existing approaches were string lookup approaches aimed at the on-premise application domain boundary, not applicable to roaming Cloud-hosted services' edge Software-Defined Network (SDN) to application endpoints with large web request hits. Using a Machine Learning (ML) approach provides scalable big data mining for SQLIA detection and prevention. Unfortunately, the absence of a corpus to train a classifier is a well-known issue in SQLIA research when applying Artificial Intelligence (AI) techniques. This paper presents an application context pattern-driven corpus to train a supervised learning model. The model is trained with the ML algorithms Two-Class Support Vector Machine (TC SVM) and Two-Class Logistic Regression (TC LR), implemented on Microsoft Azure Machine Learning (MAML) studio, to mitigate SQLIA. The scheme presented here then forms the subject of an empirical evaluation using the Receiver Operating Characteristic (ROC) curve.