Biblio

Filters: Keyword is Predictive models
2019-11-12
Werner, Gordon, Okutan, Ahmet, Yang, Shanchieh, McConky, Katie.  2018.  Forecasting Cyberattacks as Time Series with Different Aggregation Granularity. 2018 IEEE International Symposium on Technologies for Homeland Security (HST). :1-7.

Cyber defense can no longer be limited to intrusion detection methods. These systems require malicious activity to enter an internal network before an attack can be detected. Having advanced, predictive knowledge of future attacks allows a potential victim to heighten security and possibly prevent any malicious traffic from breaching the network. This paper investigates the use of Auto-Regressive Integrated Moving Average (ARIMA) models and Bayesian Networks (BN) to predict future cyber attack occurrences and intensities against two target entities. In addition to incident count forecasting, categorical and binary occurrence metrics are proposed to better represent volume forecasts to a victim. Different measurement periods are used in time series construction to better model the temporal patterns unique to each attack type and target configuration, achieving over 86% improvement over baseline forecasts. Using ground truth aggregated over different measurement periods as signals, a BN is trained and tested for each attack type, and the obtained results provide further evidence to support the findings from ARIMA. This work highlights the complexity of cyber attack occurrences; each subset has unique characteristics and is influenced by a number of potential external factors.
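
As an illustration of the ARIMA side of this approach, the following is a minimal sketch of forecasting attack counts aggregated at a daily granularity and then re-aggregating at a coarser one. The synthetic counts, the model order, and the use of statsmodels are assumptions for illustration, not values from the paper.

```python
# Sketch only: ARIMA forecast over a hypothetical daily attack-count series.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
counts = pd.Series(rng.poisson(lam=20, size=180))   # 180 days of synthetic attack counts

model = ARIMA(counts, order=(2, 1, 1)).fit()        # assumed (p, d, q) order
forecast = model.forecast(steps=7)                  # next 7 measurement periods
print(forecast)

# A coarser aggregation granularity (e.g. weekly) can be obtained by re-aggregating
# before fitting, which is the kind of comparison the paper explores.
weekly = counts.groupby(counts.index // 7).sum()
```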

Zhang, Xian, Ben, Kerong, Zeng, Jie.  2018.  Cross-Entropy: A New Metric for Software Defect Prediction. 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS). :111-122.

Defect prediction is an active topic in software quality assurance that can help developers find potential bugs and make better use of resources. To improve prediction performance, this paper introduces cross-entropy, a common measure for natural language, as a new code metric for defect prediction tasks and proposes a framework called DefectLearner for this process. We first build a recurrent neural network language model to learn regularities in source code from a software repository. Based on the trained model, the cross-entropy of each component can be calculated. To evaluate its discrimination for defect-proneness, cross-entropy is compared with 20 widely used metrics on 12 open-source projects. The experimental results show that the cross-entropy metric is more discriminative than 50% of the traditional metrics. In addition, we combine cross-entropy with traditional metric suites for more accurate defect prediction. With cross-entropy added, the performance of prediction models is improved by an average of 2.8% in F1-score.
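
As a rough illustration of the metric itself (not of DefectLearner), the sketch below computes per-token cross-entropy for a code component under a toy smoothed bigram model that stands in for the paper's RNN language model; the tokens, smoothing, and corpus are invented.

```python
# Cross-entropy of a token sequence: average negative log-probability the
# language model assigns to each token; lower values mean more "natural" code.
import math
from collections import Counter

def train_bigram_lm(token_streams):
    unigrams, bigrams = Counter(), Counter()
    for toks in token_streams:
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def cross_entropy(tokens, unigrams, bigrams, vocab_size, alpha=1.0):
    # add-alpha smoothed bigram probabilities
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        total += -math.log2(p)
    return total / max(len(tokens) - 1, 1)

corpus = [["if", "(", "x", ">", "0", ")", "{", "return", "x", ";", "}"]]
uni, bi = train_bigram_lm(corpus)
print(cross_entropy(corpus[0], uni, bi, vocab_size=len(uni)))
```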

Ferenc, Rudolf, Hegedűs, Péter, Gyimesi, Péter, Antal, Gábor, Bán, Dénes, Gyimóthy, Tibor.  2019.  Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions. 2019 IEEE/ACM 7th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE). :8-14.

The rapid rise of cyber-crime activities and the growing number of devices threatened by them place software security issues in the spotlight. As around 90% of all attacks exploit known types of security issues, finding vulnerable components and applying existing mitigation techniques is a viable practical approach for fighting cyber-crime. In this paper, we investigate how state-of-the-art machine learning techniques, including a popular deep learning algorithm, perform in predicting functions with possible security vulnerabilities in JavaScript programs. We applied 8 machine learning algorithms to build prediction models using a new dataset constructed for this research from the vulnerability information in public databases of the Node Security Project and the Snyk platform, and code fixing patches from GitHub. We used static source code metrics as predictors and an extensive grid-search algorithm to find the best performing models. We also examined the effect of various re-sampling strategies for handling the imbalanced nature of the dataset. The best performing algorithm was KNN, which created a model for the prediction of vulnerable functions with an F-measure of 0.76 (0.91 precision and 0.66 recall). Moreover, deep learning, tree and forest based classifiers, and SVM were competitive, with F-measures over 0.70. Although the F-measures did not vary significantly with the re-sampling strategies, the distribution of precision and recall did change. Without re-sampling, the models tended to prefer high precision, while the re-sampling strategies balanced the IR measures.
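
A hedged sketch of this kind of pipeline follows, with synthetic data in place of the JavaScript metric dataset and with scikit-learn and imbalanced-learn assumed as dependencies; the parameter grid and the SMOTE re-sampling choice are illustrative, not the paper's exact settings.

```python
# Static metrics as features -> re-sampling -> grid-searched KNN -> precision/recall/F-measure.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE   # assumes imbalanced-learn is installed

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# One possible re-sampling strategy to handle class imbalance.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": [3, 5, 9], "weights": ["uniform", "distance"]},
                    scoring="f1", cv=5)
grid.fit(X_res, y_res)
print(classification_report(y_te, grid.predict(X_te)))
```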

Wei, Shengjun, Zhong, Hao, Shan, Chun, Ye, Lin, Du, Xiaojiang, Guizani, Mohsen.  2018.  Vulnerability Prediction Based on Weighted Software Network for Secure Software Building. 2018 IEEE Global Communications Conference (GLOBECOM). :1-6.

To build secure communications software, Vulnerability Prediction Models (VPMs) are used to predict vulnerable software modules in the software system before software security testing. At present, many software security metrics have been proposed for designing a VPM. In this paper, we predict vulnerable classes in a software system by establishing the system's weighted software network. The metrics are obtained from the nodes' attributes in the weighted software network. We design and implement a crawler tool to collect all public security vulnerabilities in Mozilla Firefox. Based on these data, the prediction model is trained and tested. The results show that the VPM based on the weighted software network performs well in accuracy, precision, and recall. Compared to other studies, the prediction performance is greatly improved in terms of precision (Pr) and recall (Re).

2019-09-26
Miletić, M., Vukušić, M., Mauša, G., Grbac, T. G..  2018.  Cross-Release Code Churn Impact on Effort-Aware Software Defect Prediction. 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). :1460-1466.

Code churn has been successfully used to identify defect-inducing changes in software development. Our recent analysis of cross-release code churn showed that several design metrics exhibit moderate correlation with the number of defects in complex systems. The goal of this paper is to explore whether cross-release code churn can be used to identify critical design changes and contribute to the prediction of defects for software in evolution. In our case study, we used two types of data from consecutive releases of open-source projects, with and without cross-release code churn, to build standard prediction models. The prediction models were trained on earlier releases and tested on the following ones, evaluating the performance in terms of AUC, GM, and the effort-aware measure Pop. The comparison of their performance was used to answer our research question. The obtained results showed that the prediction model performs better when cross-release code churn is included. A practical implication of this research is to use cross-release code churn to aid in the safe planning of the next release in software development.

2019-08-05
Kaiafas, G., Varisteas, G., Lagraa, S., State, R., Nguyen, C. D., Ries, T., Ourdane, M..  2018.  Detecting Malicious Authentication Events Trustfully. NOMS 2018 - 2018 IEEE/IFIP Network Operations and Management Symposium. :1-6.

Anomaly detection on security logs is receiving more and more attention. Authentication events are an important component of security logs, and being able to produce trustworthy and accurate predictions minimizes the effort of cyber-experts to stop false attacks. Observed events are classified into Normal, for legitimate user behavior, and Malicious, for malevolent actions. These classes are consistently and heavily imbalanced, which makes the classification problem harder; in the commonly used Los Alamos dataset, the malicious class comprises only 0.00033% of the total. This work proposes a novel method to extract advanced composite features, and a supervised learning technique for classifying authentication logs trustfully; the models are Random Forest, LogitBoost, Logistic Regression, and ultimately Majority Voting, which leverages the predictions of the previous models and gives the final prediction for each authentication event. We measure the performance of our experiments using the False Negative Rate and False Positive Rate. Overall, we achieve a zero False Negative Rate (i.e., no attack was missed) and, on average, a False Positive Rate of 0.0019.
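
To sketch the ensemble idea only (not the paper's composite feature extraction), here is a hypothetical majority-voting setup on synthetic imbalanced data, evaluated with false-negative and false-positive rates; a gradient-boosting classifier stands in for LogitBoost, which scikit-learn does not provide directly.

```python
# Hard majority voting over several base classifiers, scored by FNR and FPR.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

vote = VotingClassifier([("rf", RandomForestClassifier(random_state=1)),
                         ("gb", GradientBoostingClassifier(random_state=1)),  # LogitBoost stand-in
                         ("lr", LogisticRegression(max_iter=1000))],
                        voting="hard")
vote.fit(X_tr, y_tr)
tn, fp, fn, tp = confusion_matrix(y_te, vote.predict(X_te)).ravel()
print("FNR:", fn / (fn + tp) if (fn + tp) else 0.0, "FPR:", fp / (fp + tn))
```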

2019-07-01
Clemente, C. J., Jaafar, F., Malik, Y..  2018.  Is Predicting Software Security Bugs Using Deep Learning Better Than the Traditional Machine Learning Algorithms? 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS). :95–102.

Software insecurity is being identified as one of the leading causes of security breaches. In this paper, we revisited one of the strategies for addressing software insecurity, which is the use of software quality metrics. We utilized a multilayer deep feedforward network to examine whether there is a combination of metrics that can predict the appearance of security-related bugs. We also applied traditional machine learning algorithms such as decision tree, random forest, naïve Bayes, and support vector machines and compared the results with those of the Deep Learning technique. The results successfully demonstrated that it is possible to develop an effective predictive model to forecast software insecurity based on software metrics using Deep Learning. All the models generated showed an accuracy of more than sixty percent, with Deep Learning leading the list. This finding shows that Deep Learning methods and a combination of software metrics can be tapped to create a better forecasting model, thereby aiding software developers in predicting security bugs.

2019-05-08
Yaseen, Q., Alabdulrazzaq, A., Albalas, F..  2019.  A Framework for Insider Collusion Threat Prediction and Mitigation in Relational Databases. 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC). :0721–0727.

This paper proposes a framework for predicting and mitigating insider collusion threats in relational database systems. The proposed model provides a robust technique for database architects and administrators to predict insider collusion threats when designing a database schema or when granting privileges. Moreover, it proposes a real-time monitoring technique that monitors the growing knowledge bases of insiders while executing transactions, and the possible collusion insider attacks that may be launched based on insiders' accesses and inferences. Furthermore, the paper proposes a mitigation technique based on the segregation-of-duties principle and the discovered collusion insider threat. The proposed model was tested to show its usefulness and applicability.

2019-02-08
Nichols, W., Hawrylak, P. J., Hale, J., Papa, M..  2018.  Methodology to Estimate Attack Graph System State from a Simulation of a Nuclear Research Reactor. 2018 Resilience Week (RWS). :84-87.
Hybrid attack graphs are a powerful tool for analyzing the cybersecurity of a cyber-physical system. However, it is important to ensure that this tool correctly models reality, particularly when modelling safety-critical applications such as a nuclear reactor. This verification can be accomplished by automatically checking, from the final state of a simulation, that the simulation reaches the state predicted by the attack graph. As such, a mechanism to estimate whether a simulation reaches the expected state in a hybrid attack graph is proposed here for the nuclear reactor domain.
2019-01-21
Venkatesan, S., Sugrim, S., Izmailov, R., Chiang, C. J., Chadha, R., Doshi, B., Hoffman, B., Newcomb, E. Allison, Buchler, N..  2018.  On Detecting Manifestation of Adversary Characteristics. MILCOM 2018 - 2018 IEEE Military Communications Conference (MILCOM). :431–437.

Adversaries are conducting attack campaigns with increasing levels of sophistication. Additionally, with the prevalence of out-of-the-box toolkits that simplify attack operations during different stages of an attack campaign, multiple new adversaries and attack groups have appeared over the past decade. Characterizing the behavior and the modus operandi of different adversaries is critical in identifying the appropriate security maneuver to detect and mitigate the impact of an ongoing attack. To this end, in this paper, we study two characteristics of an adversary: Risk-averseness and Experience level. Risk-averse adversaries are more cautious during their campaign while fledgling adversaries do not wait to develop adequate expertise and knowledge before launching attack campaigns. One manifestation of these characteristics is through the adversary's choice and usage of attack tools. To detect these characteristics, we present multi-level machine learning (ML) models that use network data generated while under attack by different attack tools and usage patterns. In particular, for risk-averseness, we considered different configurations for scanning tools and trained the models in a testbed environment. The resulting model was used to predict the cautiousness of different red teams that participated in the Cyber Shield ‘16 exercise. The predictions matched the expected behavior of the red teams. For Experience level, we considered publicly-available remote access tools and usage patterns. We developed a Markov model to simulate usage patterns of attackers with different levels of expertise and through experiments on CyberVAN, we showed that the ML model has a high accuracy.
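
As a toy illustration of the Experience-level experiments described above, the sketch below simulates attacker tool-usage sequences with a Markov chain; the states and transition probabilities are invented for illustration and are not the paper's model.

```python
# Simulate a tool-usage sequence from an assumed attacker profile (Markov chain).
import numpy as np

states = ["recon", "exploit", "persist", "exfiltrate"]
# Rows: current state; columns: next state. A hypothetical "novice" profile
# might jump to exploitation quickly rather than dwelling on reconnaissance.
novice = np.array([[0.2, 0.6, 0.1, 0.1],
                   [0.1, 0.3, 0.4, 0.2],
                   [0.1, 0.2, 0.4, 0.3],
                   [0.3, 0.2, 0.2, 0.3]])

def simulate(transition, steps=10, seed=0):
    rng = np.random.default_rng(seed)
    seq, state = [], 0
    for _ in range(steps):
        seq.append(states[state])
        state = rng.choice(len(states), p=transition[state])
    return seq

print(simulate(novice))
```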

2019-01-16
Aloui, M., Elbiaze, H., Glitho, R., Yangui, S..  2018.  Analytics as a service architecture for cloud-based CDN: Case of video popularity prediction. 2018 15th IEEE Annual Consumer Communications Networking Conference (CCNC). :1–4.
User Generated Videos (UGV) are the dominant content stored in scattered caches to meet end-user Content Delivery Network (CDN) requests with quality of service. End-user behaviour leads to highly variable UGV popularity. This aspect can be exploited to efficiently utilize the limited storage of the caches and improve the hit ratio of UGVs. In this paper, we propose a new architecture for Data Analytics in Cloud-based CDNs to derive UGV popularity online. This architecture uses RESTful web services to gather CDN logs, store them through generic collections in a NoSQL database, and compute the related popular UGVs in real time. It uses dynamic model training and prediction services to provide each CDN with the related popular videos to be cached based on the latest trained model. The proposed architecture is implemented with a k-means clustering prediction model, and the obtained results are 99.8% accurate.
Liao, F., Liang, M., Dong, Y., Pang, T., Hu, X., Zhu, J..  2018.  Defense Against Adversarial Attacks Using High-Level Representation Guided Denoiser. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. :1778–1787.
Neural networks are vulnerable to adversarial examples, which poses a threat to their application in security-sensitive systems. We propose the high-level representation guided denoiser (HGD) as a defense for image classification. A standard denoiser suffers from the error amplification effect, in which small residual adversarial noise is progressively amplified and leads to wrong classifications. HGD overcomes this problem by using a loss function defined as the difference between the target model's outputs activated by the clean image and the denoised image. Compared with ensemble adversarial training, which is the state-of-the-art defense method on large images, HGD has three advantages. First, with HGD as a defense, the target model is more robust to either white-box or black-box adversarial attacks. Second, HGD can be trained on a small subset of the images and generalizes well to other images and unseen classes. Third, HGD can be transferred to defend models other than the one guiding it. In the NIPS competition on defense against adversarial attacks, our HGD solution won first place and outperformed other models by a large margin.
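
A minimal PyTorch-style sketch of the guiding loss follows: the denoiser is optimized so that the fixed target model's high-level activations for the denoised image match those for the clean image, rather than matching pixels. Both networks below are trivial placeholders, not the paper's architectures.

```python
import torch
import torch.nn as nn

target_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # stand-in feature extractor
for p in target_model.parameters():
    p.requires_grad_(False)                 # the guiding model stays fixed

denoiser = nn.Conv2d(3, 3, 3, padding=1)    # stand-in denoiser that predicts the noise

clean = torch.rand(8, 3, 32, 32)
adversarial = clean + 0.03 * torch.randn_like(clean)

denoised = adversarial - denoiser(adversarial)
# High-level representation guided loss: distance between the target model's
# activations for the denoised and the clean images (not a pixel-level loss).
loss = (target_model(denoised) - target_model(clean)).abs().mean()
loss.backward()                              # gradients update only the denoiser
```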
2018-12-10
Ndichu, S., Ozawa, S., Misu, T., Okada, K..  2018.  A Machine Learning Approach to Malicious JavaScript Detection using Fixed Length Vector Representation. 2018 International Joint Conference on Neural Networks (IJCNN). :1–8.

To add more functionality and enhance the usability of web applications, JavaScript (JS) is frequently used. Despite the many advantages and usefulness of JS, an annoying fact is that many recent cyberattacks, such as drive-by-download attacks, exploit vulnerabilities in JS code. In general, malicious JS code is not easy to detect, because it sneakily exploits vulnerabilities in browsers and plugin software and attacks visitors of a web site unknowingly. To protect users from such threats, the development of an accurate detection system for malicious JS is needed. Conventional approaches often employ signature- and heuristic-based methods, which are prone to suffer from zero-day attacks, i.e., causing many false negatives and/or false positives. To address this problem, this paper adopts a machine-learning approach to feature learning called Doc2Vec, a neural network model that can learn the context information of texts. The extracted features are given to a classifier model (e.g., SVM or a neural network), which judges the maliciousness of a JS code. In the performance evaluation, we use the D3M Dataset (Drive-by-Download Data by Marionette) for malicious JS codes and JSUPACK for benign ones, for both training and test purposes. We then compare the performance to other feature learning methods. Our experimental results show that the proposed Doc2Vec features provide better accuracy and fast classification in malicious JS code detection compared to conventional approaches.
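
A hedged sketch of this pipeline: JS code treated as token sequences, embedded with gensim's Doc2Vec, then classified with an SVM. The toy "scripts" and labels below are placeholders, not the D3M or JSUPACK data.

```python
# Doc2Vec feature learning over tokenized JS, followed by an SVM classifier.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import SVC

scripts = [("var x = document . cookie ;", 1),                    # hypothetically malicious
           ("function add ( a , b ) { return a + b ; }", 0)]      # hypothetically benign
docs = [TaggedDocument(code.split(), [i]) for i, (code, _) in enumerate(scripts)]

d2v = Doc2Vec(docs, vector_size=32, min_count=1, epochs=50)
X = [d2v.infer_vector(code.split()) for code, _ in scripts]
y = [label for _, label in scripts]

clf = SVC().fit(X, y)
print(clf.predict([d2v.infer_vector("var y = document . cookie ;".split())]))
```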

2018-06-11
Cai, Y., Huang, H., Cai, H., Qi, Y..  2017.  A K-nearest neighbor locally search regression algorithm for short-term traffic flow forecasting. 2017 9th International Conference on Modelling, Identification and Control (ICMIC). :624–629.

Accurate short-term traffic flow forecasting is of great significance for real-time traffic control, guidance, and management. The k-nearest neighbor (k-NN) model is a classic data-driven method that is relatively effective yet simple to implement for short-term traffic flow forecasting. In the conventional prediction mechanism of the k-NN model, the k nearest neighbors' outputs, weighted by the similarities between the current traffic flow vector and historical traffic flow vectors, are used directly to generate prediction values, so the prediction results are often not ideal. It is observed that there are always some outliers in the k nearest neighbors' outputs, which may have a bad influence on the prediction value, and that the local similarities between the current traffic flow and historical traffic flows at the current sampling period should be more relevant to the prediction value. In this paper, we focus on improving the prediction mechanism of the k-NN model and propose a k-nearest neighbor locally search regression algorithm (k-LSR). The k-LSR algorithm uses a local search strategy to find optimal nearest neighbors' outputs and uses those outputs, weighted by local similarities, to forecast short-term traffic flow, so as to improve the prediction mechanism of the k-NN model. The proposed algorithm is tested on actual data and compared with other algorithms in performance. We use the root mean squared error (RMSE) as the evaluation indicator. The comparison results show that the k-LSR algorithm is more successful than the k-NN and the k-nearest neighbor locally weighted regression algorithm (k-LWR) in forecasting short-term traffic flow, which proves the superiority and practicability of the proposed algorithm.
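
For reference, here is a baseline sketch of the conventional k-NN prediction mechanism the paper starts from: find the k historical flow vectors closest to the current one and output a similarity-weighted average of their next-step values. The synthetic flow series is invented, and the k-LSR local-search refinement itself is not reproduced.

```python
# Conventional k-NN forecast: similarity-weighted average of neighbors' next values.
import numpy as np

def knn_forecast(history, current, k=5):
    # history: list of (lag_vector, next_value) pairs built from past flow data
    lags = np.array([h[0] for h in history])
    nxt = np.array([h[1] for h in history])
    dists = np.linalg.norm(lags - current, axis=1)
    idx = np.argsort(dists)[:k]
    weights = 1.0 / (dists[idx] + 1e-9)          # closer neighbors weigh more
    return float(np.dot(weights, nxt[idx]) / weights.sum())

flow = np.sin(np.linspace(0, 20, 200)) * 50 + 100          # synthetic flow series
history = [(flow[i:i + 4], flow[i + 4]) for i in range(len(flow) - 4)]
print(knn_forecast(history, flow[-4:], k=5))
```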

Zhang, X., Li, R., Zhao, W., Wu, R..  2017.  Detection of malicious nodes in NDN VANET for Interest Packet Popple Broadcast Diffusion Attack. 2017 11th IEEE International Conference on Anti-counterfeiting, Security, and Identification (ASID). :114–118.

As one of the next-generation network architectures, Named Data Networking (NDN), which features location-independent addressing and content caching, is well suited to being deployed in Vehicular Ad-hoc Networks (VANET). However, a new attack pattern arises when NDN and VANET are combined: the Interest Packet Popple Broadcast Diffusion Attack (PBDA). There are no existing strategies to mitigate PBDA. In this paper, a mitigation strategy called RVMS, based on node reputation value (RV), is proposed to detect malicious nodes. A node calculates a neighbor node's RV through direct and indirect RV evaluation and uses a Markov chain to predict the neighbor's current RV state from its historical RV. The RV state is used to decide whether to discard the interest packet. Finally, the effectiveness of RVMS is verified through modeling and experiment. The experimental results show that RVMS can mitigate PBDA.

2018-05-30
Howard, M., Pfeffer, A., Dalai, M., Reposa, M..  2017.  Predicting Signatures of Future Malware Variants. 2017 12th International Conference on Malicious and Unwanted Software (MALWARE). :126–132.
One of the challenges of malware defense is that the attacker has the advantage over the defender. In many cases, an attack is successful and causes damage before the defender can even begin to prepare a defense. The ability to anticipate attacks and prepare defenses before they occur would be a significant scientific and technological development with practical applications in cybersecurity. In this paper, we present a method to augment machine learning-based malware detection systems by predicting signatures of future malware variants and injecting these variants into the defensive system as a vaccine. Our method uses deep learning to learn patterns of malware evolution from family histories. These evolution patterns are then used to predict future family developments. Our experiments show that a detection system augmented with these future malware signatures is able to detect future malware variants that could not be detected by the detection system alone. In particular, it detected 11 new malware variants without increasing false positives, while providing up to 5 months of lead time between prediction and attack.
2018-05-24
Hassan, M., Hamada, M..  2017.  A Computational Model for Improving the Accuracy of Multi-Criteria Recommender Systems. 2017 IEEE 11th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip (MCSoC). :114–119.

Artificial neural networks are complex, biologically inspired algorithms made up of highly distributed, adaptive, and self-organizing structures, which makes them suitable for optimization problems. They consist of a group of interconnected nodes, similar to the great networks of neurons in the human brain. So far, artificial neural networks have not been applied to user modeling in multi-criteria recommender systems. This paper presents a neural network-based user modeling technique that exploits some of the characteristics of biological neurons to improve the accuracy of multi-criteria recommendations. The study was based upon the aggregation function approach, which computes the overall rating as a function of the criteria ratings. The proposed technique was evaluated using different evaluation metrics, and the empirical results of the experiments were compared with those of single rating-based collaborative filtering and two other similarity-based modeling approaches: the worst-case and the average similarity techniques. The results of the comparative analysis show that the proposed technique is more efficient than the two similarity-based techniques and the single rating collaborative filtering technique.

2018-05-09
Zeng, Y. G..  2017.  Identifying Email Threats Using Predictive Analysis. 2017 International Conference on Cyber Security And Protection Of Digital Services (Cyber Security). :1–2.

Malicious emails pose substantial threats to businesses. Whether it is a malware attachment or a URL leading to malware, exploitation, or phishing, attackers have been employing emails as an effective way to gain a foothold inside organizations of all kinds. To combat email threats, especially targeted attacks, traditional signature- and rule-based email filtering as well as advanced sandboxing technology both have their own weaknesses. In this paper, we propose a predictive analysis approach that learns the differences between legitimate and malicious emails through static analysis, creates a machine learning model, and makes detection and prediction on unseen emails effectively and efficiently. By comparing three different machine learning algorithms, our preliminary evaluation reveals that a Random Forests model performs the best.

2018-05-01
Benthall, S..  2017.  Assessing Software Supply Chain Risk Using Public Data. 2017 IEEE 28th Annual Software Technology Conference (STC). :1–5.

The software supply chain is a source of cybersecurity risk for many commercial and government organizations. Public data may be used to inform automated tools for detecting software supply chain risk during continuous integration and deployment. We link data from the National Vulnerability Database (NVD) with open version control data for the open source project OpenSSL, a widely used secure networking library that made the news when a significant vulnerability, Heartbleed, was discovered in 2014. We apply the Alhazmi-Malaiya Logistic (AML) model for software vulnerability discovery to this case. This model predicts a sigmoid cumulative vulnerability discovery function over time. Some versions of OpenSSL do not conform to the predictions of the model because they contain a temporary plateau in the cumulative vulnerability discovery plot. This temporary plateau feature is an empirical signature of a security failure mode that may be useful in future studies of software supply chain risk.
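
To make the model concrete, the sketch below fits the Alhazmi-Malaiya Logistic (AML) cumulative vulnerability-discovery function, Omega(t) = B / (B*C*exp(-A*B*t) + 1), to a cumulative count using scipy. The synthetic data and starting parameters are illustrative only and do not come from the OpenSSL case study.

```python
# Fit the AML sigmoid to a cumulative vulnerability-discovery curve.
import numpy as np
from scipy.optimize import curve_fit

def aml(t, A, B, C):
    return B / (B * C * np.exp(-A * B * t) + 1.0)

t = np.arange(1, 61)                                     # e.g. months since release
cumulative = aml(t, A=0.002, B=120, C=2.0) + np.random.default_rng(0).normal(0, 1, t.size)

params, _ = curve_fit(aml, t, cumulative, p0=[0.001, 100, 1.0], maxfev=10000)
print(dict(zip("ABC", params)))
# A plateau in the observed curve that the fitted sigmoid cannot reproduce is the
# kind of deviation the paper flags as a possible security failure signature.
```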

2018-04-04
Majumder, R., Som, S., Gupta, R..  2017.  Vulnerability prediction through self-learning model. 2017 International Conference on Infocom Technologies and Unmanned Systems (Trends and Future Directions) (ICTUS). :400–402.

Vulnerability, a buzzword of modern times, is one of the most important concerns for software and operating systems. Whenever software is developed, some loopholes and incompleteness remain from the development phase, so a vulnerability always persists that can surface at any time. Detecting a vulnerability is one thing; predicting its occurrence over the course of time is another. If we can learn the vulnerability of a piece of software over the course of time, it acts as an active alarm for the developers to build sounder, improved software the second time. The proposal discusses an implementation of this idea using an artificial neural network, where different data sets are given as input to be used for further analysis towards successful results. There are already models for studying vulnerabilities in software and networks; in addition to that work, this paper throws light on the predictability of vulnerabilities over the course of time.

2018-04-02
Hill, Z., Nichols, W. M., Papa, M., Hale, J. C., Hawrylak, P. J..  2017.  Verifying Attack Graphs through Simulation. 2017 Resilience Week (RWS). :64–67.

Verifying attacks against cyber physical systems can be a costly and time-consuming process. By using a simulated environment, attacks can be verified quickly and accurately. By combining the simulation of a cyber physical system with a hybrid attack graph, the effects of a series of exploits can be accurately analysed. Furthermore, the use of a simulated environment to verify attacks may uncover new information about the nature of the attacks.

2018-03-26
Movahedi, Y., Cukier, M., Andongabo, A., Gashi, I..  2017.  Cluster-Based Vulnerability Assessment Applied to Operating Systems. 2017 13th European Dependable Computing Conference (EDCC). :18–25.

Organizations face the issue of how to best allocate their security resources. Thus, they need an accurate method for assessing how many new vulnerabilities will be reported for the operating systems (OSs) they use in a given time period. Our approach consists of clustering vulnerabilities by leveraging the text information within vulnerability records, and then simulating the mean value function of vulnerabilities by relaxing the monotonic intensity function assumption, which is prevalent among the studies that use software reliability models (SRMs) and nonhomogeneous Poisson processes (NHPP) in modeling. We applied our approach to the vulnerabilities of four OSs: Windows, Mac, IOS, and Linux. For the OSs analyzed, in terms of curve fitting and prediction capability, our results are more accurate in all the cases we analyzed when compared to a power-law model without clustering drawn from a family of SRMs.
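
A rough sketch of the first stage described above, clustering vulnerability records by their text descriptions with TF-IDF features and k-means; the descriptions are invented, and the per-cluster discovery-rate modeling is not shown.

```python
# Cluster vulnerability records by their textual descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

descriptions = ["Buffer overflow in the kernel network driver allows remote code execution",
                "Use-after-free in the browser rendering engine leads to memory corruption",
                "Privilege escalation via crafted system call in the kernel scheduler"]

X = TfidfVectorizer(stop_words="english").fit_transform(descriptions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # cluster assignment per vulnerability record
```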

2018-03-19
Wang, A., Mohaisen, A., Chen, S..  2017.  An Adversary-Centric Behavior Modeling of DDoS Attacks. 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). :1126–1136.

Distributed Denial of Service (DDoS) attacks are some of the most persistent threats on the Internet today. The evolution of DDoS attacks calls for an in-depth analysis of those attacks. A better understanding of the attackers' behavior can provide insights to unveil patterns and strategies utilized by attackers. The prior art on attacker behavior analysis often falls short in two respects: it assumes that adversaries are static, and it makes certain simplifying assumptions about their behavior, which often are not supported by real attack data. In this paper, we take a data-driven approach to designing and validating three DDoS attack models from temporal (e.g., attack magnitudes), spatial (e.g., attacker origin), and spatiotemporal (e.g., attack inter-launching time) perspectives. We design these models based on the analysis of traces consisting of more than 50,000 verified DDoS attacks from industrial mitigation operations. Each model is also validated by testing its effectiveness in accurately predicting future DDoS attacks. Comparisons against simple intuitive models further show that our models can more accurately capture the essential features of DDoS attacks.

2018-02-14
Feng, C., Wu, S., Liu, N..  2017.  A user-centric machine learning framework for cyber security operations center. 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). :173–175.

To assure the cyber security of an enterprise, typically a SIEM (Security Information and Event Management) system is in place to normalize security events from different preventive technologies and flag alerts. Analysts in the security operation center (SOC) investigate the alerts to decide whether they are truly malicious or not. However, the number of alerts is generally overwhelming, with the majority of them being false positives, and exceeds the SOC's capacity to handle all alerts. Because of this, potential malicious attacks and compromised hosts may be missed. Machine learning is a viable approach to reduce the false positive rate and improve the productivity of SOC analysts. In this paper, we develop a user-centric machine learning framework for the cyber security operation center in a real enterprise environment. We discuss the typical data sources in a SOC, their work flow, and how to leverage and process these data sets to build an effective machine learning system. The paper is targeted towards two groups of readers. The first group is data scientists or machine learning researchers who do not have cyber security domain knowledge but want to build machine learning systems for security operations centers. The second group comprises cyber security practitioners who have deep knowledge and expertise in cyber security but do not have machine learning experience and wish to build such a system by themselves. Throughout the paper, we use the system we built in the Symantec SOC production environment as an example to demonstrate the complete steps from data collection, label creation, feature engineering, machine learning algorithm selection, and model performance evaluation, to risk score generation.

2018-01-23
Kilgallon, S., Rosa, L. De La, Cavazos, J..  2017.  Improving the effectiveness and efficiency of dynamic malware analysis with machine learning. 2017 Resilience Week (RWS). :30–36.

As the malware threat landscape is constantly evolving and over one million new malware strains are being generated every day [1], early automatic detection of threats constitutes a top priority of cybersecurity research, and amplifies the need for more advanced detection and classification methods that are effective and efficient. In this paper, we present the application of machine learning algorithms to predict the length of time malware should be executed in a sandbox to reveal its malicious intent. We also introduce a novel hybrid approach to malware classification based on static binary analysis and dynamic analysis of malware. Static analysis extracts information from a binary file without executing it, and dynamic analysis captures the behavior of malware in a sandbox environment. Our experimental results show that by turning the aforementioned problems into machine learning problems, it is possible to get an accuracy of up to 90% on the prediction of the malware analysis run time and up to 92% on the classification of malware families.