
SoS Newsletter- Advanced Book Block

Machine Learning

Machine learning offers potential efficiencies and is an important tool in data mining. However, the integrity of the "learned" or derived data must be maintained. Machine learning can also be used to identify threats and attacks. Research in this field is of particular interest in sensitive industries, including healthcare. The works cited here appeared in the first half of 2014.

  • Mozaffari Kermani, M.; Sur-Kolay, S.; Raghunathan, A.; Jha, N.K., "Systematic Poisoning Attacks on and Defenses for Machine Learning in Healthcare," IEEE Journal of Biomedical and Health Informatics, vol. PP, no. 99, pp. 1-1, July 2014. doi: 10.1109/JBHI.2014.2344095 Machine learning is being used in a wide range of application domains to discover patterns in large datasets. Increasingly, the results of machine learning drive critical decisions in applications related to healthcare and biomedicine. Such health-related applications are often sensitive and, thus, any security breach would be catastrophic. Naturally, the integrity of the results computed by machine learning is of great importance. Recent research has shown that some machine learning algorithms can be compromised by augmenting their training datasets with malicious data, leading to a new class of attacks called poisoning attacks. Hindrance of a diagnosis may have life-threatening consequences and could cause distrust. On the other hand, not only may a false diagnosis prompt users to distrust the machine learning algorithm and even abandon the entire system, but such a false positive classification may also cause patient distress. In this paper, we present a systematic, algorithm-independent approach for mounting poisoning attacks across a wide range of machine learning algorithms and healthcare datasets. The proposed attack procedure generates input data which, when added to the training set, can either cause the results of machine learning to have targeted errors (e.g., increase the likelihood of classification into a specific class), or simply introduce arbitrary errors (incorrect classification). These attacks may be applied to both fixed and evolving datasets. They can be applied even when only statistics of the training dataset are available or, in some cases, even without access to the training dataset, although at a lower efficacy.
We establish the effectiveness of the proposed attacks using a suite of six machine learning algorithms and five healthcare datasets. Finally, we present countermeasures against the proposed generic attacks that are based on tracking and detecting deviations in various accuracy metrics, and benchmark their effectiveness. Keywords: (not provided) (ID#:14-2388) URL:
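The poisoning mechanism described in the abstract above can be illustrated in miniature. The sketch below is a toy, not the paper's algorithm-independent procedure: it poisons a nearest-centroid classifier on synthetic 1-D data, showing how crafted training points shift a class centroid until a targeted sample flips label. All data, labels, and the classifier itself are illustrative assumptions.

```python
# Toy data-poisoning attack on a nearest-centroid classifier.
# Synthetic 1-D features; the class names are purely illustrative.

def centroid(points):
    return sum(points) / len(points)

def classify(x, train):
    # train: dict mapping label -> list of 1-D feature values
    return min(train, key=lambda lbl: abs(x - centroid(train[lbl])))

clean = {"healthy": [0.0, 1.0, 2.0], "diseased": [10.0, 11.0, 12.0]}
x = 4.0
before = classify(x, clean)  # nearest centroid is "healthy"

# Poisoning: augment the "healthy" class with crafted outliers so its
# centroid drifts toward the other class, flipping the target's label.
poisoned = {"healthy": clean["healthy"] + [30.0, 30.0, 30.0],
            "diseased": clean["diseased"]}
after = classify(x, poisoned)
print(before, after)  # healthy diseased
```

The countermeasure direction the paper proposes, tracking deviations in accuracy metrics, corresponds here to noticing that the poisoned centroid has moved far from its historical value.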
  • Baughman, A.K.; Chuang, W.; Dixon, K.R.; Benz, Z.; Basilico, J., "DeepQA Jeopardy! Gamification: A Machine-Learning Perspective," IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 1, pp. 55-66, March 2014. doi: 10.1109/TCIAIG.2013.2285651 DeepQA is a large-scale natural language processing (NLP) question-and-answer system that responds across a breadth of structured and unstructured data, from hundreds of analytics that are combined with over 50 models, trained through machine learning. After the 2011 historic milestone of defeating the two best human players in the Jeopardy! game show, the technology behind IBM Watson, DeepQA, is undergoing gamification into real-world business problems. Gamifying a business domain for Watson is a composite of functional, content, and training adaptation for nongame play. During domain gamification for medical, financial, government, or any other business, each system change affects the machine-learning process. As opposed to the original Watson Jeopardy!, whose class distribution of positive-to-negative labels is 1:100, in adaptation the computed training instances, question-and-answer pairs transformed into true-false labels, result in a very low positive-to-negative ratio of 1:100 000. Such initial extreme class imbalance during domain gamification poses a big challenge for the Watson machine-learning pipelines. The combination of ingested corpus sets, question-and-answer pairs, configuration settings, and NLP algorithms contribute toward the challenging data state. We propose several data engineering techniques, such as answer key vetting and expansion, source ingestion, oversampling classes, and question set modifications to increase the computed true labels. In addition, algorithm engineering, such as an implementation of the Newton-Raphson logistic regression with a regularization term, relaxes the constraints of class imbalance during training adaptation.
We conclude by empirically demonstrating that data and algorithm engineering are complementary and indispensable to overcome the challenges in this first Watson gamification for real-world business problems. Keywords: business data processing; computer games; learning (artificial intelligence); natural language processing; question answering (information retrieval); text analysis; DeepQA Jeopardy! gamification; NLP algorithms; NLP question-and-answer system; Newton-Raphson logistic regression; Watson gamification; Watson machine-learning pipelines; algorithm engineering; business domain; configuration settings; data engineering techniques; domain gamification; extreme class imbalance; ingested corpus sets; large-scale natural language processing question-and-answer system; machine-learning process; nongame play; positive-to-negative ratio; question-and-answer pairs; real-world business problems; regularization term; structured data; training instances; true-false labels; unstructured data; Accuracy; Games; Logistics; Machine learning algorithms; Pipelines; Training; Gamification; machine learning; natural language processing (NLP); pattern recognition (ID#:14-2389) URL:
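The algorithm-engineering step named in the abstract above, Newton-Raphson logistic regression with a regularization term, can be sketched for a toy 1-D, bias-free case. The data, the regularization strength, and the iteration count are illustrative assumptions, not values from the paper; the point is that the L2 term keeps the Newton updates stable even when the classes are highly imbalanced and separable.

```python
import math

# Newton-Raphson logistic regression with an L2 regularization term,
# on a deliberately imbalanced synthetic training set (1 positive, 20 negatives).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, lam=1.0, iters=25):
    w = 0.0
    for _ in range(iters):
        # gradient and Hessian of regularized negative log-likelihood
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) + lam * w
        hess = sum(sigmoid(w * x) * (1 - sigmoid(w * x)) * x * x for x in xs) + lam
        w -= grad / hess  # Newton step
    return w

xs = [2.0] + [-1.0] * 20   # one rare "true" instance among many "false" ones
ys = [1] + [0] * 20
w = fit(xs, ys)
print(w, sigmoid(w * 2.0) > 0.5)
```

Without the `lam * w` term, the weight on this linearly separable set would grow without bound; the regularizer yields a finite weight that still separates the rare positive from the negatives.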
  • Stevanovic, M.; Pedersen, J.M., "An Efficient Flow-Based Botnet Detection Using Supervised Machine Learning," Computing, Networking and Communications (ICNC), 2014 International Conference on, pp. 797-801, 3-6 Feb. 2014. doi: 10.1109/ICCNC.2014.6785439 Botnet detection represents one of the most crucial prerequisites of successful botnet neutralization. This paper explores how accurate and timely detection can be achieved by using supervised machine learning as the tool of inferring about malicious botnet traffic. In order to do so, the paper introduces a novel flow-based detection system that relies on supervised machine learning for identifying botnet network traffic. For use in the system we consider eight highly regarded machine learning algorithms, indicating the best performing one. Furthermore, the paper evaluates how much traffic needs to be observed per flow in order to capture the patterns of malicious traffic. The proposed system has been tested through a series of experiments using traffic traces originating from two well-known P2P botnets and diverse non-malicious applications. The results of the experiments indicate that the system is able to detect botnet traffic accurately and in a timely manner using purely flow-based traffic analysis and supervised machine learning. Additionally, the results show that in order to achieve accurate detection, traffic flows need to be monitored for only a limited time period and number of packets per flow. This indicates a strong potential of using the proposed approach within a future on-line detection framework.
Keywords: computer network security; invasive software; learning (artificial intelligence); peer-to-peer computing; telecommunication traffic; P2P botnets; botnet neutralization; flow-based botnet detection; flow-based traffic analysis; malicious botnet network traffic identification; nonmalicious applications; packet flow; supervised machine learning; Accuracy; Bayes methods; Feature extraction; Protocols; Support vector machines; Training; Vegetation; Botnet; Botnet detection; Machine learning; Traffic analysis; Traffic classification (ID#:14-2390) URL:
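The core idea in the entry above, summarizing only a limited prefix of each flow into features and then classifying, can be sketched as follows. The specific features (packet count, mean and variance of packet size), the packet traces, and the 1-nearest-neighbour stand-in for the paper's eight evaluated classifiers are all illustrative assumptions.

```python
# Flow-based feature extraction plus a simple supervised classifier,
# as a stand-in for the paper's flow-based botnet detection pipeline.

def flow_features(packet_sizes, n=10):
    head = packet_sizes[:n]  # observe only the first n packets of the flow
    mean = sum(head) / len(head)
    var = sum((s - mean) ** 2 for s in head) / len(head)
    return (len(head), mean, var)

def classify(feat, labelled):
    # labelled: list of (feature_vector, label) training examples; 1-NN rule
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(labelled, key=lambda ex: dist(feat, ex[0]))[1]

train = [
    (flow_features([60, 60, 62, 61, 60, 60]), "botnet"),             # small, regular C&C packets
    (flow_features([1500, 1400, 1500, 900, 1500, 1200]), "benign"),  # bulk transfer
]
print(classify(flow_features([61, 60, 60, 62, 60, 59]), train))  # botnet
```

Capping the prefix at `n` packets mirrors the paper's finding that only a limited number of packets per flow needs to be monitored for accurate detection.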
  • Aroussi, S.; Mellouk, A., "Survey on Machine Learning-Based QoE-QoS Correlation Models," Computing, Management and Telecommunications (ComManTel), 2014 International Conference on, pp. 200-204, 27-29 April 2014. doi: 10.1109/ComManTel.2014.6825604 Machine learning provides a theoretical and methodological framework to quantify the relationship between user QoE (Quality of Experience) and network QoS (Quality of Service). This paper presents an overview of QoE-QoS correlation models based on machine learning techniques. According to the learning type, we propose a categorization of correlation models. For each category, we review the main existing works by citing deployed learning methods and model parameters (QoE measurement, QoS parameters and service type). Moreover, the survey will provide researchers with the latest trends and findings in this field. Keywords: learning (artificial intelligence); quality of experience; quality of service; telecommunication computing; QoE measurement; QoE-QoS correlation model; QoS parameter; QoS service type; machine learning; quality of experience; quality of service; Correlation; Data models; Packet loss; Predictive models; Quality of service; Streaming media; Correlation model; Machine Learning; Quality of Experience; Quality of Service (ID#:14-2391) URL:
  • Alsheikh, M.A.; Lin, S.; Niyato, D.; Tan, Hwee-Pink, "Machine Learning in Wireless Sensor Networks: Algorithms, Strategies, and Applications," IEEE Communications Surveys & Tutorials, vol. PP, no. 99, pp. 1-1, April 2014. doi: 10.1109/COMST.2014.2320099 Wireless sensor networks monitor dynamic environments that change rapidly over time. This dynamic behavior is either caused by external factors or initiated by the system designers themselves. To adapt to such conditions, sensor networks often adopt machine learning techniques to eliminate the need for unnecessary redesign. Machine learning also inspires many practical solutions that maximize resource utilization and prolong the lifespan of the network. In this paper, we present an extensive literature review over the period 2002-2013 of machine learning methods that were used to address common issues in wireless sensor networks (WSNs). The advantages and disadvantages of each proposed algorithm are evaluated against the corresponding problem. We also provide a comparative guide to aid WSN designers in developing suitable machine learning solutions for their specific application challenges. Keywords: Algorithm design and analysis; Classification algorithms; Clustering algorithms; Machine learning algorithms; Principal component analysis; Routing; Wireless sensor networks (ID#:14-2392) URL:
  • Fangming Ye; Zhaobo Zhang; Chakrabarty, K.; Xinli Gu, "Board-Level Functional Fault Diagnosis Using Multikernel Support Vector Machines and Incremental Learning," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 2, pp. 279-290, Feb. 2014. doi: 10.1109/TCAD.2013.2287184 Advanced machine learning techniques offer an unprecedented opportunity to increase the accuracy of board-level functional fault diagnosis and reduce product cost through successful repair. Ambiguous or incorrect diagnosis results lead to long debug times and even wrong repair actions, which significantly increase repair cost. We propose a smart diagnosis method based on multikernel support vector machines (MK-SVMs) and incremental learning. The MK-SVM method leverages a linear combination of single kernels to achieve accurate faulty-component classification based on the errors observed. The MK-SVMs thus generated can also be updated based on incremental learning, which allows the diagnosis system to quickly adapt to new error observations and provide even more accurate fault diagnosis. Two complex boards from industry, currently in volume production, are used to validate the proposed diagnosis approach in terms of diagnosis accuracy (success rate) and quantifiable improvements over previously proposed machine-learning methods based on several single-kernel SVMs and artificial neural networks.
Keywords: electronic engineering computing; fault diagnosis; learning (artificial intelligence); neural nets; printed circuit testing; support vector machines; MK-SVM method; advanced machine learning technique; artificial neural network; board level functional fault diagnosis; faulty component classification; linear combination; multikernel support vector machine; smart diagnosis method; Accuracy; Circuit faults; Fault diagnosis; Kernel; Maintenance engineering; Support vector machines; Training; Board-level fault diagnosis; functional failures; incremental learning; kernel; machine learning; support-vector machines (ID#:14-2393) URL:
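The multikernel idea named in the entry above, a linear combination of single kernels, can be sketched directly. In this toy the combination weights are fixed by hand (an MK-SVM learns them), and a kernel perceptron stands in for the SVM solver; the data and kernel parameters are illustrative assumptions.

```python
import math

# Combining base kernels with a linear weighting, then using the combined
# kernel inside a simple kernel classifier (a perceptron stand-in for an SVM).

def k_linear(x, y):
    return sum(a * b for a, b in zip(x, y))

def k_rbf(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def k_multi(x, y, betas=(0.3, 0.7)):
    # linear combination of single kernels; MK-SVM would learn the betas
    return betas[0] * k_linear(x, y) + betas[1] * k_rbf(x, y)

def train(data, epochs=10):
    alphas = [0.0] * len(data)
    for _ in range(epochs):
        for i, (xi, yi) in enumerate(data):
            pred = sum(a * yj * k_multi(xj, xi)
                       for a, (xj, yj) in zip(alphas, data))
            if yi * pred <= 0:       # misclassified: bump this example's weight
                alphas[i] += 1.0
    return alphas

data = [((0.0, 0.0), -1), ((0.2, 0.1), -1), ((2.0, 2.0), 1), ((1.8, 2.2), 1)]
alphas = train(data)
score = sum(a * y * k_multi(x, (2.1, 1.9)) for a, (x, y) in zip(alphas, data))
print(score > 0)  # the query point near the positive cluster scores positive
```

Because the decision function only touches the data through `k_multi`, swapping in different base kernels or weights requires no change to the classifier itself, which is what makes the combination learnable.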
  • Breuker, D., "Towards Model-Driven Engineering for Big Data Analytics -- An Exploratory Analysis of Domain-Specific Languages for Machine Learning," System Sciences (HICSS), 2014 47th Hawaii International Conference on, pp. 758-767, 6-9 Jan. 2014. doi: 10.1109/HICSS.2014.101 Graphical models and general purpose inference algorithms are powerful tools for moving from imperative towards declarative specification of machine learning problems. Although graphical models define the principle information necessary to adapt inference algorithms to specific probabilistic models, entirely model-driven development is not yet possible. However, generating executable code from graphical models could have several advantages. It could reduce the skills necessary to implement probabilistic models and may speed up development processes. Both advantages address pressing industry needs. They come along with increased supply of data scientist labor, the demand of which cannot be fulfilled at the moment. To explore the opportunities of model-driven big data analytics, I review the main modeling languages used in machine learning as well as inference algorithms and corresponding software implementations. Gaps hampering direct code generation from graphical models are identified and closed by proposing an initial conceptualization of a domain-specific modeling language.
Keywords: Big Data; computer graphics; data analysis; inference mechanisms; learning (artificial intelligence); program compilers; specification languages; big data analytics; direct code generation; domain-specific languages; domain-specific modeling language; general purpose inference algorithms; graphical models; machine learning problems; model-driven development; model-driven engineering; modeling languages; probabilistic models; Adaptation models; Computational modeling; Data models; Graphical models; Inference algorithms; Random variables; Unified modeling language; Graphical Models; Machine Learning; Model-driven Engineering (ID#:14-2394) URL:
  • Aydogan, E.; Sen, S., "Analysis of Machine Learning Methods On Malware Detection," Signal Processing and Communications Applications Conference (SIU), 2014 22nd, pp. 2066-2069, 23-25 April 2014. doi: 10.1109/SIU.2014.6830667 Nowadays, one of the most important security threats is new, unseen malicious executables. Current anti-virus systems have been fairly successful against malicious software whose signatures are known. However, they are very ineffective against new, unseen malicious software. In this paper, we aim to detect new, unseen malicious executables using machine learning techniques. We extract distinguishing structural features of software and employ machine learning techniques in order to detect malicious executables. Keywords: invasive software; learning (artificial intelligence); anti-virus systems; machine learning methods; malicious executables detection; malicious software; malware detection; security threats; software structural features; Conferences; Internet; Malware; Niobium; Signal processing; Software; machine learning; malware analysis and detection (ID#:14-2395) URL:
  • Kandasamy, K.; Koroth, P., "An Integrated Approach To Spam Classification On Twitter Using URL Analysis, Natural Language Processing And Machine Learning Techniques," Electrical, Electronics and Computer Science (SCEECS), 2014 IEEE Students' Conference on, pp. 1-5, 1-2 March 2014. doi: 10.1109/SCEECS.2014.6804508 In the present-day world, people are deeply habituated to social networks. Because of this, it is very easy to spread spam content through them, and the details of any person are easily accessible through these sites; no one is safe inside social media. In this paper we propose an application that uses an integrated approach to spam classification in Twitter. The integrated approach comprises the use of URL analysis, natural language processing and supervised machine learning techniques. In short, this is a three-step process. Keywords: classification; learning (artificial intelligence); natural language processing; social networking (online); unsolicited e-mail; Twitter; URL analysis; natural language processing; social media; social networks; spam classification; spam contents; supervised machine learning techniques; Accuracy; Machine learning algorithms; Natural language processing; Training; Twitter; Unsolicited electronic mail; URLs; machine learning; natural language processing; tweets (ID#:14-2396) URL:
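The three-step structure named in the entry above (URL analysis, NLP on the tweet text, then a learned classifier) can be sketched as a toy pipeline. The shortener blacklist, the spam lexicon, and the fixed threshold standing in for the trained model are all illustrative assumptions.

```python
import re

# Toy three-step Twitter spam pipeline: URL analysis, text analysis,
# and a scoring rule standing in for a supervised classifier.

SHORTENER_DOMAINS = {"bit.ly", "tinyurl.com"}   # step 1: suspicious URL hosts
SPAM_TERMS = {"free", "winner", "click"}        # step 2: NLP-derived lexicon

def url_score(tweet):
    hosts = re.findall(r"https?://([^/\s]+)", tweet)
    return sum(1 for host in hosts if host in SHORTENER_DOMAINS)

def text_score(tweet):
    tokens = re.findall(r"[a-z]+", tweet.lower())
    return sum(1 for t in tokens if t in SPAM_TERMS)

def is_spam(tweet, threshold=2):
    # Step 3 stand-in: a trained model would weight these features
    # instead of using a fixed threshold.
    return url_score(tweet) + text_score(tweet) >= threshold

print(is_spam("You are a winner! Click http://bit.ly/x"))     # True
print(is_spam("Reading a great paper on machine learning."))  # False
```

The value of the integrated approach is that each step contributes an independent signal, so a spam tweet that evades the lexicon can still be caught by its URL, and vice versa.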
  • Singh, N.; Chandra, N., "Integrating Machine Learning Techniques to Constitute a Hybrid Security System," Communication Systems and Network Technologies (CSNT), 2014 Fourth International Conference on, pp. 1082-1087, 7-9 April 2014. doi: 10.1109/CSNT.2014.221 Computer security has been discussed and improved in many forms, using different techniques as well as technologies. Enhancements keep being added, as security remains the fastest-updating unit in a computer system. In this paper we propose a model for securing the system along with the network, and enhance it further by applying the machine learning techniques SVM (support vector machine) and ANN (artificial neural network). The two techniques are used together to generate results appropriate for analysis purposes and thus prove to be a milestone for security. Keywords: learning (artificial intelligence); neural nets; security of data; support vector machines; ANN; SVM; artificial neural network; computer security; hybrid security system; machine learning techniques; support vector machine; Artificial neural networks; Intrusion detection; Neurons; Probabilistic logic; Support vector machines; Training; Artificial neural network; Host logs; Machine Learning; Network logs; Support vector machine (ID#:14-2397) URL:
  • Asmitha, K.A.; Vinod, P., "A Machine Learning Approach For Linux Malware Detection," Issues and Challenges in Intelligent Computing Techniques (ICICT), 2014 International Conference on, pp. 825-830, 7-8 Feb. 2014. doi: 10.1109/ICICICT.2014.6781387 The increasing number of malware samples is becoming a serious threat to private data as well as to expensive computer resources. Linux is a Unix-based operating system that has gained popularity in recent years. Malware attacks targeting Linux have increased recently, and existing malware detection methods are insufficient to detect malware efficiently. We introduce a novel approach using machine learning for identifying malicious Executable and Linkable Format (ELF) files. The system calls are extracted dynamically using the system call tracer strace. In this approach we identify the best feature set of benign and malware specimens to build a classification model that can classify malware and benign samples efficiently. The experimental results are promising, depicting a classification accuracy of 97% in identifying malicious samples. Keywords: Linux; invasive software; learning (artificial intelligence); pattern classification; Linux malware detection; Unix-based machine; benign specimens; classification model; machine learning approach; malicious executable linkable files identification; malware specimens; system call tracer strace; Accuracy; Malware; Testing; dynamic analysis; feature selection; system call (ID#:14-2398) URL:
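The first stage of the pipeline described in the entry above, turning strace output into feature vectors for a classifier, can be sketched as follows. The trace strings and the set of monitored system calls are illustrative assumptions; the real system traces live binaries and feeds the vectors to a trained model rather than printing them.

```python
from collections import Counter

# Turning strace-style output into system-call frequency features,
# the dynamic-analysis front end of a malware classifier.

MONITORED = ["open", "read", "write", "fork", "ptrace", "connect"]

def syscall_features(trace_lines):
    # trace_lines: strace output, e.g. 'open("/etc/passwd", O_RDONLY) = 3';
    # the call name is everything before the first '('
    names = Counter(line.split("(", 1)[0] for line in trace_lines)
    return [names[s] for s in MONITORED]

benign_trace = ['open("/etc/ld.so.cache", O_RDONLY) = 3',
                'read(3, "...", 4096) = 4096',
                'write(1, "hello", 5) = 5']
suspect_trace = ['ptrace(PTRACE_TRACEME, 0, 0, 0) = 0',
                 'fork() = 1234',
                 'connect(4, {...}, 16) = 0']

print(syscall_features(benign_trace))   # [1, 1, 1, 0, 0, 0]
print(syscall_features(suspect_trace))  # [0, 0, 0, 1, 1, 1]
```

Feature selection, which the paper's keywords highlight, then amounts to choosing which entries of `MONITORED` best separate benign from malicious specimens.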
  • Esmalifalak, M.; Liu, L.; Nguyen, N.; Zheng, R.; Han, Z., "Detecting Stealthy False Data Injection Using Machine Learning in Smart Grid," IEEE Systems Journal, vol. PP, no. 99, pp. 1-9, August 2014. doi: 10.1109/JSYST.2014.2341597 Aging power industries, together with the increase in demand from industrial and residential customers, are the main incentive for policy makers to define a road map to the next-generation power system called the smart grid. In the smart grid, the overall monitoring costs will be decreased, but at the same time, the risk of cyber attacks might be increased. Recently, a new type of attack, called the stealth attack, has been introduced, which cannot be detected by traditional bad data detection using state estimation. In this paper, we show how normal operations of power networks can be statistically distinguished from the case under stealthy attacks. We propose two machine-learning-based techniques for stealthy attack detection. The first method utilizes supervised learning over labeled data and trains a distributed support vector machine (SVM). The design of the distributed SVM is based on the alternating direction method of multipliers, which offers provable optimality and convergence rate. The second method requires no training data and detects the deviation in measurements. In both methods, principal component analysis is used to reduce the dimensionality of the data to be processed, which leads to lower computation complexities. The results of the proposed detection methods on IEEE standard test systems demonstrate the effectiveness of both schemes. Keywords: Anomaly detection; bad data detection (BDD); power system state estimation; support vector machines (SVMs) (ID#:14-2399) URL:
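The second, training-free scheme described in the entry above, using principal component analysis and flagging measurements that deviate, can be sketched in two dimensions. The synthetic measurement pairs and the residual thresholds are illustrative assumptions; real state estimation works in far higher dimension.

```python
import math

# Training-free anomaly detection: fit the principal axis of normal
# measurements, then flag samples with a large off-axis residual.

def principal_axis(samples):
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    a = sum((x - mx) ** 2 for x, _ in samples) / n   # covariance entries
    c = sum((y - my) ** 2 for _, y in samples) / n
    b = sum((x - mx) * (y - my) for x, y in samples) / n
    lam = (a + c + math.sqrt((a - c) ** 2 + 4 * b * b)) / 2  # leading eigenvalue
    vx, vy = lam - c, b                                      # its eigenvector
    norm = math.hypot(vx, vy)
    return mx, my, vx / norm, vy / norm

def residual(sample, axis):
    mx, my, vx, vy = axis
    dx, dy = sample[0] - mx, sample[1] - my
    proj = dx * vx + dy * vy
    return math.hypot(dx - proj * vx, dy - proj * vy)  # distance off the axis

# Normal operation: two measurements that are tightly correlated.
normal = [(i, 2.0 * i + 0.1 * ((-1) ** i)) for i in range(10)]
axis = principal_axis(normal)
print(residual((5.0, 10.0), axis) < 0.5)   # consistent with the normal pattern
print(residual((5.0, 4.0), axis) > 1.0)    # injected data falls off the axis
```

Projecting onto the leading components and thresholding the residual is also how PCA lowers the computational cost in both of the paper's methods: the classifier only sees the low-dimensional projection.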


Articles listed on these pages have been found on publicly available internet pages and are cited with links to those pages. Some of the information included herein has been reprinted with permission from the authors or data repositories. Direct any requests via Email to SoS.Project (at) for removal of the links or modifications to specific citations. Please include the ID# of the specific citation in your correspondence.