Prioritized active learning for malicious URL detection using weighted text-based features

Title: Prioritized active learning for malicious URL detection using weighted text-based features
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Bhattacharjee, S. Das, Talukder, A., Al-Shaer, E., Doshi, P.
Conference Name: 2017 IEEE International Conference on Intelligence and Security Informatics (ISI)
Date Published: July
ISBN Number: 978-1-5090-6727-5
Keywords: active learning, batch learning framework, classification performance, Collaboration, Computer crime, computer security, cyber-security, cyber-security scenario, data analytics, data annotations, data-driven analytics, feature weight update technique, ground-truth labels, Human Behavior, human-machine collaborative approach, learning (artificial intelligence), machine learning, Malicious threat detection, malicious URL detection, Man-machine systems, Mutual information, natural language processing, pattern classification, phishing, phishing categorization, PhishMonger's Targeted Brand dataset, prioritized active learning, pubcrawl, Resiliency, Scalability, security analytics, supervised security analytics task, text analysis, text analytics, Training, Uniform resource locators, unlabelled data, weighted text-based features

Data analytics is increasingly used in cyber-security problems and has been found useful where data volume and heterogeneity make manual assessment by security experts cumbersome. In practical cyber-security scenarios involving data-driven analytics, obtaining annotated data (i.e. ground-truth labels) is a challenging and well-known limiting factor for many supervised security analytics tasks. Significant portions of large datasets typically remain unlabelled, as annotation is largely manual and requires a huge amount of expert intervention. In this paper, we propose an effective active learning approach that efficiently addresses this limitation in a practical cyber-security problem, phishing categorization, using a human-machine collaborative approach to design a semi-supervised solution. An initial classifier is learnt on a small amount of annotated data and is then updated iteratively by shortlisting, from the large pool of unlabelled data, only those samples that are most likely to influence classifier performance quickly. Prioritized Active Learning shows significant promise for achieving faster convergence of classification performance in a batch learning framework, thus requiring even less human annotation effort. A useful feature weight update technique combined with active learning shows promising classification performance for categorizing phishing/malicious URLs without requiring a large number of annotated training samples. In experiments with several collections of PhishMonger's Targeted Brand dataset, the proposed method shows significant improvement over the baseline by as much as 12%.
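
The iterative loop described in the abstract can be illustrated with a minimal sketch of generic pool-based active learning with batch uncertainty sampling. This is not the paper's prioritization or feature weight update technique; the token featurization, the small Naive Bayes scorer, and the names tokenize, train, prob_phishing, select_batch, active_learn and oracle are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (assumed featurization)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def train(labeled):
    """Fit a tiny Naive Bayes model on (url, label) pairs; label 1 = phishing."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for url, label in labeled:
        counts[label].update(tokenize(url))
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab


def prob_phishing(model, url):
    """Return the model's estimate of P(label = 1 | url), add-one smoothed."""
    counts, docs, vocab = model
    n1 = sum(counts[1].values()) + len(vocab)
    n0 = sum(counts[0].values()) + len(vocab)
    log_odds = math.log((docs[1] + 1) / (docs[0] + 1))
    for tok in tokenize(url):
        if tok in vocab:  # tokens never seen in training carry no evidence
            log_odds += math.log((counts[1][tok] + 1) / n1)
            log_odds -= math.log((counts[0][tok] + 1) / n0)
    log_odds = max(-30.0, min(30.0, log_odds))  # keep exp() in a safe range
    return 1.0 / (1.0 + math.exp(-log_odds))


def select_batch(model, pool, k):
    """Pick the k pool URLs whose predicted probability is closest to 0.5."""
    return sorted(pool, key=lambda u: abs(prob_phishing(model, u) - 0.5))[:k]


def active_learn(seed, pool, oracle, batch_size=2, rounds=3):
    """Retrain, query the most uncertain batch, and repeat.

    `oracle` is a hypothetical stand-in for the human annotator: it takes a
    URL and returns its ground-truth label (0 or 1).
    """
    labeled, pool = list(seed), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        model = train(labeled)
        batch = select_batch(model, pool, batch_size)
        labeled += [(u, oracle(u)) for u in batch]
        pool = [u for u in pool if u not in batch]
    return train(labeled), labeled, pool
```

Only the URLs returned by select_batch are sent to the annotator in each round, which is where the saving in labelling effort comes from; a prioritized variant would replace the distance-from-0.5 ranking with its own scoring rule.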

Citation Key: bhattacharjee_prioritized_2017