Visible to the public Biblio

Filters: Keyword is dataset  [Clear All Filters]
Conference Paper
Ferenc, Rudolf, Heged\H us, Péter, Gyimesi, Péter, Antal, Gábor, Bán, Dénes, Gyimóthy, Tibor.  2019.  Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions. 2019 IEEE/ACM 7th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE). :8-14.

The rapid rise of cyber-crime activities and the growing number of devices threatened by them place software security issues in the spotlight. As around 90% of all attacks exploit known types of security issues, finding vulnerable components and applying existing mitigation techniques is a viable practical approach for fighting against cyber-crime. In this paper, we investigate how the state-of-the-art machine learning techniques, including a popular deep learning algorithm, perform in predicting functions with possible security vulnerabilities in JavaScript programs. We applied 8 machine learning algorithms to build prediction models using a new dataset constructed for this research from the vulnerability information in public databases of the Node Security Project and the Snyk platform, and code fixing patches from GitHub. We used static source code metrics as predictors and an extensive grid-search algorithm to find the best performing models. We also examined the effect of various re-sampling strategies to handle the imbalanced nature of the dataset. The best performing algorithm was KNN, which created a model for the prediction of vulnerable functions with an F-measure of 0.76 (0.91 precision and 0.66 recall). Moreover, deep learning, tree and forest based classifiers, and SVM were competitive with F-measures over 0.70. Although the F-measures did not vary significantly with the re-sampling strategies, the distribution of precision and recall did change. No re-sampling seemed to produce models preferring high precision, while re-sampling strategies balanced the IR measures.

Moustafa, Nour, Ahmed, Mohiuddin, Ahmed, Sherif.  2020.  Data Analytics-Enabled Intrusion Detection: Evaluations of ToNİoT Linux Datasets. 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). :727–735.
With the widespread of Artificial Intelligence (AI)-enabled security applications, there is a need for collecting heterogeneous and scalable data sources for effectively evaluating the performances of security applications. This paper presents the description of new datasets, named ToNİoT datasets that include distributed data sources collected from Telemetry datasets of Internet of Things (IoT) services, Operating systems datasets of Windows and Linux, and datasets of Network traffic. The paper aims to describe the new testbed architecture used to collect Linux datasets from audit traces of hard disk, memory and process. The architecture was designed in three distributed layers of edge, fog, and cloud. The edge layer comprises IoT and network systems, the fog layer includes virtual machines and gateways, and the cloud layer includes data analytics and visualization tools connected with the other two layers. The layers were programmatically controlled using Software-Defined Network (SDN) and Network-Function Virtualization (NFV) using the VMware NSX and vCloud NFV platform. The Linux ToNİoT datasets would be used to train and validate various new federated and distributed AI-enabled security solutions such as intrusion detection, threat intelligence, privacy preservation and digital forensics. Various Data analytical and machine learning methods are employed to determine the fidelity of the datasets in terms of examining feature engineering, statistics of legitimate and security events, and reliability of security events. The datasets can be publicly accessed from [1].
Summers, Cameron, Tronel, Greg, Cramer, Jason, Vartakavi, Aneesh, Popp, Phillip.  2016.  GNMID14: A Collection of 110 Million Global Music Identification Matches. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. :693–696.

A new dataset is presented composed of music identification matches from Gracenote, a leading global music metadata company. Matches from January 1, 2014 to December 31, 2014 have been curated and made available as a public dataset called Gracenote Music Identification 2014, or GNMID14, at the following address: This collection is the first significant music identification dataset and one of the largest music related datasets available containing more than 110M matches in 224 countries for 3M unique tracks, and 509K unique artists. It features geotemporal information (i.e. country and match date), genre and mood metadata. In this paper, we characterize the dataset and demonstrate its utility for Information Retrieval (IR) research.

Shahbar, K., Zincir-Heywood, A. N..  2018.  How Far Can We Push Flow Analysis to Identify Encrypted Anonymity Network Traffic? NOMS 2018 - 2018 IEEE/IFIP Network Operations and Management Symposium. :1–6.

Anonymity networks provide privacy to the users by relaying their data to multiple destinations in order to reach the final destination anonymously. Multilayer of encryption is used to protect the users' privacy from attacks or even from the operators of the stations. In this research, we showed how flow analysis could be used to identify encrypted anonymity network traffic under four scenarios: (i) Identifying anonymity networks compared to normal background traffic; (ii) Identifying the type of applications used on the anonymity networks; (iii) Identifying traffic flow behaviors of the anonymity network users; and (iv) Identifying / profiling the users on an anonymity network based on the traffic flow behavior. In order to study these, we employ a machine learning based flow analysis approach and explore how far we can push such an approach.

Coulter, Rory, Zhang, Jun, Pan, Lei, Xiang, Yang.  2020.  Unmasking Windows Advanced Persistent Threat Execution. 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). :268—276.

The advanced persistent threat (APT) landscape has been studied without quantifiable data, for which indicators of compromise (IoC) may be uniformly analyzed, replicated, or used to support security mechanisms. This work culminates extensive academic and industry APT analysis, not as an incremental step in existing approaches to APT detection, but as a new benchmark of APT related opportunity. We collect 15,259 APT IoC hashes, retrieving subsequent sandbox execution logs across 41 different file types. This work forms an initial focus on Windows-based threat detection. We present a novel Windows APT executable (APT-EXE) dataset, made available to the research community. Manual and statistical analysis of the APT-EXE dataset is conducted, along with supporting feature analysis. We draw upon repeat and common APT paths access, file types, and operations within the APT-EXE dataset to generalize APT execution footprints. A baseline case analysis successfully identifies a majority of 117 of 152 live APT samples from campaigns across 2018 and 2019.

Ahmad, Kashif, Conci, Nicola, Boato, Giulia, De Natale, Francesco G. B..  2016.  USED: A Large-scale Social Event Detection Dataset. Proceedings of the 7th International Conference on Multimedia Systems. :50:1–50:6.

Event discovery from single pictures is a challenging problem that has raised significant interest in the last decade. During this time, a number of interesting solutions have been proposed to tackle event discovery in still images. However, a large scale benchmarking image dataset for the evaluation and comparison of event discovery algorithms from single images is still lagging behind. To this aim, in this paper we provide a large-scale properly annotated and balanced dataset of 490,000 images, covering every aspect of 14 different types of social events, selected among the most shared ones in the social network. Such a large scale collection of event-related images is intended to become a powerful support tool for the research community in multimedia analysis by providing a common benchmark for training, testing, validation and comparison of existing and novel algorithms. In this paper, we provide a detailed description of how the dataset is collected, organized and how it can be beneficial for the researchers in the multimedia analysis domain. Moreover, a deep learning based approach is introduced into event discovery from single images as one of the possible applications of this dataset with a belief that deep learning can prove to be a breakthrough also in this research area. By providing this dataset, we hope to gather research community in the multimedia and signal processing domains to advance this application.