# Biblio

Found 117 results

Filters: Keyword is pattern classification  [Clear All Filters]
2019-12-02
.  2018.  2018 IEEE 3rd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC). :1208–1214.
Supervisory control and data acquisition system is an important part of the country's critical infrastructure, but its inherent network characteristics are vulnerable to attack by intruders. The vulnerability of supervisory control and data acquisition system was analyzed, combining common attacks such as information scanning, response injection, command injection and denial of service in industrial control systems, and proposed an intrusion detection model based on graphical features. The time series of message transmission were visualized, extracting the vertex coordinates and various graphic area features to constitute a new data set, and obtained classification model of intrusion detection through training. An intrusion detection experiment environment was built using tools such as MATLAB and power protocol testers. IEC 60870-5-104 protocol which is widely used in power systems had been taken as an example. The results of tests have good effectiveness.
2019-11-26
.  2019.  2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). :1-6.

Phishing as one of the most well-known cybercrime activities is a deception of online users to steal their personal or confidential information by impersonating a legitimate website. Several machine learning-based strategies have been proposed to detect phishing websites. These techniques are dependent on the features extracted from the website samples. However, few studies have actually considered efficient feature selection for detecting phishing attacks. In this work, we investigate an agreement on the definitive features which should be used in phishing detection. We apply Fuzzy Rough Set (FRS) theory as a tool to select most effective features from three benchmarked data sets. The selected features are fed into three often used classifiers for phishing detection. To evaluate the FRS feature selection in developing a generalizable phishing detection, the classifiers are trained by a separate out-of-sample data set of 14,000 website samples. The maximum F-measure gained by FRS feature selection is 95% using Random Forest classification. Also, there are 9 universal features selected by FRS over all the three data sets. The F-measure value using this universal feature set is approximately 93% which is a comparable result in contrast to the FRS performance. Since the universal feature set contains no features from third-part services, this finding implies that with no inquiry from external sources, we can gain a faster phishing detection which is also robust toward zero-day attacks.

2019-11-25
.  2019.  2019 21st International Conference on Advanced Communication Technology (ICACT). :692–696.

In the field of process mining, a lot of information about what happened inside the information system has been exploited and has yielded significant results. However, information related to the relationship between performers and performers is only utilized and evaluated in certain aspects. In this paper, we propose an algorithm to classify the temporal work transference from workflow enactment event log. This result may be used to reduce system memory, increase the computation speed. Furthermore, it can be used as one of the factors to evaluate the performer, active role of resources in the information system.

2019-11-12
.  2019.  2019 IEEE/ACM 7th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE). :8-14.

The rapid rise of cyber-crime activities and the growing number of devices threatened by them place software security issues in the spotlight. As around 90% of all attacks exploit known types of security issues, finding vulnerable components and applying existing mitigation techniques is a viable practical approach for fighting against cyber-crime. In this paper, we investigate how the state-of-the-art machine learning techniques, including a popular deep learning algorithm, perform in predicting functions with possible security vulnerabilities in JavaScript programs. We applied 8 machine learning algorithms to build prediction models using a new dataset constructed for this research from the vulnerability information in public databases of the Node Security Project and the Snyk platform, and code fixing patches from GitHub. We used static source code metrics as predictors and an extensive grid-search algorithm to find the best performing models. We also examined the effect of various re-sampling strategies to handle the imbalanced nature of the dataset. The best performing algorithm was KNN, which created a model for the prediction of vulnerable functions with an F-measure of 0.76 (0.91 precision and 0.66 recall). Moreover, deep learning, tree and forest based classifiers, and SVM were competitive with F-measures over 0.70. Although the F-measures did not vary significantly with the re-sampling strategies, the distribution of precision and recall did change. No re-sampling seemed to produce models preferring high precision, while re-sampling strategies balanced the IR measures.

2019-08-05
.  2018.  NOMS 2018 - 2018 IEEE/IFIP Network Operations and Management Symposium. :1-6.

Anomaly detection on security logs is receiving more and more attention. Authentication events are an important component of security logs, and being able to produce trustful and accurate predictions minimizes the effort of cyber-experts to stop false attacks. Observed events are classified into Normal, for legitimate user behavior, and Malicious, for malevolent actions. These classes are consistently excessively imbalanced which makes the classification problem harder; in the commonly used Los Alamos dataset, the malicious class comprises only 0.00033% of the total. This work proposes a novel method to extract advanced composite features, and a supervised learning technique for classifying authentication logs trustfully; the models are Random Forest, LogitBoost, Logistic Regression, and ultimately Majority Voting which leverages the predictions of the previous models and gives the final prediction for each authentication event. We measure the performance of our experiments by using the False Negative Rate and False Positive Rate. In overall we achieve 0 False Negative Rate (i.e. no attack was missed), and on average a False Positive Rate of 0.0019.

2019-07-01
.  2018.  2018 IEEE 27th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE). :106-111.

Traditional anti-virus technologies have failed to keep pace with proliferation of malware due to slow process of their signatures and heuristics updates. Similarly, there are limitations of time and resources in order to perform manual analysis on each malware. There is a need to learn from this vast quantity of data, containing cyber attack pattern, in an automated manner to proactively adapt to ever-evolving threats. Machine learning offers unique advantages to learn from past cyber attacks to handle future cyber threats. The purpose of this research is to propose a framework for multi-classification of malware into well-known categories by applying different machine learning models over corpus of malware analysis reports. These reports are generated through an open source malware sandbox in an automated manner. We applied extensive pre-modeling techniques for data cleaning, features exploration and features engineering to prepare training and test datasets. Best possible hyper-parameters are selected to build machine learning models. These prepared datasets are then used to train the machine learning classifiers and to compare their prediction accuracy. Finally, these results are validated through a comprehensive 10-fold cross-validation methodology. The best results are achieved through Gaussian Naive Bayes classifier with random accuracy of 96% and 10-Fold Cross Validation accuracy of 91.2%. The said framework can be deployed in an operational environment to learn from malware attacks for proactively adapting matching counter measures.

2019-06-24
.  2018.  MILCOM 2018 - 2018 IEEE Military Communications Conference (MILCOM). :1–8.

Recently researchers have proposed using deep learning-based systems for malware detection. Unfortunately, all deep learning classification systems are vulnerable to adversarial learning-based attacks, or adversarial attacks, where miscreants can avoid detection by the classification algorithm with very few perturbations of the input data. Previous work has studied adversarial attacks against static analysis-based malware classifiers which only classify the content of the unknown file without execution. However, since the majority of malware is either packed or encrypted, malware classification based on static analysis often fails to detect these types of files. To overcome this limitation, anti-malware companies typically perform dynamic analysis by emulating each file in the anti-malware engine or performing in-depth scanning in a virtual machine. These strategies allow the analysis of the malware after unpacking or decryption. In this work, we study different strategies of crafting adversarial samples for dynamic analysis. These strategies operate on sparse, binary inputs in contrast to continuous inputs such as pixels in images. We then study the effects of two, previously proposed defensive mechanisms against crafted adversarial samples including the distillation and ensemble defenses. We also propose and evaluate the weight decay defense. Experiments show that with these three defenses, the number of successfully crafted adversarial samples is reduced compared to an unprotected baseline system. In particular, the ensemble defense is the most resilient to adversarial attacks. Importantly, none of the defenses significantly reduce the classification accuracy for detecting malware. Finally, we show that while adding additional hidden layers to neural models does not significantly improve the malware classification accuracy, it does significantly increase the classifier's robustness to adversarial attacks.

2019-06-10
.  2018.  2018 5th Asian Conference on Defense Technology (ACDT). :8-13.

Early detection of new kinds of malware always plays an important role in defending the network systems. Especially, if intelligent protection systems could themselves detect an existence of new malware types in their system, even with a very small number of malware samples, it must be a huge benefit for the organization as well as the social since it help preventing the spreading of that kind of malware. To deal with learning from few samples, term one-shot learning'' or fewshot learning'' was introduced, and mostly used in computer vision to recognize images, handwriting, etc. An approach introduced in this paper takes advantage of One-shot learning algorithms in solving the malware classification problem by using Memory Augmented Neural Network in combination with malware's API calls sequence, which is a very valuable source of information for identifying malware behavior. In addition, it also use some advantages of the development in Natural Language Processing field such as word2vec, etc. to convert those API sequences to numeric vectors before feeding to the one-shot learning network. The results confirm very good accuracies compared to the other traditional methods.

.  2018.  2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI). :330-336.

With the increase in the popularity of computerized online applications, the analysis, and detection of a growing number of newly discovered stealthy malware poses a significant challenge to the security community. Signature-based and behavior-based detection techniques are becoming inefficient in detecting new unknown malware. Machine learning solutions are employed to counter such intelligent malware and allow performing more comprehensive malware detection. This capability leads to an automatic analysis of malware behavior. The proposed oblique random forest ensemble learning technique is efficient for malware classification. The effectiveness of the proposed method is demonstrated with three malware classification datasets from various sources. The results are compared with other variants of decision tree learning models. The proposed system performs better than the existing system in terms of classification accuracy and false positive rate.

.  2018.  2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI). :1-9.

Lately, we are facing the Malware crisis due to various types of malware or malicious programs or scripts available in the huge virtual world - the Internet. But, what is malware? Malware can be a malicious software or a program or a script which can be harmful to the user's computer. These malicious programs can perform a variety of functions, including stealing, encrypting or deleting sensitive data, altering or hijacking core computing functions and monitoring users' computer activity without their permission. There are various entry points for these programs and scripts in the user environment, but only one way to remove them is to find them and kick them out of the system which isn't an easy job as these small piece of script or code can be anywhere in the user system. This paper involves the understanding of different types of malware and how we will use Machine Learning to detect these malwares.

.  2018.  2018 IEEE Symposium on Computers and Communications (ISCC). :00129-00135.

In view of the great threat posed by malware and the rapid growing trend about malware variants, it is necessary to determine the category of new samples accurately for further analysis and taking appropriate countermeasures. The network behavior based classification methods have become more popular now. However, the behavior profiling models they used usually only depict partial network behavior of samples or require specific traffic selection in advance, which may lead to adverse effects on categorizing advanced malware with complex activities. In this paper, to overcome the shortages of traditional models, we raise a comprehensive behavior model for profiling the behavior of malware network activities. And we also propose a corresponding malware classification method which can extract and compare the major behavior of samples. The experimental and comparison results not only demonstrate our method can categorize samples accurately in both criteria, but also prove the advantage of our profiling model to two other approaches in accuracy performance, especially under scenario based criteria.

.  2018.  2018 20th International Conference on Advanced Communication Technology (ICACT). :40-44.

Malware or Malicious Software, are an important threat to information technology society. Deep Neural Network has been recently achieving a great performance for the tasks of malware detection and classification. In this paper, we propose a convolutional gated recurrent neural network model that is capable of classifying malware to their respective families. The model is applied to a set of malware divided into 9 different families and that have been proposed during the Microsoft Malware Classification Challenge in 2015. The model shows an accuracy of 92.6% on the available dataset.

.  2018.  2018 13th International Conference on Malicious and Unwanted Software (MALWARE). :103-111.

Behavioral malware detection aims to improve on the performance of static signature-based techniques used by anti-virus systems, which are less effective against modern polymorphic and metamorphic malware. Behavioral malware classification aims to go beyond the detection of malware by also identifying a malware's family according to a naming scheme such as the ones used by anti-virus vendors. Behavioral malware classification techniques use run-time features, such as file system or network activities, to capture the behavioral characteristic of running processes. The increasing volume of malware samples, diversity of malware families, and the variety of naming schemes given to malware samples by anti-virus vendors present challenges to behavioral malware classifiers. We describe a behavioral classifier that uses a Convolutional Recurrent Neural Network and data from Microsoft Windows Prefetch files. We demonstrate the model's improvement on the state-of-the-art using a large dataset of malware families and four major anti-virus vendor naming schemes. The model is effective in classifying malware samples that belong to common and rare malware families and can incrementally accommodate the introduction of new malware samples and families.

.  2018.  2018 16th Annual Conference on Privacy, Security and Trust (PST). :1-8.

While the rapid adaptation of mobile devices changes our daily life more conveniently, the threat derived from malware is also increased. There are lots of research to detect malware to protect mobile devices, but most of them adopt only signature-based malware detection method that can be easily bypassed by polymorphic and metamorphic malware. To detect malware and its variants, it is essential to adopt behavior-based detection for efficient malware classification. This paper presents a system that classifies malware by using common behavioral characteristics along with malware families. We measure the similarity between malware families with carefully chosen features commonly appeared in the same family. With the proposed similarity measure, we can classify malware by malware's attack behavior pattern and tactical characteristics. Also, we apply community detection algorithm to increase the modularity within each malware family network aggregation. To maintain high classification accuracy, we propose a process to derive the optimal weights of the selected features in the proposed similarity measure. During this process, we find out which features are significant for representing the similarity between malware samples. Finally, we provide an intuitive graph visualization of malware samples which is helpful to understand the distribution and likeness of the malware networks. In the experiment, the proposed system achieved 97% accuracy for malware classification and 95% accuracy for prediction by K-fold cross-validation using the real malware dataset.

.  2018.  IEEE INFOCOM 2018 - IEEE Conference on Computer Communications. :1475-1483.

Better understanding of mobile applications' behaviors would lead to better malware detection/classification and better app recommendation for users. In this work, we design a framework AppDNA to automatically generate a compact representation for each app to comprehensively profile its behaviors. The behavior difference between two apps can be measured by the distance between their representations. As a result, the versatile representation can be generated once for each app, and then be used for a wide variety of objectives, including malware detection, app categorizing, plagiarism detection, etc. Based on a systematic and deep understanding of an app's behavior, we propose to perform a function-call-graph-based app profiling. We carefully design a graph-encoding method to convert a typically extremely large call-graph to a 64-dimension fix-size vector to achieve robust app profiling. Our extensive evaluations based on 86,332 benign and malicious apps demonstrate that our system performs app profiling (thus malware detection, classification, and app recommendation) to a high accuracy with extremely low computation cost: it classifies 4024 (benign/malware) apps using around 5.06 second with accuracy about 93.07%; it classifies 570 malware's family (total 21 families) using around 0.83 second with accuracy 82.3%; it classifies 9,730 apps' functionality with accuracy 33.3% for a total of 7 categories and accuracy of 88.1 % for 2 categories.

.  2018.  2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). :1029-1033.

In this paper we present a new approach, named DLGraph, for malware detection using deep learning and graph embedding. DLGraph employs two stacked denoising autoencoders (SDAs) for representation learning, taking into consideration computer programs' function-call graphs and Windows application programming interface (API) calls. Given a program, we first use a graph embedding technique that maps the program's function-call graph to a vector in a low-dimensional feature space. One SDA in our deep learning model is used to learn a latent representation of the embedded vector of the function-call graph. The other SDA in our model is used to learn a latent representation of the given program's Windows API calls. The two learned latent representations are then merged to form a combined feature vector. Finally, we use softmax regression to classify the combined feature vector for predicting whether the given program is malware or not. Experimental results based on different datasets demonstrate the effectiveness of the proposed approach and its superiority over a related method.

.  2018.  2018 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT). :1-6.

The Machine Type Communication Devices (MTCDs) are usually based on Internet Protocol (IP), which can cause billions of connected objects to be part of the Internet. The enormous amount of data coming from these devices are quite heterogeneous in nature, which can lead to security issues, such as injection attacks, ballot stuffing, and bad mouthing. Consequently, this work considers machine learning trust evaluation as an effective and accurate option for solving the issues associate with security threats. In this paper, a comparative analysis is carried out with five different machine learning approaches: Naive Bayes (NB), Decision Tree (DT), Linear and Radial Support Vector Machine (SVM), KNearest Neighbor (KNN), and Random Forest (RF). As a critical element of the research, the recommendations consider different Machine-to-Machine (M2M) communication nodes with regard to their ability to identify malicious and honest information. To validate the performances of these models, two trust computation measures were used: Receiver Operating Characteristics (ROCs), Precision and Recall. The malicious data was formulated in Matlab. A scenario was created where 50% of the information were modified to be malicious. The malicious nodes were varied in the ranges of 10%, 20%, 30%, 40%, and the results were carefully analyzed.

.  2018.  2018 International Conference on Computing, Power and Communication Technologies (GUCON). :111-114.

This computer era leads human to interact with computers and networks but there is no such solution to get rid of security problems. Securities threats misleads internet, we are sometimes losing our hope and reliability with many server based access. Even though many more crypto algorithms are coming for integrity and authentic data in computer access still there is a non reliable threat penetrates inconsistent vulnerabilities in networks. These vulnerable sites are taking control over the user's computer and doing harmful actions without user's privileges. Though Firewalls and protocols may support our browsers via setting certain rules, still our system couldn't support for data reliability and confidentiality. Since these problems are based on network access, lets we consider TCP/IP parameters as a dataset for analysis. By doing preprocess of TCP/IP packets we can build sovereign model on data set and clump cluster. Further the data set gets classified into regular traffic pattern and anonymous pattern using KNN classification algorithm. Based on obtained pattern for normal and threats data sets, security devices and system will set rules and guidelines to learn by it to take needed stroke. This paper analysis the computer to learn security actions from the given data sets which already exist in the previous happens.

2019-05-08
.  2018.  2018 IEEE Third International Conference on Data Science in Cyberspace (DSC). :576–581.

With the evolution of network threat, identifying threat from internal is getting more and more difficult. To detect malicious insiders, we move forward a step and propose a novel attribute classification insider threat detection method based on long short term memory recurrent neural networks (LSTM-RNNs). To achieve high detection rate, event aggregator, feature extractor, several attribute classifiers and anomaly calculator are seamlessly integrated into an end-to-end detection framework. Using the CERT insider threat dataset v6.2 and threat detection recall as our performance metric, experimental results validate that the proposed threat detection method greatly outperforms k-Nearest Neighbor, Isolation Forest, Support Vector Machine and Principal Component Analysis based threat detection methods.

2019-03-15
.  2018.  2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). :1452-1457.

Integrated circuits (ICs) are becoming vulnerable to hardware Trojans. Most of existing works require golden chips to provide references for hardware Trojan detection. However, a golden chip is extremely difficult to obtain. In previous work, we have proposed a classification-based golden chips-free hardware Trojan detection technique. However, the algorithm in the previous work are trained by simulated ICs without considering that there may be a shift which occurs between the simulation and the silicon fabrication. It is necessary to learn from actual silicon fabrication in order to obtain an accurate and effective classification model. We propose a co-training based hardware Trojan detection technique exploiting unlabeled fabricated ICs and inaccurate simulation models, to provide reliable detection capability when facing fabricated ICs, while eliminating the need of fabricated golden chips. First, we train two classification algorithms using simulated ICs. During test-time, the two algorithms can identify different patterns in the unlabeled ICs, and thus be able to label some of these ICs for the further training of the another algorithm. Moreover, we use a statistical examination to choose ICs labeling for the another algorithm in order to help prevent a degradation in performance due to the increased noise in the labeled ICs. We also use a statistical technique for combining the hypotheses from the two classification algorithms to obtain the final decision. The theoretical basis of why the co-training method can work is also described. Experiment results on benchmark circuits show that the proposed technique can detect unknown Trojans with high accuracy (92% 97%) and recall (88% 95%).

.  2018.  2018 IEEE International Conference on Applied System Invention (ICASI). :1107-1110.
In practice, Defenders need a more efficient network detection approach which has the advantages of quick-responding learning capability of new network behavioural features for network intrusion detection purpose. In many applications the capability of Deep Learning techniques has been confirmed to outperform classic approaches. Accordingly, this study focused on network intrusion detection using convolutional neural networks (CNNs) based on LeNet-5 to classify the network threats. The experiment results show that the prediction accuracy of intrusion detection goes up to 99.65% with samples more than 10,000. The overall accuracy rate is 97.53%.
2019-03-06
.  2018.  2018 Design, Automation Test in Europe Conference Exhibition (DATE). :1175-1178.

Deep neural networks (DNNs) are effective machine learning models to solve a large class of recognition problems, including the classification of nonlinearly separable patterns. The applications of DNNs are, however, limited by the large size and high energy consumption of the networks. Recently, stochastic computation (SC) has been considered to implement DNNs to reduce the hardware cost. However, it requires a large number of random number generators (RNGs) that lower the energy efficiency of the network. To overcome these limitations, we propose the design of an energy-efficient deep belief network (DBN) based on stochastic computation. An approximate SC activation unit (A-SCAU) is designed to implement different types of activation functions in the neurons. The A-SCAU is immune to signal correlations, so the RNGs can be shared among all neurons in the same layer with no accuracy loss. The area and energy of the proposed design are 5.27% and 3.31% (or 26.55% and 29.89%) of a 32-bit floating-point (or an 8-bit fixed-point) implementation. It is shown that the proposed SC-DBN design achieves a higher classification accuracy compared to the fixed-point implementation. The accuracy is only lower by 0.12% than the floating-point design at a similar computation speed, but with a significantly lower energy consumption.

.  2018.  2018 IEEE/ACS 15th International Conference on Computer Systems and Applications (AICCSA). :1-7.

Cybersecurity plays a critical role in protecting sensitive information and the structural integrity of networked systems. As networked systems continue to expand in numbers as well as in complexity, so does the threat of malicious activity and the necessity for advanced cybersecurity solutions. Furthermore, both the quantity and quality of available data on malicious content as well as the fact that malicious activity continuously evolves makes automated protection systems for this type of environment particularly challenging. Not only is the data quality a concern, but the volume of the data can be quite small for some of the classes. This creates a class imbalance in the data used to train a classifier; however, many classifiers are not well equipped to deal with class imbalance. One such example is detecting malicious HMTL files from static features. Unfortunately, collecting malicious HMTL files is extremely difficult and can be quite noisy from HTML files being mislabeled. This paper evaluates a specific application that is afflicted by these modern cybersecurity challenges: detection of malicious HTML files. Previous work presented a general framework for malicious HTML file classification that we modify in this work to use a $\chi$2 feature selection technique and synthetic minority oversampling technique (SMOTE). We experiment with different classifiers (i.e., AdaBoost, Gentle-Boost, RobustBoost, RusBoost, and Random Forest) and a pure detection model (i.e., Isolation Forest). We benchmark the different classifiers using SMOTE on a real dataset that contains a limited number of malicious files (40) with respect to the normal files (7,263). It was found that the modified framework performed better than the previous framework's results. However, additional evidence was found to imply that algorithms which train on both the normal and malicious samples are likely overtraining to the malicious distribution. We demonstrate the likely overtraining by determining that a subset of the malicious files, while suspicious, did not come from a malicious source.

2019-03-04
.  2018.  NOMS 2018 - 2018 IEEE/IFIP Network Operations and Management Symposium. :1–15.
Conversational IT services are expected to reduce user wait times and improve overall customer satisfaction. Cloud-based solutions are readily available for enterprise subject matter experts (SMEs) to train user-question classifiers and build conversational services with little effort. However, methodologies that the SMEs can use to improve the response accuracy and conversation quality are merely stated and evaluated. In complex service scenarios such as software support, the scope of topics is typically large and the training samples are often limited. Thus, training the classifier based on labeled samples of plain user utterances is not effective in most cases. In this paper, we identify several methods for improving classification quality and evaluate them in concrete training set scenarios. Particularly, a process-based methodology is described that builds and refines on top of service domain knowledge in order to develop a scalable solution for training accurate conversation services. Enterprises and service providers are continuously seeking new ways to improve customer experience on working with IT systems, where user wait times and service resolution quality are critical business metrics. One of the latest trends is the use of conversational IT services. Customers can interact with a conversational service to express their questions in natural language and the system can automatically return relevant answers or execute back-end processes for automated actions. Various text classification techniques have been developed and applied to understand the user questions and trigger the correct responses. For instance, in the context of IT software support, customers can use conversational systems to get answers about software product errors, licenses, or upgrade processes. While the potential benefits of building conversational services are huge, it is often difficult to effectively train classification models that cover well the scope of realistically complex services. In this paper, we propose a training methodology that addresses the limitations in both the scope of topics and the scarcity of the training set. We further evaluate the proposed methodology in a real service support scenario and share the lessons learned.
.  2018.  2018 IEEE International Conference on Information Reuse and Integration (IRI). :269–276.

At a time when all it takes to open a Twitter account is a mobile phone, the act of authenticating information encountered on social media becomes very complex, especially when we lack measures to verify digital identities in the first place. Because the platform supports anonymity, fake news generated by dubious sources have been observed to travel much faster and farther than real news. Hence, we need valid measures to identify authors of misinformation to avert these consequences. Researchers propose different authorship attribution techniques to approach this kind of problem. However, because tweets are made up of only 280 characters, finding a suitable authorship attribution technique is a challenge. This research aims to classify authors of tweets by comparing machine learning methods like logistic regression and naive Bayes. The processes of this application are fetching of tweets, pre-processing, feature extraction, and developing a machine learning model for classification. This paper illustrates the text classification for authorship process using machine learning techniques. In total, there were 46,895 tweets used as both training and testing data, and unique features specific to Twitter were extracted. Several steps were done in the pre-processing phase, including removal of short texts, removal of stop-words and punctuations, tokenizing and stemming of texts as well. This approach transforms the pre-processed data into a set of feature vector in Python. Logistic regression and naive Bayes algorithms were applied to the set of feature vectors for the training and testing of the classifier. The logistic regression based classifier gave the highest accuracy of 91.1% compared to the naive Bayes classifier with 89.8%.