Classification for Authorship of Tweets by Comparing Logistic Regression and Naive Bayes Classifiers

Title: Classification for Authorship of Tweets by Comparing Logistic Regression and Naive Bayes Classifiers
Publication Type: Conference Paper
Year of Publication: 2018
Authors: Aborisade, O., Anwar, M.
Conference Name: 2018 IEEE International Conference on Information Reuse and Integration (IRI)
Date Published: July
Keywords: anonymity, attribution, authorisation, authorship attribution, authorship attribution techniques, Bayes methods, classification, composability, digital identities, dubious sources, Electronic mail, fake news, feature extraction, feature vector, Human Behavior, human-in-the-loop security center paradigm, learning (artificial intelligence), logistic regression, logistic regression based classifier, Logistics, machine learning, machine learning model, machine learning techniques, Metrics, Mobile Phone, naive Bayes, naïve Bayes classifier, pattern classification, pre-processed data, privacy, pubcrawl, python, regression analysis, security, social computing, social media, social networking (online), text analysis, text classification, Training, tweets authorship, Twitter, Twitter account

At a time when all it takes to open a Twitter account is a mobile phone, authenticating information encountered on social media becomes very complex, especially when we lack measures to verify digital identities in the first place. Because the platform supports anonymity, fake news generated by dubious sources has been observed to travel much faster and farther than real news. Hence, we need valid measures to identify authors of misinformation to avert these consequences. Researchers have proposed different authorship attribution techniques to approach this kind of problem. However, because tweets are limited to 280 characters, finding a suitable authorship attribution technique is a challenge. This research aims to classify authors of tweets by comparing machine learning methods such as logistic regression and naive Bayes. The process consists of fetching tweets, pre-processing, feature extraction, and developing a machine learning model for classification. This paper illustrates the authorship classification process using machine learning techniques. In total, 46,895 tweets were used as training and testing data, and unique features specific to Twitter were extracted. The pre-processing phase included removal of short texts, removal of stop-words and punctuation, tokenizing, and stemming. This approach transforms the pre-processed data into a set of feature vectors in Python. Logistic regression and naive Bayes algorithms were applied to the set of feature vectors for the training and testing of the classifier. The logistic regression based classifier gave the higher accuracy, 91.1%, compared with 89.8% for the naive Bayes classifier.
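The pipeline described in the abstract (pre-processing, feature extraction, then training the two classifiers) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the abstract does not name the libraries, features, or hyperparameters used. The stop-word list, the crude suffix-stripping stand-in for a stemmer, the bag-of-words features, the two author names, and the toy tweets are all assumptions made up for this example, and the logistic regression shown handles only the two-author case.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; not taken from the paper).
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "on", "for"}

def preprocess(text, min_tokens=2):
    """Tokenize, drop punctuation and stop-words, crudely stem; None if too short."""
    tokens = [t for t in re.findall(r"[a-z0-9#@]+", text.lower()) if t not in STOP_WORDS]
    # Crude suffix stripping as a stand-in for a real stemmer.
    stemmed = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]
    return stemmed if len(stemmed) >= min_tokens else None

def build_vocab(docs):
    return {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

def vectorize(doc, vocab):
    """Bag-of-words count vector; tokens outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for tok, n in Counter(doc).items():
        if tok in vocab:
            vec[vocab[tok]] = float(n)
    return vec

def train_nb(X, y):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + len(totals)
        model[c] = (math.log(len(rows) / len(X)),
                    [math.log((t + 1) / denom) for t in totals])
    return model

def predict_nb(model, x):
    def score(c):
        log_prior, log_probs = model[c]
        return log_prior + sum(lp * v for lp, v in zip(log_probs, x))
    return max(model, key=score)

def train_lr(X, y, positive, epochs=200, lr=0.1):
    """Binary logistic regression fitted by stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - (1.0 if label == positive else 0.0)
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict_lr(model, x, positive, negative):
    w, b = model
    return positive if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else negative

# Made-up toy tweets from two hypothetical authors.
tweets = [
    ("alice", "Loving the new Python release today"),
    ("alice", "Training a model in Python all night"),
    ("alice", "Python testing is fun"),
    ("bob", "Great football match tonight"),
    ("bob", "The football team played well"),
    ("bob", "Watching the match with friends"),
]
docs = [(author, preprocess(text)) for author, text in tweets]
docs = [(author, d) for author, d in docs if d is not None]  # drop short texts
vocab = build_vocab([d for _, d in docs])
X = [vectorize(d, vocab) for _, d in docs]
y = [author for author, _ in docs]

nb_model = train_nb(X, y)
lr_model = train_lr(X, y, positive="alice")

held_out = ["python model training", "football match friends"]
X_test = [vectorize(preprocess(t), vocab) for t in held_out]
nb_pred = [predict_nb(nb_model, x) for x in X_test]
lr_pred = [predict_lr(lr_model, x, "alice", "bob") for x in X_test]
print("naive Bayes:", nb_pred)
print("logistic regression:", lr_pred)
```

On a real corpus, a held-out test split and an established stemmer and classifier library would replace the toy pieces above; the sketch is only meant to make the order of the steps concrete.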

Citation Key: aborisade_classification_2018