Visible to the public Biblio

Filters: Keyword is text mining  [Clear All Filters]
Bhagat, V., J, B. R..  2020.  Natural Language Processing on Diverse Data Layers Through Microservice Architecture. 2020 IEEE International Conference for Innovation in Technology (INOCON). :1–6.
With the rapid growth in Natural Language Processing (NLP), all types of industries find a need for analyzing a massive amount of data. Sentiment analysis is becoming a more exciting area for the businessmen and researchers in Text mining & NLP. This process includes the calculation of various sentiments with the help of text mining. Supplementary to this, the world is connected through Information Technology and, businesses are moving toward the next step of the development to make their system more intelligent. Microservices have fulfilled the need for development platforms which help the developers to use various development tools (Languages and applications) efficiently. With the consideration of data analysis for business growth, data security becomes a major concern in front of developers. This paper gives a solution to keep the data secured by providing required access to data scientists without disturbing the base system software. This paper has discussed data storage and exchange policies of microservices through common JavaScript Object Notation (JSON) response which performs the sentiment analysis of customer's data fetched from various microservices through secured APIs.
Zhao, F., Skums, P., Zelikovsky, A., Sevigny, E. L., Swahn, M. H., Strasser, S. M., Huang, Y., Wu, Y..  2020.  Computational Approaches to Detect Illicit Drug Ads and Find Vendor Communities Within Social Media Platforms. IEEE/ACM Transactions on Computational Biology and Bioinformatics. :1–1.
The opioid abuse epidemic represents a major public health threat to global populations. The role social media may play in facilitating illicit drug trade is largely unknown due to limited research. However, it is known that social media use among adults in the US is widespread, there is vast capability for online promotion of illegal drugs with delayed or limited deterrence of such messaging, and further, general commercial sale applications provide safeguards for transactions; however, they do not discriminate between legal and illegal sale transactions. These characteristics of the social media environment present challenges to surveillance which is needed for advancing knowledge of online drug markets and the role they play in the drug abuse and overdose deaths. In this paper, we present a computational framework developed to automatically detect illicit drug ads and communities of vendors.The SVM- and CNNbased methods for detecting illicit drug ads, and a matrix factorization based method for discovering overlapping communities have been extensively validated on the large dataset collected from Google+, Flickr and Tumblr. Pilot test results demonstrate that our computational methods can effectively identify illicit drug ads and detect vendor-community with accuracy. These methods hold promise to advance scientific knowledge surrounding the role social media may play in perpetuating the drug abuse epidemic.
Han, H., Wang, Q., Chen, C..  2019.  Policy Text Analysis Based on Text Mining and Fuzzy Cognitive Map. 2019 15th International Conference on Computational Intelligence and Security (CIS). :142—146.
With the introduction of computer methods, the amount of material and processing accuracy of policy text analysis have been greatly improved. In this paper, Text mining(TM) and latent semantic analysis(LSA) were used to collect policy documents and extract policy elements from them. Fuzzy association rule mining(FARM) technique and partial association test (PA) were used to discover the causal relationships and impact degrees between elements, and a fuzzy cognitive map (FCM) was developed to deduct the evolution of elements through a soft computing method. This non-interventionist approach avoids the validity defects caused by the subjective bias of researchers and provides policy makers with more objective policy suggestions from a neutral perspective. To illustrate the accuracy of this method, this study experimented by taking the state-owned capital layout adjustment related policies as an example, and proved that this method can effectively analyze policy text.
Xylogiannopoulos, Konstantinos F., Karampelas, Panagiotis, Alhajj, Reda.  2019.  Text Mining for Malware Classification Using Multivariate All Repeated Patterns Detection. 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). :887—894.

Mobile phones have become nowadays a commodity to the majority of people. Using them, people are able to access the world of Internet and connect with their friends, their colleagues at work or even unknown people with common interests. This proliferation of the mobile devices has also been seen as an opportunity for the cyber criminals to deceive smartphone users and steel their money directly or indirectly, respectively, by accessing their bank accounts through the smartphones or by blackmailing them or selling their private data such as photos, credit card data, etc. to third parties. This is usually achieved by installing malware to smartphones masking their malevolent payload as a legitimate application and advertise it to the users with the hope that mobile users will install it in their devices. Thus, any existing application can easily be modified by integrating a malware and then presented it as a legitimate one. In response to this, scientists have proposed a number of malware detection and classification methods using a variety of techniques. Even though, several of them achieve relatively high precision in malware classification, there is still space for improvement. In this paper, we propose a text mining all repeated pattern detection method which uses the decompiled files of an application in order to classify a suspicious application into one of the known malware families. Based on the experimental results using a real malware dataset, the methodology tries to correctly classify (without any misclassification) all randomly selected malware applications of 3 categories with 3 different families each.

Chung, Wingyan, Liu, Jinwei, Tang, Xinlin, Lai, Vincent S. K..  2018.  Extracting Textual Features of Financial Social Media to Detect Cognitive Hacking. 2018 IEEE International Conference on Intelligence and Security Informatics (ISI). :244–246.
Social media are increasingly reflecting and influencing the behavior of human and financial market. Cognitive hacking leverages the influence of social media to spread deceptive information with an intent to gain abnormal profits illegally or to cause losses. Measuring the information content in financial social media can be useful for identifying these attacks. In this paper, we developed an approach to identifying social media features that correlate with abnormal returns of the stocks of companies vulnerable to be targets of cognitive hacking. To test the approach, we collected price data and 865,289 social media messages on four technology companies from July 2017 to June 2018, and extracted features that contributed to abnormal stock movements. Preliminary results show that terms that are simple, motivate actions, incite emotion, and uses exaggeration are ranked high in the features of messages associated with abnormal price movements. We also provide selected messages to illustrate the use of these features in potential cognitive hacking attacks.
Srisopha, Kamonphop, Phonsom, Chukiat, Lin, Keng, Boehm, Barry.  2019.  Same App, Different Countries: A Preliminary User Reviews Study on Most Downloaded iOS Apps. 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). :76—80.
Prior work on mobile app reviews has demonstrated that user reviews contain a wealth of information and are seen as a potential source of requirements. However, most of the studies done in this area mainly focused on mining and analyzing user reviews from the US App Store, leaving reviews of users from other countries unexplored. In this paper, we seek to understand if the perception of the same apps between users from other countries and that from the US differs through analyzing user reviews. We retrieve 300,643 user reviews of the 15 most downloaded iOS apps of 2018, published directly by Apple, from nine English-speaking countries over the course of 5 months. We manually classify 3,358 reviews into several software quality and improvement factors. We leverage a random forest based algorithm to identify factors that can be used to differentiate reviews between the US and other countries. Our preliminary results show that all countries have some factors that are proportionally inconsistent with the US.
Bakhtin, Vadim V., Isaeva, Ekaterina V..  2019.  New TSBuilder: Shifting towards Cognition. 2019 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus). :179–181.
The paper reviews a project on the automation of term system construction. TSBuilder (Term System Builder) was developed in 2014 as a multilayer Rosenblatt's perceptron for supervised machine learning, namely 1-3 word terms identification in natural language texts and their rigid categorization. The program is being modified to reduce the rigidity of categorization which will bring text mining more in line with human thinking.We are expanding the range of parameters (semantical, morphological, and syntactical) for categorization, removing the restriction of the term length of three words, using convolution on a continuous sequence of terms, and present the probabilities of a term falling into different categories. The neural network will not assign a single category to a term but give N answers (where N is the number of predefined classes), each of which O ∈ [0, 1] is the probability of the term to belong to a given class.
Rodriguez, Ariel, Okamura, Koji.  2019.  Generating Real Time Cyber Situational Awareness Information Through Social Media Data Mining. 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC). 2:502–507.
With the rise of the internet many new data sources have emerged that can be used to help us gain insights into the cyber threat landscape and can allow us to better prepare for cyber attacks before they happen. With this in mind, we present an end to end real time cyber situational awareness system which aims to efficiently retrieve security relevant information from the social networking site This system classifies and aggregates the data retrieved and provides real time cyber situational awareness information based on sentiment analysis and data analytics techniques. This research will assist security analysts to evaluate the level of cyber risk in their organization and proactively take actions to plan and prepare for potential attacks before they happen as well as contribute to the field through a cybersecurity tweet dataset.
Jackson, K. A., Bennett, B. T..  2018.  Locating SQL Injection Vulnerabilities in Java Byte Code Using Natural Language Techniques. SoutheastCon 2018. :1-5.

With so much our daily lives relying on digital devices like personal computers and cell phones, there is a growing demand for code that not only functions properly, but is secure and keeps user data safe. However, ensuring this is not such an easy task, and many developers do not have the required skills or resources to ensure their code is secure. Many code analysis tools have been written to find vulnerabilities in newly developed code, but this technology tends to produce many false positives, and is still not able to identify all of the problems. Other methods of finding software vulnerabilities automatically are required. This proof-of-concept study applied natural language processing on Java byte code to locate SQL injection vulnerabilities in a Java program. Preliminary findings show that, due to the high number of terms in the dataset, using singular decision trees will not produce a suitable model for locating SQL injection vulnerabilities, while random forest structures proved more promising. Still, further work is needed to determine the best classification tool.

Husari, G., Niu, X., Chu, B., Al-Shaer, E..  2018.  Using Entropy and Mutual Information to Extract Threat Actions from Cyber Threat Intelligence. 2018 IEEE International Conference on Intelligence and Security Informatics (ISI). :1–6.
With the rapid growth of the cyber attacks, cyber threat intelligence (CTI) sharing becomes essential for providing advance threat notice and enabling timely response to cyber attacks. Our goal in this paper is to develop an approach to extract low-level cyber threat actions from publicly available CTI sources in an automated manner to enable timely defense decision making. Specifically, we innovatively and successfully used the metrics of entropy and mutual information from Information Theory to analyze the text in the cybersecurity domain. Combined with some basic NLP techniques, our framework, called ActionMiner has achieved higher precision and recall than the state-of-the-art Stanford typed dependency parser, which usually works well in general English but not cybersecurity texts.
Georgakopoulos, Spiros V., Tasoulis, Sotiris K., Vrahatis, Aristidis G., Plagianakos, Vassilis P..  2018.  Convolutional Neural Networks for Toxic Comment Classification. Proceedings of the 10th Hellenic Conference on Artificial Intelligence. :35:1-35:6.
Flood of information is produced in a daily basis through the global internet usage arising from the online interactive communications among users. While this situation contributes significantly to the quality of human life, unfortunately it involves enormous dangers, since online texts with high toxicity can cause personal attacks, online harassment and bullying behaviors. This has triggered both industrial and research community in the last few years while there are several attempts to identify an efficient model for online toxic comment prediction. However, these steps are still in their infancy and new approaches and frameworks are required. On parallel, the data explosion that appears constantly, makes the construction of new machine learning computational tools for managing this information, an imperative need. Thankfully advances in hardware, cloud computing and big data management allow the development of Deep Learning approaches appearing very promising performance so far. For text classification in particular the use of Convolutional Neural Networks (CNN) have recently been proposed approaching text analytics in a modern manner emphasizing in the structure of words in a document. In this work, we employ this approach to discover toxic comments in a large pool of documents provided by a current Kaggle's competition regarding Wikipedia's talk page edits. To justify this decision we choose to compare CNNs against the traditional bag-of-words approach for text analysis combined with a selection of algorithms proven to be very effective in text classification. The reported results provide enough evidence that CNN enhance toxic comment classification reinforcing research interest towards this direction.
Eclarin, Bobby A., Fajardo, Arnel C., Medina, Ruji P..  2018.  A Novel Feature Hashing With Efficient Collision Resolution for Bag-of-Words Representation of Text Data. Proceedings of the 2Nd International Conference on Natural Language Processing and Information Retrieval. :12-16.
Text Mining is widely used in many areas transforming unstructured text data from all sources such as patients' record, social media network, insurance data, and news, among others into an invaluable source of information. The Bag Of Words (BoW) representation is a means of extracting features from text data for use in modeling. In text classification, a word in a document is assigned a weight according to its frequency and frequency between different documents; therefore, words together with their weights form the BoW. One way to solve the issue of voluminous data is to use the feature hashing method or hashing trick. However, collision is inevitable and might change the result of the whole process of feature generation and selection. Using the vector data structure, the lookup performance is improved while resolving collision and the memory usage is also efficient.
Spanos, Georgios, Angelis, Lefteris, Toloudis, Dimitrios.  2017.  Assessment of Vulnerability Severity Using Text Mining. Proceedings of the 21st Pan-Hellenic Conference on Informatics. :49:1–49:6.

Software1 vulnerabilities are closely associated with information systems security, a major and critical field in today's technology. Vulnerabilities constitute a constant and increasing threat for various aspects of everyday life, especially for safety and economy, since the social impact from the problems that they cause is complicated and often unpredictable. Although there is an entire research branch in software engineering that deals with the identification and elimination of vulnerabilities, the growing complexity of software products and the variability of software production procedures are factors contributing to the ongoing occurrence of vulnerabilities, Hence, another area that is being developed in parallel focuses on the study and management of the vulnerabilities that have already been reported and registered in databases. The information contained in such databases includes, a textual description and a number of metrics related to vulnerabilities. The purpose of this paper is to investigate to what extend the assessment of the vulnerability severity can be inferred directly from the corresponding textual description, or in other words, to examine the informative power of the description with respect to the vulnerability severity. For this purpose, text mining techniques, i.e. text analysis and three different classification methods (decision trees, neural networks and support vector machines) were employed. The application of text mining to a sample of 70,678 vulnerabilities from a public data source shows that the description itself is a reliable and highly accurate source of information for vulnerability prioritization.

Nembhard, F., Carvalho, M., Eskridge, T..  2017.  A Hybrid Approach to Improving Program Security. 2017 IEEE Symposium Series on Computational Intelligence (SSCI). :1–8.

The security of computer programs and systems is a very critical issue. With the number of attacks launched on computer networks and software, businesses and IT professionals are taking steps to ensure that their information systems are as secure as possible. However, many programmers do not think about adding security to their programs until their projects are near completion. This is a major mistake because a system is as secure as its weakest link. If security is viewed as an afterthought, it is highly likely that the resulting system will have a large number of vulnerabilities, which could be exploited by attackers. One of the reasons programmers overlook adding security to their code is because it is viewed as a complicated or time-consuming process. This paper presents a tool that will help programmers think more about security and add security tactics to their code with ease. We created a model that learns from existing open source projects and documentation using machine learning and text mining techniques. Our tool contains a module that runs in the background to analyze code as the programmer types and offers suggestions of where security could be included. In addition, our tool fetches existing open source implementations of cryptographic algorithms and sample code from repositories to aid programmers in adding security easily to their projects.

Zhang, Y., Mao, W., Zeng, D..  2017.  Topic Evolution Modeling in Social Media Short Texts Based on Recurrent Semantic Dependent CRP. 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). :119–124.

Social media has become an important platform for people to express opinions, share information and communicate with others. Detecting and tracking topics from social media can help people grasp essential information and facilitate many security-related applications. As social media texts are usually short, traditional topic evolution models built based on LDA or HDP often suffer from the data sparsity problem. Recently proposed topic evolution models are more suitable for short texts, but they need to manually specify topic number which is fixed during different time period. To address these issues, in this paper, we propose a nonparametric topic evolution model for social media short texts. We first propose the recurrent semantic dependent Chinese restaurant process (rsdCRP), which is a nonparametric process incorporating word embeddings to capture semantic similarity information. Then we combine rsdCRP with word co-occurrence modeling and build our short-text oriented topic evolution model sdTEM. We carry out experimental studies on Twitter dataset. The results demonstrate the effectiveness of our method to monitor social media topic evolution compared to the baseline methods.

Xylogiannopoulos, K., Karampelas, P., Alhajj, R..  2017.  Text Mining in Unclean, Noisy or Scrambled Datasets for Digital Forensics Analytics. 2017 European Intelligence and Security Informatics Conference (EISIC). :76–83.

In our era, most of the communication between people is realized in the form of electronic messages and especially through smart mobile devices. As such, the written text exchanged suffers from bad use of punctuation, misspelling words, continuous chunk of several words without spaces, tables, internet addresses etc. which make traditional text analytics methods difficult or impossible to be applied without serious effort to clean the dataset. Our proposed method in this paper can work in massive noisy and scrambled texts with minimal preprocessing by removing special characters and spaces in order to create a continuous string and detect all the repeated patterns very efficiently using the Longest Expected Repeated Pattern Reduced Suffix Array (LERP-RSA) data structure and a variant of All Repeated Patterns Detection (ARPaD) algorithm. Meta-analyses of the results can further assist a digital forensics investigator to detect important information to the chunk of text analyzed.

Stuckman, J., Walden, J., Scandariato, R..  2017.  The Effect of Dimensionality Reduction on Software Vulnerability Prediction Models. IEEE Transactions on Reliability. 66:17–37.

Statistical prediction models can be an effective technique to identify vulnerable components in large software projects. Two aspects of vulnerability prediction models have a profound impact on their performance: 1) the features (i.e., the characteristics of the software) that are used as predictors and 2) the way those features are used in the setup of the statistical learning machinery. In a previous work, we compared models based on two different types of features: software metrics and term frequencies (text mining features). In this paper, we broaden the set of models we compare by investigating an array of techniques for the manipulation of said features. These techniques fall under the umbrella of dimensionality reduction and have the potential to improve the ability of a prediction model to localize vulnerabilities. We explore the role of dimensionality reduction through a series of cross-validation and cross-project prediction experiments. Our results show that in the case of software metrics, a dimensionality reduction technique based on confirmatory factor analysis provided an advantage when performing cross-project prediction, yielding the best F-measure for the predictions in five out of six cases. In the case of text mining, feature selection can make the prediction computationally faster, but no dimensionality reduction technique provided any other notable advantage.

Zhou, G., Huang, J. X..  2017.  Modeling and Learning Distributed Word Representation with Metadata for Question Retrieval. IEEE Transactions on Knowledge and Data Engineering. 29:1226–1239.

Community question answering (cQA) has become an important issue due to the popularity of cQA archives on the Web. This paper focuses on addressing the lexical gap problem in question retrieval. Question retrieval in cQA archives aims to find the existing questions that are semantically equivalent or relevant to the queried questions. However, the lexical gap problem brings a new challenge for question retrieval in cQA. In this paper, we propose to model and learn distributed word representations with metadata of category information within cQA pages for question retrieval using two novel category powered models. One is a basic category powered model called MB-NET and the other one is an enhanced category powered model called ME-NET which can better learn the distributed word representations and alleviate the lexical gap problem. To deal with the variable size of word representation vectors, we employ the framework of fisher kernel to transform them into the fixed-length vectors. Experimental results on large-scale English and Chinese cQA data sets show that our proposed approaches can significantly outperform state-of-the-art retrieval models for question retrieval in cQA. Moreover, we further conduct our approaches on large-scale automatic evaluation experiments. The evaluation results show that promising and significant performance improvements can be achieved.

Zhu, Ziyun, Dumitras, Tudor.  2016.  FeatureSmith: Automatically Engineering Features for Malware Detection by Mining the Security Literature. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. :767–778.

Malware detection increasingly relies on machine learning techniques, which utilize multiple features to separate the malware from the benign apps. The effectiveness of these techniques primarily depends on the manual feature engineering process, based on human knowledge and intuition. However, given the adversaries' efforts to evade detection and the growing volume of publications on malware behaviors, the feature engineering process likely draws from a fraction of the relevant knowledge. We propose an end-to-end approach for automatic feature engineering. We describe techniques for mining documents written in natural language (e.g. scientific papers) and for representing and querying the knowledge about malware in a way that mirrors the human feature engineering process. Specifically, we first identify abstract behaviors that are associated with malware, and then we map these behaviors to concrete features that can be tested experimentally. We implement these ideas in a system called FeatureSmith, which generates a feature set for detecting Android malware. We train a classifier using these features on a large data set of benign and malicious apps. This classifier achieves a 92.5% true positive rate with only 1% false positives, which is comparable to the performance of a state-of-the-art Android malware detector that relies on manually engineered features. In addition, FeatureSmith is able to suggest informative features that are absent from the manually engineered set and to link the features generated to abstract concepts that describe malware behaviors.

Hyun, Yoonjin, Kim, Namgyu.  2016.  Detecting Blog Spam Hashtags Using Topic Modeling. Proceedings of the 18th Annual International Conference on Electronic Commerce: E-Commerce in Smart Connected World. :43:1–43:6.

Tremendous amounts of data are generated daily. Accordingly, unstructured text data that is distributed through news, blogs, and social media has gained much attention from many researchers as this data contains abundant information about various consumers' opinions. However, as the usefulness of text data is increasing, attempts to gain profits by distorting text data maliciously or non-maliciously are also increasing. In this sense, various types of spam detection techniques have been studied to prevent the side effects of spamming. The most representative studies include e-mail spam detection, web spam detection, and opinion spam detection. "Spam" is recognized on the basis of three characteristics and actions: (1) if a certain user is recognized as a spammer, then all content created by that user should be recognized as spam; (2) if certain content is exposed to other users (regardless of the users' intention), then content is recognized as spam; and (3) any content that contains malicious or non-malicious false information is recognized as spam. Many studies have been performed to solve type (1) and type (2) spamming by analyzing various metadata, such as user networks and spam words. In the case of type (3), however, relatively few studies have been conducted because it is difficult to determine the veracity of a certain word or information. In this study, we regard a hashtag that is irrelevant to the content of a blog post as spam and devise a methodology to detect such spam hashtags.

Sheeba, J. I., Devaneyan, S. Pradeep.  2016.  Recommendation of Keywords Using Swarm Intelligence Techniques. Proceedings of the International Conference on Informatics and Analytics. :8:1–8:5.

Text mining has developed and emerged as an essential tool for revealing the hidden value in the data. Text mining is an emerging technique for companies around the world and suitable for large enduring analyses and discrete investigations. Since there is a need to track disrupting technologies, explore internal knowledge bases or review enormous data sets. Most of the information produced due to conversation transcripts is an unstructured format. These data have ambiguity, redundancy, duplications, typological errors and many more. The processing and analysis of these unstructured data are difficult task. But, there are several techniques in text mining are available to extract keywords from these unstructured conversation transcripts. Keyword Extraction is the process of examining the most significant word in the context which helps to take decisions in a much faster manner. The main objective of the proposed work is extracting the keywords from meeting transcripts by using the Swarm Intelligence (SI) techniques. Here Stochastic Diffusion Search (SDS) algorithm is used for keyword extraction and Firefly algorithm used for clustering. These techniques will be implemented for an extensive range of optimization problems and produced better results when compared with existing technique.

Ren, Xiang, El-Kishky, Ahmed, Ji, Heng, Han, Jiawei.  2016.  Automatic Entity Recognition and Typing in Massive Text Data. Proceedings of the 2016 International Conference on Management of Data. :2235–2239.

In today's computerized and information-based society, individuals are constantly presented with vast amounts of text data, ranging from news articles, scientific publications, product reviews, to a wide range of textual information from social media. To extract value from these large, multi-domain pools of text, it is of great importance to gain an understanding of entities and their relationships. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their fine-grained types (e.g., people, product and food) in a scalable way. Since these methods do not rely on annotated data, predefined typing schema or hand-crafted features, they can be quickly adapted to a new domain, genre and language. We demonstrate on real datasets including various genres (e.g., news articles, discussion forum posts, and tweets), domains (general vs. bio-medical domains) and languages (e.g., English, Chinese, Arabic, and even low-resource languages like Hausa and Yoruba) how these typed entities aid in knowledge discovery and management.

Koch, S., John, M., Worner, M., Muller, A., Ertl, T..  2014.  VarifocalReader #x2014; In-Depth Visual Analysis of Large Text Documents. Visualization and Computer Graphics, IEEE Transactions on. 20:1723-1732.

Interactive visualization provides valuable support for exploring, analyzing, and understanding textual documents. Certain tasks, however, require that insights derived from visual abstractions are verified by a human expert perusing the source text. So far, this problem is typically solved by offering overview-detail techniques, which present different views with different levels of abstractions. This often leads to problems with visual continuity. Focus-context techniques, on the other hand, succeed in accentuating interesting subsections of large text documents but are normally not suited for integrating visual abstractions. With VarifocalReader we present a technique that helps to solve some of these approaches' problems by combining characteristics from both. In particular, our method simplifies working with large and potentially complex text documents by simultaneously offering abstract representations of varying detail, based on the inherent structure of the document, and access to the text itself. In addition, VarifocalReader supports intra-document exploration through advanced navigation concepts and facilitates visual analysis tasks. The approach enables users to apply machine learning techniques and search mechanisms as well as to assess and adapt these techniques. This helps to extract entities, concepts and other artifacts from texts. In combination with the automatic generation of intermediate text levels through topic segmentation for thematic orientation, users can test hypotheses or develop interesting new research questions. To illustrate the advantages of our approach, we provide usage examples from literature studies.

Eun Hee Ko, Klabjan, D..  2014.  Semantic Properties of Customer Sentiment in Tweets. Advanced Information Networking and Applications Workshops (WAINA), 2014 28th International Conference on. :657-663.

An increasing number of people are using online social networking services (SNSs), and a significant amount of information related to experiences in consumption is shared in this new media form. Text mining is an emerging technique for mining useful information from the web. We aim at discovering in particular tweets semantic patterns in consumers' discussions on social media. Specifically, the purposes of this study are twofold: 1) finding similarity and dissimilarity between two sets of textual documents that include consumers' sentiment polarities, two forms of positive vs. negative opinions and 2) driving actual content from the textual data that has a semantic trend. The considered tweets include consumers' opinions on US retail companies (e.g., Amazon, Walmart). Cosine similarity and K-means clustering methods are used to achieve the former goal, and Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, is used for the latter purpose. This is the first study which discover semantic properties of textual data in consumption context beyond sentiment analysis. In addition to major findings, we apply LDA (Latent Dirichlet Allocations) to the same data and drew latent topics that represent consumers' positive opinions and negative opinions on social media.