
SoS Newsletter- Advanced Book Block



Text Analytics Techniques
(2014 Year in Review)


Text analytics refers to linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for intelligence, exploratory data analysis, research, or investigation. The research cited here focuses on mining large volumes of text to detect insider threats, intrusions, and malware. The works cited were published in 2014.


Dey, L.; Mahajan, D.; Gupta, H., "Obtaining Technology Insights from Large and Heterogeneous Document Collections," Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014 IEEE/WIC/ACM International Joint Conferences on, vol. 1, no., pp. 102, 109, 11-14 Aug. 2014. doi: 10.1109/WI-IAT.2014.22 Keeping up with rapid advances in research in various fields of Engineering and Technology is a challenging task. Decision makers, including academics, program managers, venture capital investors, industry leaders, and funding agencies, not only need to be abreast of the latest developments but must also be able to assess the effect of growth in certain areas on their core business. Though analyst agencies like Gartner and McKinsey provide such reports for some areas, thought leaders of all organisations still need to amass data from heterogeneous collections like research publications, analyst reports, patent applications, and competitor information to finalize their own strategies. Text mining and data analytics researchers have been looking at integrating statistics, text analytics, and information visualization to aid the process of retrieval and analytics. In this paper, we present our work on automated topical analysis and insight generation from large heterogeneous text collections of publications and patents. While most of the earlier work in this area provides search-based platforms, ours is an integrated platform for search and analysis. We present several methods and techniques that help in the analysis and better comprehension of search results, along with methods for generating insights about emerging and popular trends in research and the contextual differences between academic research and patenting profiles. We also present novel techniques for presenting topic evolution that help users understand how a particular area has evolved over time.

Keywords: data analysis; information retrieval; patents; text analysis; academic research; automated topical analysis; heterogeneous document collections; insight generation; large heterogeneous text collections; patenting profiles; publications; topic evolution; Context; Data mining; Data visualization; Hidden Markov models; Indexing; Market research; Patents; analyzing research trends; mining patent databases; mining publications (ID#:15-3757)



Heimerl, F.; Lohmann, S.; Lange, S.; Ertl, T., "Word Cloud Explorer: Text Analytics Based on Word Clouds," System Sciences (HICSS), 2014 47th Hawaii International Conference on, pp. 1833, 1842, 6-9 Jan. 2014. doi: 10.1109/HICSS.2014.231 Word clouds have emerged as a straightforward and visually appealing visualization method for text. They are used in various contexts as a means to provide an overview by distilling text down to those words that appear with highest frequency. Typically, this is done in a static way as pure text summarization. We think, however, that there is a larger potential to this simple yet powerful visualization paradigm in text analytics. In this work, we explore the usefulness of word clouds for general text analysis tasks. We developed a prototypical system called the Word Cloud Explorer that relies entirely on word clouds as a visualization method. It equips them with advanced natural language processing, sophisticated interaction techniques, and context information. We show how this approach can be effectively used to solve text analysis tasks and evaluate it in a qualitative user study.

Keywords: data visualisation; natural language processing; text analysis; context information; natural language processing; sophisticated interaction techniques; text analysis tasks; text analytics; text summarization; visualization method; visualization paradigm; word cloud explorer; word clouds; Context; Layout; Pragmatics; Tag clouds; Text analysis; User interfaces; Visualization; interaction; natural language processing; tag clouds; text analytics; visualization; word cloud explorer; word clouds (ID#:15-3758)
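For readers who want a concrete sense of the distillation step behind word clouds, the core idea is simply frequency counting followed by scaling counts to display sizes. The sketch below illustrates that idea only; it is not the Word Cloud Explorer implementation, and the stop-word list and point sizes are arbitrary choices.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "into"})

def word_frequencies(text, stopwords=STOPWORDS):
    """Count word occurrences, ignoring case and a small stop-word list."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

def font_sizes(freqs, min_pt=10, max_pt=40):
    """Linearly scale frequencies to font sizes for a word-cloud layout."""
    if not freqs:
        return {}
    lo, hi = min(freqs.values()), max(freqs.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts match
    return {w: min_pt + (f - lo) * (max_pt - min_pt) / span
            for w, f in freqs.items()}

freqs = word_frequencies("text analytics turns text into insight; analytics needs text")
sizes = font_sizes(freqs)
```

The most frequent word ("text") gets the maximum size and singleton words the minimum; a real system like the one described above layers NLP and interaction on top of this basic mapping.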


Mukkamala, R.R.; Hussain, A.; Vatrapu, R., "Towards a Set Theoretical Approach to Big Data Analytics," Big Data (BigData Congress), 2014 IEEE International Congress on, pp. 629, 636, June 27 2014-July 2 2014. doi: 10.1109/BigData.Congress.2014.96 Formal methods, models and tools for social big data analytics are largely limited to graph theoretical approaches such as social network analysis (SNA) informed by relational sociology. There are no other unified modeling approaches to social big data that integrate the conceptual, formal and software realms. In this paper, we first present and discuss a theory and conceptual model of social data. Second, we outline a formal model based on set theory and discuss the semantics of the formal model with a real-world social data example from Facebook. Third, we briefly present and discuss the Social Data Analytics Tool (SODATO) that realizes the conceptual model in software and provisions social data analysis based on the conceptual and formal models. Fourth and last, based on the formal model and sentiment analysis of text, we present a method for profiling of artifacts and actors and apply this technique to the data analysis of big social data collected from Facebook page of the fast fashion company, H&M.

Keywords: Big Data; data analysis; set theory; social networking (online); text analysis; Facebook; Facebook page; H&M; SODATO; conceptual model; fast fashion company; formal model; graph theoretical approach; relational sociology; set theoretical approach; social big data analytics; social data analytic tool; social network analysis; text sentiment analysis; Analytical models; Data models; Facebook; Mathematical model; Media; Tagging; Big Social Data; Computational Social Science; Data Science; Formal Methods; Social Data Analytics (ID#:15-3759)
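As a toy illustration of the set-theoretical framing (not SODATO itself, and with entirely hypothetical actor and artifact names), social data can be modeled as sets of actors and artifacts, with actions as a relation over them; queries then become set comprehensions:

```python
# Hypothetical social data: actors, artifacts (posts), and actions as triples.
actors = {"alice", "bob", "carol"}
artifacts = {"post1", "post2"}
actions = {
    ("alice", "likes", "post1"),
    ("bob", "comments", "post1"),
    ("carol", "likes", "post2"),
}

def engaged_with(artifact, actions):
    """The set of actors who performed any action on a given artifact."""
    return {actor for (actor, _, target) in actions if target == artifact}

def actions_by(actor, actions):
    """The set of (verb, artifact) pairs performed by a given actor."""
    return {(verb, target) for (a, verb, target) in actions if a == actor}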



Koyanagi, T.; Shinjo, Y., "A Fast and Compact Hybrid Memory Resident Datastore for Text Analytics with Autonomic Memory Allocation," Information and Communication Systems (ICICS), 2014 5th International Conference on, pp. 1, 7, 1-3 April 2014. doi: 10.1109/IACS.2014.6841955 This paper describes a high-performance and space-efficient memory-resident datastore for text analytics systems based on a hash table for fast access, a dynamic trie for staging, and a list of Level-Order Unary Degree Sequence (LOUDS) tries for compactness. We achieve efficient memory allocation and data placement by placing frequently accessed keys in the hash table and infrequently accessed keys in the LOUDS tries, without using conventional cache algorithms. Our algorithm also dynamically changes memory allocation sizes for these data structures according to the remaining available memory size. This technique yields 38.6% to 52.9% better throughput than a double array trie, a conventional fast and compact datastore.

Keywords: storage management; text analysis; tree data structures; LOUDS tries; autonomic memory allocation; data placement; data structures; double array trie; dynamic trie; hash table; high-performance memory-resident datastore; hybrid memory resident datastore; level-order unary degree sequence tries; space-efficient memory-resident datastore; text analytics; Buffer storage; Cows; SDRAM; Switches  (ID#: 15-3760)
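To make the two-tier idea concrete, here is a greatly simplified stand-in: a Python dict plays the fast hash-table tier, and a sorted list (searched with bisect) stands in for the compact LOUDS tries. The paper's staging trie, access-frequency policy, and autonomic memory sizing are not reproduced; placement here is by arrival order and a fixed capacity.

```python
import bisect

class HybridStore:
    """Toy two-tier key-value store: a hash table for hot keys and a
    sorted, compact list for cold keys (standing in for LOUDS tries)."""

    def __init__(self, hot_capacity=2):
        self.hot = {}                            # fast, memory-hungry tier
        self.cold_keys, self.cold_vals = [], []  # compact, slower tier
        self.hot_capacity = hot_capacity

    def put(self, key, value):
        if key in self.hot or len(self.hot) < self.hot_capacity:
            self.hot[key] = value
            return
        # Hot tier full: place in the compact sorted tier instead.
        i = bisect.bisect_left(self.cold_keys, key)
        if i < len(self.cold_keys) and self.cold_keys[i] == key:
            self.cold_vals[i] = value
        else:
            self.cold_keys.insert(i, key)
            self.cold_vals.insert(i, value)

    def get(self, key):
        if key in self.hot:
            return self.hot[key]
        i = bisect.bisect_left(self.cold_keys, key)
        if i < len(self.cold_keys) and self.cold_keys[i] == key:
            return self.cold_vals[i]
        return None
```

The design trade-off the paper exploits is visible even here: dict lookups are O(1) but memory-heavy, while the sorted tier is compact but pays O(log n) per lookup and O(n) per insert.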



Craig, P.; Roa Seiler, N.; Olvera Cervantes, A.D., "Animated Geo-temporal Clusters for Exploratory Search in Event Data Document Collections," Information Visualisation (IV), 2014 18th International Conference on,  pp.157,163, 16-18 July 2014. doi: 10.1109/IV.2014.69 This paper presents a novel visual analytics technique developed to support exploratory search tasks for event data document collections. The technique supports discovery and exploration by clustering results and overlaying cluster summaries onto coordinated timeline and map views. Users can also explore and interact with search results by selecting clusters to filter and re-cluster the data with animation used to smooth the transition between views. The technique demonstrates a number of advantages over alternative methods for displaying and exploring geo-referenced search results and spatio-temporal data. Firstly, cluster summaries can be presented in a manner that makes them easy to read and scan. Listing representative events from each cluster also helps the process of discovery by preserving the diversity of results. Also, clicking on visual representations of geo-temporal clusters provides a quick and intuitive way to navigate across space and time simultaneously. This removes the need to overload users with the display of too many event labels at any one time. The technique was evaluated with a group of nineteen users and compared with an equivalent text based exploratory search engine.

Keywords: computer animation; data visualisation; document handling; document image processing; information retrieval; pattern clustering; animated geo-temporal clusters; animation; coordinated timeline; equivalent text based exploratory search engine; event data document collections; geo-referenced search results; map views; spatio-temporal data; visual analytics technique; Data visualization; Electronic publishing; Encyclopedias; History; Internet; Navigation; human-computer information retrieval; information visualisation; visual analytics  (ID#: 15-3761)



Eun Hee Ko; Klabjan, D., "Semantic Properties of Customer Sentiment in Tweets," Advanced Information Networking and Applications Workshops (WAINA), 2014 28th International Conference on, pp. 657, 663, 13-16 May 2014. doi: 10.1109/WAINA.2014.151 An increasing number of people are using online social networking services (SNSs), and a significant amount of information related to experiences in consumption is shared in this new media form. Text mining is an emerging technique for mining useful information from the web. We aim at discovering semantic patterns in consumers' discussions on social media, in particular in tweets. Specifically, the purposes of this study are twofold: 1) finding the similarity and dissimilarity between two sets of textual documents that reflect consumers' sentiment polarities, i.e., positive vs. negative opinions, and 2) deriving actual content with a semantic trend from the textual data. The considered tweets include consumers' opinions on US retail companies (e.g., Amazon, Walmart). Cosine similarity and K-means clustering methods are used to achieve the former goal, and Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, is used for the latter purpose. This is the first study that discovers semantic properties of textual data in a consumption context beyond sentiment analysis. In addition to the major findings, we apply LDA to the same data and draw latent topics that represent consumers' positive and negative opinions on social media.

Keywords: consumer behaviour; data mining; pattern clustering; retail data processing; social networking (online); text analysis; K-means clustering methods; Twitter; US retail companies; consumer opinions; consumer sentiment polarities; cosine similarity; customer sentiment semantic properties; latent Dirichlet allocation; online social networking services; sentiment analysis; text mining; textual data semantic properties; textual documents; topic modeling algorithm; tweet semantic patterns; Business; Correlation; Data mining; Media; Semantics; Tagging; Vectors; text analytics; tweet analysis; document similarity; clustering; topic modeling; part-of-speech tagging (ID#: 15-3762)
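Cosine similarity, the first of the two methods named above, reduces each document to a term-frequency vector and measures the angle between vectors. A minimal bag-of-words sketch (illustrative, with no tokenization or weighting beyond raw counts):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two texts under a bag-of-words model."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Identical documents score 1.0 and documents with no shared vocabulary score 0.0; K-means then clusters documents in this vector space, and LDA goes further by explaining each document as a mixture of latent topics.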



Conglei Shi; Yingcai Wu; Shixia Liu; Hong Zhou; Huamin Qu, "LoyalTracker: Visualizing Loyalty Dynamics in Search Engines," Visualization and Computer Graphics, IEEE Transactions on, vol. 20, no. 12, pp. 1733, 1742, Dec. 2014. doi: 10.1109/TVCG.2014.2346912 The huge amount of user log data collected by search engine providers creates new opportunities to understand user loyalty and defection behavior at an unprecedented scale. However, it also poses a great challenge to analyze this behavior and glean insights from such complex, large-scale data. In this paper, we introduce LoyalTracker, a visual analytics system to track user loyalty and switching behavior towards multiple search engines from the vast amount of user log data. We propose a new interactive visualization technique (flow view) based on a flow metaphor, which conveys a proper visual summary of the dynamics of user loyalty of thousands of users over time. Two other visualization techniques, a density map and a word cloud, are integrated to enable analysts to gain further insights into the patterns identified by the flow view. Case studies and interviews with domain experts demonstrate the usefulness of our technique in understanding user loyalty and switching behavior in search engines.

Keywords: data analysis; data visualisation; human factors; search engines; text analysis; LoyalTracker; defection behavior; density map; flow metaphor; flow view; interactive visualization technique; loyalty dynamics visualization; search engine providers; switching behavior; user log data; user loyalty tracking; visual analytics system; word cloud; Behavioral science; Data visualization; Information analysis; Search engines; Search methods; Visual analytics; Time-series visualization; log data visualization; stacked graphs; text visualization  (ID#: 15-3763)



Babour, A.; Khan, J.I., "Tweet Sentiment Analytics with Context Sensitive Tone-Word Lexicon," Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014 IEEE/WIC/ACM International Joint Conferences on, vol. 1, pp. 392, 399, 11-14 Aug. 2014. doi: 10.1109/WI-IAT.2014.61 In this paper we propose a Twitter sentiment analytics technique that mines for opinion polarity about a given topic. Most current semantic sentiment analytics depend on polarity lexicons; however, many key tone words are frequently bipolar. We demonstrate a technique that accommodates the bipolarity of tone words through a context-sensitive tone lexicon learning mechanism, where the context is modeled by the semantic neighborhood of the main target. Performance analysis shows that the ability to contextualize tone-word polarity significantly improves accuracy.

Keywords: data mining; learning (artificial intelligence); natural language processing; social networking (online); text analysis; word processing; context sensitive tone lexicon learning mechanism; opinion polarity mining; tone word polarity; tweet sentiment analytics; twitter sentiment analytics; Accuracy; Cameras; Context; Dictionaries; Semantics; Sentiment analysis (ID#: 15-3764)
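The bipolarity problem is easy to illustrate: a tone word like "cheap" is praise when the context is price and criticism when the context is build quality. A minimal lookup sketch (the lexicon entries below are invented for illustration; the paper learns them from the semantic neighborhood of the target rather than hand-coding them):

```python
def polarity(word, context_words, contextual_lexicon, default_lexicon):
    """Look up a tone word's polarity, preferring context-specific entries.

    contextual_lexicon maps (tone_word, context_word) -> +1 or -1;
    default_lexicon maps tone_word -> +1 or -1 (fallback when no
    context entry matches)."""
    for c in context_words:
        if (word, c) in contextual_lexicon:
            return contextual_lexicon[(word, c)]
    return default_lexicon.get(word, 0)

# Hypothetical learned entries: "cheap" is positive about price,
# negative about build quality.
ctx = {("cheap", "price"): +1, ("cheap", "quality"): -1}
default = {"cheap": -1, "great": +1}
```

With a flat lexicon, "cheap" would score -1 in every tweet; the context-sensitive lookup flips it to +1 when the tweet is about price, which is exactly the contextualization the abstract credits for the accuracy gain.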



Vantigodi, S.; Babu, R.V., "Entropy Constrained Exemplar-Based Image Inpainting," Signal Processing and Communications (SPCOM), 2014 International Conference on, pp. 1, 5, 22-25 July 2014. doi: 10.1109/SPCOM.2014.6984013 Image inpainting is the process of filling in an unwanted region of an image marked by the user. It is used for restoring old paintings and photographs, removing red eyes from pictures, etc. In this paper, we propose an efficient inpainting algorithm that takes care of false edge propagation. We use the classical exemplar-based technique to compute the priority term for each patch. To ensure that the nearest-neighbor patch found by minimizing the L2 distance between patches also matches in edge content, we impose the additional constraint that the entropies of the patches be similar; the entropy of a patch acts as a good measure of its edge content. Additionally, we fill the image by considering overlapping patches to ensure smoothness in the output. We use the structural similarity index as the measure of similarity between the ground truth and the inpainted image. The results of the proposed approach on a number of real and synthetic images show the effectiveness of our algorithm in removing objects and thin scratches or text written on an image. It is also shown that the proposed approach is robust to the shape of the manually selected target. Our results compare favorably to those obtained by existing techniques.

Keywords: edge detection; entropy; image restoration; entropy constrained exemplar-based image inpainting; false edge propagation; old painting restoration; photograph restoration; structural similarity index; Entropy; Equations; Image color analysis; Image edge detection; Image reconstruction; Image restoration; PSNR  (ID#: 15-3765)
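The entropy constraint rests on a simple observation: a flat patch has a one-bin intensity histogram and zero entropy, while a patch containing an edge spreads mass across bins and scores higher. A minimal sketch of the measure (patches as flat lists of pixel intensities; the paper's patch search and priority computation are not shown):

```python
import math
from collections import Counter

def patch_entropy(patch):
    """Shannon entropy (in bits) of the intensity histogram of an image
    patch, given as a flat list of pixel intensities. Flat patches score
    near 0; patches with edges or texture score higher."""
    n = len(patch)
    counts = Counter(patch)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

In the approach described above, a candidate exemplar found by minimizing the L2 distance would additionally be required to satisfy something like |H(candidate) - H(target)| < tau, rejecting smooth patches as fills for edge regions and vice versa.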



Baughman, A.K.; Chuang, W.; Dixon, K.R.; Benz, Z.; Basilico, J., "DeepQA Jeopardy! Gamification: A Machine-Learning Perspective," Computational Intelligence and AI in Games, IEEE Transactions on, vol. 6, no. 1, pp. 55, 66, March 2014. doi: 10.1109/TCIAIG.2013.2285651 DeepQA is a large-scale natural language processing (NLP) question-and-answer system that responds across a breadth of structured and unstructured data, from hundreds of analytics that are combined with over 50 models trained through machine learning. After the 2011 historic milestone of defeating the two best human players in the Jeopardy! game show, the technology behind IBM Watson, DeepQA, is undergoing gamification into real-world business problems. Gamifying a business domain for Watson is a composite of functional, content, and training adaptation for nongame play. During domain gamification for medical, financial, government, or any other business, each system change affects the machine-learning process. As opposed to the original Watson Jeopardy!, whose class distribution of positive-to-negative labels is 1:100, in adaptation the computed training instances, question-and-answer pairs transformed into true-false labels, result in a very low positive-to-negative ratio of 1:100,000. Such initial extreme class imbalance during domain gamification poses a big challenge for the Watson machine-learning pipelines. The combination of ingested corpus sets, question-and-answer pairs, configuration settings, and NLP algorithms contributes to the challenging data state. We propose several data engineering techniques, such as answer key vetting and expansion, source ingestion, oversampling classes, and question set modifications, to increase the computed true labels. In addition, algorithm engineering, such as an implementation of Newton-Raphson logistic regression with a regularization term, relaxes the constraints of class imbalance during training adaptation. We conclude by empirically demonstrating that data and algorithm engineering are complementary and indispensable to overcoming the challenges in this first Watson gamification for real-world business problems.

Keywords: business data processing; computer games; learning (artificial intelligence); natural language processing; question answering (information retrieval); text analysis; DeepQA Jeopardy! gamification; NLP algorithms; NLP question-and-answer system; Newton-Raphson logistic regression; Watson gamification; Watson machine-learning pipelines; algorithm engineering; business domain; configuration settings; data engineering techniques; domain gamification; extreme class imbalance; ingested corpus sets; large-scale natural language processing question-and-answer system; machine-learning process; nongame play; positive-to-negative ratio; question-and-answer pairs; real-world business problems; regularization term; structured data; training instances; true-false labels; unstructured data; Accuracy; Games; Logistics; Machine learning algorithms; Pipelines; Training; Gamification; machine learning; natural language processing (NLP); pattern recognition (ID#: 15-3766)
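Of the data engineering techniques listed, oversampling classes is the easiest to illustrate: duplicate minority-class (true-label) instances until the positive-to-negative ratio reaches a target. The sketch below is a generic random-duplication oversampler, not the Watson pipeline's implementation, and the 1.0 target ratio is an arbitrary choice.

```python
import random

def oversample(instances, labels, target_ratio=1.0, seed=0):
    """Duplicate minority-class (True-labeled) instances until the
    positive:negative ratio reaches target_ratio."""
    rng = random.Random(seed)
    pos = [x for x, y in zip(instances, labels) if y]
    neg = [x for x, y in zip(instances, labels) if not y]
    need = int(len(neg) * target_ratio) - len(pos)
    extra = [rng.choice(pos) for _ in range(max(0, need))]
    new_instances = pos + extra + neg
    new_labels = [True] * (len(pos) + len(extra)) + [False] * len(neg)
    return new_instances, new_labels
```

At a 1:100,000 ratio, naive duplication alone would be impractical, which is why the abstract pairs it with answer-key expansion and a regularized Newton-Raphson logistic regression on the algorithm side.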



Zadeh, B.Q.; Handschuh, S., "Random Manhattan Indexing," Database and Expert Systems Applications (DEXA), 2014 25th International Workshop on, pp.203,208, 1-5 Sept. 2014. doi: 10.1109/DEXA.2014.51 Vector space models (VSMs) are mathematically well-defined frameworks that have been widely used in text processing. In these models, high-dimensional, often sparse vectors represent text units. In an application, the similarity of vectors -- and hence the text units that they represent -- is computed by a distance formula. The high dimensionality of vectors, however, is a barrier to the performance of methods that employ VSMs. Consequently, a dimensionality reduction technique is employed to alleviate this problem. This paper introduces a new method, called Random Manhattan Indexing (RMI), for the construction of L1 normed VSMs at reduced dimensionality. RMI combines the construction of a VSM and dimension reduction into an incremental, and thus scalable, procedure. In order to attain its goal, RMI employs the sparse Cauchy random projections.

Keywords: data reduction; indexing; text analysis; L1 normed VSM; RMI; dimensionality reduction technique; natural language text; random Manhattan indexing; sparse Cauchy random projections; vector space model; Computational modeling; Context; Equations; Indexing; Mathematical model; Vectors; Manhattan distance; dimensionality reduction; random projection; retrieval models; vector space model  (ID#: 15-3767)
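A rough sketch of the ingredients RMI combines: a sparse random projection matrix whose nonzero entries are standard Cauchy variates, and an L1-distance estimate taken as the median of absolute coordinate differences in the projected space. This is an illustrative outline of the underlying Cauchy random-projection idea, not the paper's incremental indexing procedure, and the sparsity level below is an arbitrary choice.

```python
import math
import random

def cauchy_projection_matrix(dim, k, sparsity=0.1, seed=0):
    """A k x dim projection matrix, mostly zero, whose nonzero entries
    are standard Cauchy variates (tan of a uniform angle)."""
    rng = random.Random(seed)
    return [[math.tan(math.pi * (rng.random() - 0.5))
             if rng.random() < sparsity else 0.0
             for _ in range(dim)] for _ in range(k)]

def project(matrix, vec):
    """Map a high-dimensional vector to k dimensions."""
    return [sum(r * v for r, v in zip(row, vec)) for row in matrix]

def l1_estimate(pa, pb):
    """Estimate the L1 distance of the original vectors from the median
    of absolute coordinate differences of their projections."""
    diffs = sorted(abs(a - b) for a, b in zip(pa, pb))
    m = len(diffs)
    return diffs[m // 2] if m % 2 else 0.5 * (diffs[m // 2 - 1] + diffs[m // 2])
```

The median (rather than the mean used in L2-style random projections) is what makes the estimator sensible here, since Cauchy-distributed sums have no finite mean.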



Koch, S.; John, M.; Wörner, M.; Müller, A.; Ertl, T., "VarifocalReader — In-Depth Visual Analysis of Large Text Documents," Visualization and Computer Graphics, IEEE Transactions on, vol. 20, no. 12, pp. 1723, 1732, Dec. 2014. doi: 10.1109/TVCG.2014.2346677 Interactive visualization provides valuable support for exploring, analyzing, and understanding textual documents. Certain tasks, however, require that insights derived from visual abstractions be verified by a human expert perusing the source text. So far, this problem has typically been solved by offering overview-detail techniques, which present different views at different levels of abstraction. This often leads to problems with visual continuity. Focus-context techniques, on the other hand, succeed in accentuating interesting subsections of large text documents but are normally not suited for integrating visual abstractions. With VarifocalReader we present a technique that helps to solve some of these approaches' problems by combining characteristics from both. In particular, our method simplifies working with large and potentially complex text documents by simultaneously offering abstract representations of varying detail, based on the inherent structure of the document, and access to the text itself. In addition, VarifocalReader supports intra-document exploration through advanced navigation concepts and facilitates visual analysis tasks. The approach enables users to apply machine learning techniques and search mechanisms, as well as to assess and adapt these techniques. This helps to extract entities, concepts, and other artifacts from texts. In combination with the automatic generation of intermediate text levels through topic segmentation for thematic orientation, users can test hypotheses or develop interesting new research questions. To illustrate the advantages of our approach, we provide usage examples from literature studies.

Keywords: data visualisation; learning (artificial intelligence); text analysis; document analysis; focus-context techniques; in-depth visual analysis; intermediate text levels; literary analysis; machine learning techniques; natural language processing; text documents; text mining; varifocalreader; visual abstraction; Data mining; Data visualization; Document handling; Interactive systems; Natural language processing; Navigation; Tag clouds; Text mining; distant reading; document analysis; literary analysis; machine learning; natural language processing; text mining; visual analytics (ID#: 15-3768)



Lomotey, R.K.; Deters, R., "Terms Mining in Document-Based NoSQL: Response to Unstructured Data," Big Data (BigData Congress), 2014 IEEE International Congress on, pp. 661, 668, June 27 2014-July 2 2014. doi: 10.1109/BigData.Congress.2014.99 Unstructured data mining has become topical recently due to the availability of high-dimensional and voluminous digital content (known as "Big Data") across the enterprise spectrum. Relational Database Management Systems (RDBMS) have been employed over the past decades for content storage and management, but the ever-growing heterogeneity of today's data calls for a new storage approach. Thus, the NoSQL database has emerged as the preferred storage facility, since it supports unstructured data storage. This creates the need to explore efficient data mining techniques for such NoSQL systems, since the available tools and frameworks designed for RDBMS are often not directly applicable. In this paper, we focus on topics and terms mining, based on clustering, in document-based NoSQL. This is achieved by adapting the architectural design of an analytics-as-a-service framework and proposing the Viterbi algorithm to enhance the accuracy of terms classification in the system. Results from pilot testing of our work show higher accuracy in comparison to some previously proposed techniques, such as parallel search.

Keywords: Big Data; data mining; database management systems; document handling; pattern classification; pattern clustering; text analysis; Big Data; NoSQL database; Viterbi algorithm; analytics-as-a-service framework; clustering; data mining techniques; document-based NoSQL; term classification; terms mining; topics mining; unstructured data storage; Big data; Classification algorithms; Data mining; Databases; Dictionaries; Semantics; Viterbi algorithm; Association Rules; Big Data; NoSQL; Terms; Unstructured Data Mining; Viterbi algorithm; classification; clustering (ID#: 15-3769)
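For background on the Viterbi algorithm the paper applies to terms classification, here is the standard textbook dynamic program for recovering the most likely hidden state sequence under an HMM. This is a generic implementation, not the authors' adaptation, and the TERM/OTHER states and all probabilities below are invented purely for illustration.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for an observation sequence
    under an HMM (classic Viterbi dynamic program). Each cell stores
    (best probability so far, best path so far)."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][ps][0] * trans_p[ps][s] * emit_p[s][o],
                 V[-2][ps][1] + [s])
                for ps in states)
            V[-1][s] = (prob, path)
    prob, path = max(V[-1].values())
    return path

# Hypothetical two-state model: is each word part of a domain term?
states = ("TERM", "OTHER")
start_p = {"TERM": 0.3, "OTHER": 0.7}
trans_p = {"TERM": {"TERM": 0.6, "OTHER": 0.4},
           "OTHER": {"TERM": 0.2, "OTHER": 0.8}}
emit_p = {"TERM": {"data": 0.5, "mining": 0.45, "the": 0.05},
          "OTHER": {"data": 0.1, "mining": 0.1, "the": 0.8}}
```

On the phrase "the data mining", this toy model labels the function word OTHER and the content words TERM, which is the flavor of sequence-aware classification the abstract credits for the accuracy gain over context-free matching.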



Articles listed on these pages have been found on publicly available internet pages and are cited with links to those pages. Some of the information included herein has been reprinted with permission from the authors or data repositories. Direct any requests for removal of the links or for modifications to specific citations via email, and please include the ID# of the specific citation in your correspondence.