SoS Newsletter- Advanced Book Block

Text Analytics

Text analytics refers to linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for intelligence, exploratory data analysis, research, or investigation. The research cited here focuses on mining large volumes of text to identify insider threats and intrusions and to detect malware.

  • Heimerl, F.; Lohmann, S.; Lange, S.; Ertl, T., "Word Cloud Explorer: Text Analytics Based on Word Clouds," System Sciences (HICSS), 2014 47th Hawaii International Conference on , vol., no., pp.1833,1842, 6-9 Jan. 2014. (ID#:14-1448) Available at: Word clouds have emerged as a straightforward and visually appealing visualization method for text. They are used in various contexts as a means to provide an overview by distilling text down to those words that appear with highest frequency. Typically, this is done in a static way as pure text summarization. We think, however, that there is a larger potential to this simple yet powerful visualization paradigm in text analytics. In this work, we explore the usefulness of word clouds for general text analysis tasks. We developed a prototypical system called the Word Cloud Explorer that relies entirely on word clouds as a visualization method. It equips them with advanced natural language processing, sophisticated interaction techniques, and context information. We show how this approach can be effectively used to solve text analysis tasks and evaluate it in a qualitative user study. Keywords: data visualization; natural language processing; text analysis; context information; natural language processing; sophisticated interaction techniques; text analysis tasks; text analytics; text summarization; visualization method; visualization paradigm; word cloud explorer; word clouds; Context; Layout; Pragmatics; Tag clouds; Text analysis; User interfaces; Visualization; interaction; natural language processing; tag clouds; text analytics; visualization; word cloud explorer; word clouds
  • Atasu, K.; Polig, R.; Hagleitner, C.; Reiss, F.R., "Hardware-accelerated regular expression matching for high-throughput text analytics," Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on , vol., no., pp.1,7, 2-4 Sept. 2013. (ID#:14-1449) Available at: Advanced text analytics systems combine regular expression (regex) matching, dictionary processing, and relational algebra for efficient information extraction from text documents. Such systems require support for advanced regex matching features, such as start offset reporting and capturing groups. However, existing regex matching architectures based on reconfigurable nondeterministic state machines and programmable deterministic state machines are not designed to support such features. We describe a novel architecture that supports such advanced features using a network of state machines. We also present a compiler that maps the regexs onto such networks that can be efficiently realized on reconfigurable logic. For each regex, our compiler produces a state machine description, statically computes the number of state machines needed, and produces an optimized interconnection network. Experiments on an Altera Stratix IV FPGA, using regexs from a real life text analytics benchmark, show that a throughput rate of 16 Gb/s can be reached. 
Keywords: field programmable gate arrays; finite state machines; knowledge acquisition; pattern matching; relational algebra; text analysis; Altera Stratix IV FPGA; bit rate 16 Gbit/s; capturing groups; compiler; dictionary processing; hardware-accelerated regular expression matching; high-throughput text analytics; information extraction; optimized interconnection network; programmable deterministic state machines; reconfigurable logic; reconfigurable nondeterministic state machines; regex matching architectures; relational algebra; start offset reporting; text documents; Delays; Dictionaries; Doped fiber amplifiers; Multiprocessor interconnection; Registers; Semantics
  • Polig, R.; Atasu, K.; Hagleitner, C., "Token-based dictionary pattern matching for text analytics," Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on, vol., no., pp.1,6, 2-4 Sept. 2013. (ID#:14-1450) Available at: When performing queries for text analytics on unstructured text data, a large amount of the processing time is spent on regular expressions and dictionary matching. In this paper we present a compilable architecture for token-bound pattern matching with support for token pattern sequence detection. The architecture presented is capable of detecting several hundreds of dictionaries, each containing thousands of elements at high throughput. A programmable state machine is used as pattern detection engine to achieve deterministic performance while maintaining low storage requirements. For the detection of token sequences, a dedicated circuitry is compiled based on a non-deterministic automaton. A cascaded result lookup ensures efficient storage while allowing multi-token elements to be detected and multiple dictionary hits to be reported. We implemented on an Altera Stratix IV GX530, and were able to process up to 16 documents in parallel at a peak throughput rate of 9.7 Gb/s. Keywords: dictionaries; finite state machines; pattern matching; query processing; text analysis; Altera Stratix IV GX530; cascaded result lookup; compilable architecture; dedicated circuitry; deterministic performance; dictionary detection; dictionary matching; multitoken elements; nondeterministic automaton; pattern detection engine; programmable state machine; text analytics querying; token pattern sequence detection; token sequence detection; token-based dictionary pattern matching; unstructured text data; Automata; Computer architecture; Dictionaries; Doped fiber amplifiers; Engines; Pattern matching; Throughput
  • Dey, L.; Verma, I., "Text-Driven Multi-structured Data Analytics for Enterprise Intelligence," Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013 IEEE/WIC/ACM International Joint Conferences on , vol.3, no., pp.213,220, 17-20 Nov. 2013. (ID#:14-1451) Available at: Text data constitutes the bulk of all enterprise data. Text repositories are not only tacit store-houses of knowledge about its people, projects and processes but also contain invaluable information about its customers, competitors, suppliers, partners and all other stakeholders. Mining this data can provide interesting and valuable insights provided it is appropriately integrated with other enterprise data. In this paper we propose a framework for text-driven analysis of multi-structured data. Keywords: business data processing; competitive intelligence; data analysis; data mining; text analysis; data mining; enterprise data; enterprise intelligence; text data; text driven analysis; text driven multistructured data analytics; text repositories; Business; Context; Media; Natural language processing; Semantics; Text mining; Information Fusion; Text Analytics
  • Agarwal, K.; Polig, R., "A high-speed and large-scale dictionary matching engine for Information Extraction systems," Application-Specific Systems, Architectures and Processors (ASAP), 2013 IEEE 24th International Conference on , vol., no., pp.59,66, 5-7 June 2013. (ID#:14-1452) Available at: Dictionary matching is a commonly used operation in Information Extraction (IE) systems. It involves matching a set of strings in a document against a dictionary of pre-defined patterns. In this paper, we describe a high performance and scalable hardware architecture to enable high throughput dictionary matching on very large dictionaries for text analytics applications. Our hardware accelerator employs a novel hashing based approach instead of commonly used deterministic finite automata (DFA) based algorithms. A limitation of the DFA based approaches is that they typically process one character every cycle, while the proposed hash based scheme can process a string token every cycle, thus achieving significantly higher processing throughput than the DFA based implementations. Our measurement results based on a prototype implementation on an Altera Stratix IV FPGA device indicate that our hardware dictionary matching engine can process typical document streams at a processing rate of ~1.5GB/s (~12 Gbps) while simultaneously allowing support for large dictionary sizes containing up to ~100K patterns, thus making it very useful for IE workload acceleration. 
Keywords: dictionaries; field programmable gate arrays; file organization; information retrieval systems; string matching; text analysis; Altera Stratix IV FPGA device; DFA based algorithms; IE systems; IE workload acceleration; deterministic finite automata based algorithms; hardware accelerator; hardware dictionary matching engine; hashing based approach; high throughput dictionary matching; high-speed dictionary matching engine; information extraction system; large-scale dictionary matching engine; scalable hardware architecture; string matching; string token; text analytics applications; Arrays; Dictionaries; Field programmable gate arrays; Hardware; Pattern matching; Random access memory; Throughput; FPGA; dictionary matching; hardware acceleration; hashing; information extraction; pattern matching; string matching; text analytics
  • Clemons, T.; Faisal, S.M.; Tatikonda, S.; Aggarwal, C.; Parthasarathy, S., "Hash in a flash: Hash tables for flash devices," Big Data, 2013 IEEE International Conference on , vol., no., pp.7,14, 6-9 Oct. 2013. (ID#:14-1453) Available at: Conservative estimates place the amount of data expected to be created by mankind this year to exceed several thousand exabytes. Given the enormous data deluge, and in spite of recent advances in main memory capacities, there is a clear and present need to move beyond algorithms that assume in-core (main-memory) computation. One fundamental task in Information Retrieval and text analytics requires the maintenance of local and global term frequencies from within large enterprise document corpora. This can be done with a counting hash-table; they associate keys to frequencies. In this paper, we will study the design landscape for the development of such an out-of-core counting hash table targeted at flash storage devices. Flash devices have clear benefits over traditional hard drives in terms of latency of access and energy efficiency. However, due to intricacies in their design, random writes can be relatively expensive and can degrade the life of the flash device. Counting hash tables are a challenging case for the flash drive because this data structure is inherently dependent upon the randomness of the hash function; frequency updates are random and may incur expensive random writes. We demonstrate how to overcome this challenge by designing a hash table with two related hash functions, one of which exhibits a data placement property with respect to the other. Specifically, we focus on three designs and evaluate the trade-offs among them along the axes of query performance, insert and update times, and I/O time using real-world data and an implementation of TF-IDF.
Keywords: data structures; flash memories; TF-IDF; data deluge; data placement property; data structure; energy efficiency; enterprise document corpora; flash storage devices; global term frequencies maintenance; in-core main-memory computation; information retrieval; local term frequencies maintenance; memory capacities; out-of-core counting hash table; query performance; text analytics; Ash; Context; Encyclopedias; Internet; Performance evaluation; Random access memory
  • Zhang, Yan; Ma, Hongtao; Xu, Yunfeng, "An Intelligence Gathering System for Business Based on Cloud Computing," Computational Intelligence and Design (ISCID), 2013 Sixth International Symposium on , vol.1, no., pp.201,204, 28-29 Oct. 2013. (ID#:14-1454) Available at: With the continued exponential growth in both complexity and volume of unstructured internet data, and enterprises become more automated, data driven and real-time, traditional business intelligence and analytics system meet new challenges. As with the Cloud Computing development, some parallel data analysis systems have been emerging. However, existing systems rarely have comprehensive function, either providing gathering service or data analysis service. Our project needs a comprehensive tool to store and analysis large scale data efficiently. In response to these challenges, a business intelligence gathering system based on Cloud computing is proposed. It supports parallel ETL process, text mining which are based on Hadoop. The demo achieves Chinese Word Segmentation, Bayesian classification algorithm and K-means algorithm in the MapReduce architecture to form the omni bearing and three-dimensional intelligence noumenon for enterprises. It can meet the needs on timeliness and pertinence of the information, or even can achieve real-time intelligence gathering and analytics. Keywords: MapReduce; classification; clustering; hadoop; intelligence gathering
  • Logasa Bogen, P.; Symons, C.T.; McKenzie, A.; Patton, R.M.; Gillen, R.E., "Massively scalable near duplicate detection in streams of documents using MDSH," Big Data, 2013 IEEE International Conference on , vol., no., pp.480,486, 6-9 Oct. 2013. (ID#:14-1455) Available at: In a world where large-scale text collections are not only becoming ubiquitous but also are growing at increasing rates, near duplicate documents are becoming a growing concern that has the potential to hinder many different information filtering tasks. While others have tried to address this problem, prior techniques have only been used on limited collection sizes and static cases. We will briefly describe the problem in the context of Open Source analysis along with our additional constraints for performance. In this work we propose two variations on Multi-dimensional Spectral Hash (MDSH) tailored for working on extremely large, growing sets of text documents. We analyze the memory and runtime characteristics of our techniques and provide an informal analysis of the quality of the near-duplicate clusters produced by our techniques. Keywords: file organization; information filtering; public domain software; text analysis; MDSH; document stream; information filtering task; large-scale text collections; memory characteristics; multidimensional spectral hash; near duplicate detection; near duplicate documents; near-duplicate clusters; open source analysis; quality informal analysis; runtime characteristics; text documents; Electronic publishing; Encyclopedias; Internet; Memory management; Random access memory; Runtime; Big Data; MDSH; Near Duplicate Detection; Open Source Intelligence; Streaming Text
  • Hung Son Nguyen, "Tolerance Rough Set Model and Its Applications in Web Intelligence," Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013 IEEE/WIC/ACM International Joint Conferences on , vol.3, no., pp.237,244, 17-20 Nov. 2013. (ID#:14-1456) Available at: Tolerance Rough Set Model (TRSM) has been introduced as a tool for approximation of hidden concepts in text databases. In recent years, numerous successful applications of TRSM in web intelligence including text classification, clustering, thesaurus generation, semantic indexing, and semantic search, etc., have been proposed. This paper will review the fundamental concepts of TRSM, some of its possible extensions and some typical applications of TRSM in text mining. Moreover, the architecture of a semantic information retrieval system, called SONCA, will be presented to demonstrate the main idea as well as stimulate the further research on TRSM. Keywords: data mining; information retrieval systems; ontologies (artificial intelligence); rough set theory; text analysis; SONCA system; TRSM; Web intelligence; clustering; search based on ontologies and compound analytics; semantic indexing; semantic information retrieval system; semantic search; text classification; text databases; text mining; thesaurus generation; tolerance rough set model; Approximation methods; Indexes; Information retrieval; Ontologies; Semantics; Standards; Vectors; Tolerance rough set model; classification; clustering; semantic indexing; semantic search
  • Sundarkumar, G.G.; Ravi, V., "Malware detection by text and data mining," Computational Intelligence and Computing Research (ICCIC), 2013 IEEE International Conference on , vol., no., pp.1,6, 26-28 Dec. 2013. (ID#:14-1457) Available at: Cyber frauds are a major security threat to the banking industry worldwide. Malware is one of the manifestations of cyber frauds. Malware authors use Application Programming Interface (API) calls to perpetrate these crimes. In this paper, we propose a static analysis method to detect Malware based on API call sequences using text and data mining in tandem. We analyzed the dataset available at CSMINING group. First, we employed text mining to extract features from the dataset consisting a series of API calls. Further, mutual information is invoked for feature selection. Then, we resorted to over-sampling to balance the data set. Finally, we employed various data mining techniques such as Decision Tree (DT), Multi Layer Perceptron (MLP), Support Vector Machine (SVM), Probabilistic Neural Network (PNN) and Group Method for Data Handling (GMDH). We also applied One Class SVM (OCSVM). Throughout the paper, we used 10-fold cross validation technique for testing the techniques. We observed that SVM and OCSVM achieved 100% sensitivity after balancing the dataset. 
Keywords: application program interfaces; data mining; decision trees; feature extraction; invasive software; neural nets; support vector machines; text analysis; API call sequences; DT; GMDH; MLP; Malware authors; OCSVM; PNN; SVM; application programming interface; cyber frauds; data mining; decision tree; feature extraction; feature selection; group method for data handling; malware detection; multi layer perceptron; one class SVM; probabilistic neural network; security threat; static analysis method; support vector machine; text mining; Accuracy; Feature extraction; Malware; Mutual information; Support vector machines; Text mining; Application Programming Interface calls; Data Mining; Mutual Information; Over Sampling; Text Mining
  • V.S. Subrahmanian, Handbook of Computational Approaches to Counterterrorism, Springer Publishing Company, 2013. (ID#:14-1458) Citation available at: This article invites individuals focused on counter-terrorism in research, academia, and industry to consider the advances in understanding terrorist groups that information technology has allowed. The particular focus of this article is the use of text analytics to anticipate terror group behavior, understand terror networks, and create defensive policies. This work explores the role of mathematics and modern computing as significant contributors to the study of terrorist organizations and groups.
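
Several of the entries above (Polig et al.; Agarwal and Polig) describe dictionary matching that works one token at a time, instead of one character at a time, and uses hashing for the lookup. The sketch below illustrates that general idea in software only. It is not the FPGA architecture of either paper: whitespace tokenization, the Python dict used as the hash structure, and the names build_dictionary and match_dictionary are all assumptions made for illustration.

```python
from collections import defaultdict


def build_dictionary(patterns):
    """Index multi-token patterns by their first token (hypothetical helper)."""
    index = defaultdict(list)
    for pattern in patterns:
        tokens = tuple(pattern.lower().split())  # assumption: whitespace tokens
        if tokens:
            index[tokens[0]].append(tokens)
    return index


def match_dictionary(text, index):
    """Return (token_offset, matched_pattern) pairs found in text."""
    tokens = text.lower().split()
    hits = []
    for i, token in enumerate(tokens):
        # One hash lookup per token, not one state transition per character.
        for candidate in index.get(token, ()):
            if tuple(tokens[i:i + len(candidate)]) == candidate:
                hits.append((i, " ".join(candidate)))
    return hits


index = build_dictionary(["stratix iv", "text analytics", "hash table"])
hits = match_dictionary("Text analytics on a Stratix IV board", index)
print(hits)
```

Indexing on the first token keeps the per-token work to a single lookup in the common case where nothing matches; multi-token entries are only compared when their first token is seen.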


Articles listed on these pages have been found on publicly available internet pages and are cited with links to those pages. Some of the information included herein has been reprinted with permission from the authors or data repositories. Direct any requests via Email to SoS.Project (at) for removal of the links or modifications to specific citations. Please include the ID# of the specific citation in your correspondence.