
SoS Newsletter - Advanced Book Block

Natural Language Processing

Natural Language Processing research focuses on developing efficient algorithms to process texts and to make their information accessible to computer applications. Texts can contain information at different levels of complexity, ranging from simple word- or token-based representations, to rich hierarchical syntactic representations, to high-level logical representations across document collections. The research cited here was presented between January and August of 2014. Specific languages addressed include Turkish, Hindi, Bangla, and Farsi, as well as English.

  • Cambria, E.; White, B., "Jumping NLP Curves: A Review of Natural Language Processing Research [Review Article]," Computational Intelligence Magazine, IEEE, vol. 9, no. 2, pp. 48-57, May 2014. doi: 10.1109/MCI.2014.2307227 Natural language processing (NLP) is a theory-motivated range of computational techniques for the automatic analysis and representation of human language. NLP research has evolved from the era of punch cards and batch processing (in which the analysis of a sentence could take up to 7 minutes) to the era of Google and the like (in which millions of webpages can be processed in less than a second). This review paper draws on recent developments in NLP research to look at the past, present, and future of NLP technology in a new light. Borrowing the paradigm of 'jumping curves' from the field of business management and marketing prediction, this survey article reinterprets the evolution of NLP research as the intersection of three overlapping curves, namely the Syntactics, Semantics, and Pragmatics curves, which will eventually lead NLP research to evolve into natural language understanding.
    Keywords: Internet; natural language processing; search engines; Google; NLP research evolution; NLP technology; Webpages; automatic human language analysis; automatic human language representation; batch processing; business management; computational techniques; jumping NLP curves; marketing prediction; natural language processing research; natural language understanding; pragmatics curve; punch cards; semantics curve; syntactics curve (ID#:14-2976)
  • Estiri, A.; Kahani, M.; Ghaemi, H.; Abasi, M., "Improvement of an Abstractive Summarization Evaluation Tool Using Lexical-Semantic Relations and Weighted Syntax Tags in Farsi Language," Intelligent Systems (ICIS), 2014 Iranian Conference on, pp. 1-6, 4-6 Feb. 2014. doi: 10.1109/IranianCIS.2014.6802594 In recent years, the sharp increase in the amount of published web content and the need to store, classify, retrieve, and process it have intensified the importance of natural language processing and related tools such as automatic summarizers and machine translators. In this paper, a novel approach for evaluating automatic abstractive summarization systems is proposed, which can also be used in other natural language processing and information retrieval applications. By comparing auto-abstracts (abstracts created by machine) with human abstracts (ideal abstracts created by humans), the metrics introduced in the proposed tool automatically measure the quality of auto-abstracts. Evidently, abstractive summaries cannot be compared semantically on word appearance alone, so a lexical database such as WordNet is necessary. The authors use FerdowsNet, a comparable lexical database for the Farsi language, together with a Farsi parser to identify the groups forming sentences, which notably improves the evaluation results. The tool has been assessed by linguistic experts.
    Keywords: database management systems; information retrieval; language translation; natural language processing; Farsi language; Web elements; WordNet; abstractive summaries; abstractive summarization evaluation tool; automatic abstractive summarization system; human abstracts; information retrieval applications; lexical database; lexical semantic relations; linguistic experts; machine translators; natural language processing; weighted syntax tags; Abstracts; Databases; Equations; Measurement; Natural language processing; Semantics; Standards; Automatic Abstractive Summarizer; Evaluation; Farsi Natural Language Processing (NLP); Parse tree; Semantics; Sentences groups; parser (ID#:14-2977)
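The word-overlap comparison the authors start from can be sketched as a ROUGE-1-style unigram precision/recall measure. The function and the toy abstracts below are illustrative, not the paper's actual metric, which additionally weights syntax tags and uses FerdowsNet relations; the sketch shows exactly the surface-matching limitation the paper sets out to fix.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Unigram overlap between an auto-abstract and a human abstract.

    Returns (precision, recall, f1). Matches on surface words only,
    which is the limitation the paper addresses with a lexical database.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = rouge1("the system summarizes news text",
                  "the system produces summaries of news text")
```

Note that "summarizes" and "summaries" score zero here despite being semantically close; a WordNet-style lexical database is what lets an evaluator credit such pairs.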
  • Mills, M.T.; Bourbakis, N.G., "Graph-Based Methods for Natural Language Processing and Understanding--A Survey and Analysis," Systems, Man, and Cybernetics: Systems, IEEE Transactions on, vol. 44, no. 1, pp. 59-71, Jan. 2014. doi: 10.1109/TSMCC.2012.2227472 This survey and analysis presents the functional components, performance, and maturity of graph-based methods for natural language processing and natural language understanding and their potential for mature products. Resulting capabilities from the methods surveyed include summarization, text entailment, redundancy reduction, similarity measure, word sense induction and disambiguation, semantic relatedness, labeling (e.g., word sense), and novelty detection. Estimated scores for accuracy, coverage, scalability, and performance are derived from each method. This survey and analysis, with tables and bar graphs, offers a unique abstraction of functional components and levels of maturity from this collection of graph-based methodologies.
    Keywords: graph theory; natural language processing; bar graphs; functional components; graph-based methodologies; graph-based methods; labeling; mature products; maturity level; natural language processing; natural language understanding; novelty detection; redundancy reduction; scores estimation; semantic relatedness; similarity measure; summarization; tables; text entailment; word disambiguation; word sense induction; Accuracy; Clustering algorithms; Context; Natural language processing; Semantics; Signal processing algorithms; Syntactics; Graph methods; natural language processing (NLP); natural language understanding (NLU) (ID#:14-2978)
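Many of the surveyed graph methods share a common core: build a graph over words or sentences and run a centrality computation on it. The sketch below is a hypothetical TextRank-style power iteration over a word co-occurrence graph; none of the code or data comes from the surveyed papers, and real systems add edge weights and convergence checks.

```python
from collections import defaultdict
from itertools import combinations

def textrank_words(sentences, d=0.85, iters=50):
    """Rank words by a PageRank-style score over a co-occurrence graph.

    Words co-occurring in the same sentence are linked; repeated
    score updates converge toward a stationary distribution.
    """
    neighbors = defaultdict(set)
    for sent in sentences:
        words = set(sent.lower().split())
        for a, b in combinations(words, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    nodes = list(neighbors)
    score = {w: 1.0 / len(nodes) for w in nodes}
    for _ in range(iters):
        new = {}
        for w in nodes:
            new[w] = (1 - d) / len(nodes) + d * sum(
                score[u] / len(neighbors[u]) for u in neighbors[w])
        score = new
    return sorted(score, key=score.get, reverse=True)

ranking = textrank_words([
    "graph methods rank words",
    "graph methods score sentences",
    "words and sentences form a graph",
])
```

The most connected word rises to the top of the ranking, which is the mechanism behind the keyword-extraction and summarization capabilities the survey catalogs.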
  • Kandasamy, K.; Koroth, P., "An Integrated Approach to Spam Classification on Twitter Using URL Analysis, Natural Language Processing and Machine Learning Techniques," Electrical, Electronics and Computer Science (SCEECS), 2014 IEEE Students' Conference on, pp. 1-5, 1-2 March 2014. doi: 10.1109/SCEECS.2014.6804508 People today are heavily engaged with social networks, which makes it very easy to spread spam content through them; the details of almost any person are readily accessible through these sites, and no one inside social media is safe from such abuse. In this paper we propose an application that takes an integrated approach to spam classification on Twitter, combining URL analysis, natural language processing, and supervised machine learning. In short, this is a three-step process.
    Keywords: classification; learning (artificial intelligence);natural language processing; social networking (online);unsolicited e-mail; Twitter; URL analysis; natural language processing; social media; social networks; spam classification; spam contents; supervised machine learning techniques; Accuracy; Machine learning algorithms; Natural language processing; Training; Twitter; Unsolicited electronic mail; URLs; machine learning; natural language processing; tweets (ID#:14-2979)
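The three-step structure of the approach can be sketched as below. Everything here is illustrative: the suspicious-domain list, the spam lexicon, and the scoring rule are invented stand-ins, and the third step in the actual paper is a trained supervised classifier rather than a fixed threshold.

```python
import re

SUSPICIOUS_DOMAINS = {"bit.ly", "tinyurl.com"}     # hypothetical blocklist
SPAM_WORDS = {"free", "winner", "click", "prize"}  # hypothetical lexicon

def extract_urls(tweet):
    # Step 1: URL analysis -- pull out link hosts and flag known shorteners.
    hosts = re.findall(r"https?://([\w.-]+)", tweet)
    return hosts, any(h in SUSPICIOUS_DOMAINS for h in hosts)

def text_features(tweet):
    # Step 2: NLP -- tokenize and count spam-lexicon hits.
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return sum(t in SPAM_WORDS for t in tokens)

def classify(tweet, threshold=2):
    # Step 3: stand-in for the supervised classifier -- a weighted
    # score over the features produced by steps 1 and 2.
    _, bad_url = extract_urls(tweet)
    score = 2 * bad_url + text_features(tweet)
    return "spam" if score >= threshold else "ham"

label = classify("FREE prize! click http://bit.ly/x now")
```

In the real pipeline the hand-set weights would be replaced by parameters learned from labeled tweets.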
  • Vincze, V.; Farkas, R., "De-identification in Natural Language Processing," Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention on, pp. 1300-1303, 26-30 May 2014. doi: 10.1109/MIPRO.2014.6859768 Natural language processing (NLP) systems usually require a huge amount of textual data but the publication of such datasets is often hindered by privacy and data protection issues. Here, we discuss the questions of de-identification related to three NLP areas, namely, clinical NLP, NLP for social media and information extraction from resumes. We also illustrate how de-identification is related to named entity recognition and we argue that de-identification tools can be successfully built on named entity recognizers.
    Keywords: data privacy; natural language processing; NLP areas; NLP systems; data protection; information extraction; natural language processing; privacy protection; social media; textual data; Databases; Educational institutions; Electronic mail; Informatics; Information retrieval; Media; Natural language processing (ID#:14-2980)
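The link the authors draw between de-identification and named entity recognition can be illustrated with a minimal sketch: recognize entity spans, then replace each with a category placeholder. The regex patterns below are invented rule-based stand-ins; the paper's point is that a trained named entity recognizer fills this role far more robustly.

```python
import re

# Hypothetical rule-based "recognizers"; a real system would use a
# trained NER model, as the paper argues.
PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("PERSON", re.compile(r"\b(?:Dr|Mr|Ms)\.\s+[A-Z][a-z]+\b")),
]

def deidentify(text):
    """Replace recognized entities with [CATEGORY] placeholders."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text

clean = deidentify("Dr. Smith (555-123-4567, smith@example.org) was seen today.")
```

The placeholder categories preserve the structure NLP systems need while stripping the identifying surface forms that block data release.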
  • Leopold, H.; Mendling, J.; Polyvyanyy, A., "Supporting Process Model Validation through Natural Language Generation," Software Engineering, IEEE Transactions on, vol. 40, no. 8, pp. 818-840, Aug. 1 2014. doi: 10.1109/TSE.2014.2327044 The design and development of process-aware information systems is often supported by specifying requirements as business process models. Although this approach is generally accepted as an effective strategy, it remains a fundamental challenge to adequately validate these models given the diverging skill set of domain experts and system analysts. As domain experts often do not feel confident in judging the correctness and completeness of process models that system analysts create, the validation often has to regress to a discourse using natural language. In order to support such a discourse appropriately, so-called verbalization techniques have been defined for different types of conceptual models. However, there is currently no sophisticated technique available that is capable of generating natural-looking text from process models. In this paper, we address this research gap and propose a technique for generating natural language texts from business process models. A comparison with manually created process descriptions demonstrates that the generated texts are superior in terms of completeness, structure, and linguistic complexity. An evaluation with users further demonstrates that the texts are very understandable and effectively allow the reader to infer the process model semantics. Hence, the generated texts represent a useful input for process model validation.
    Keywords: information systems; natural language processing; business process models completeness complexity; linguistic complexity; natural language generation; natural language text generation; natural-looking text generation; process model completeness; process model correctness; process model validation; process-aware information systems; structure complexity; verbalization techniques; Adaptation models; Analytical models; Business; Context; Context modeling; Natural languages; Unified modeling language; Business process model validation; natural language text generation; verbalization (ID#:14-2981)
  • Khanaferov, David; Luc, Christopher; Wang, Taehyung, "Social Network Data Mining Using Natural Language Processing and Density Based Clustering," Semantic Computing (ICSC), 2014 IEEE International Conference on, pp. 250-251, 16-18 June 2014. doi: 10.1109/ICSC.2014.48 There is a growing need to make sense of all the raw data available on the Internet, hence, the purpose of this study is to explore the capabilities of data mining algorithms applied to social networks. We propose a system to mine public Twitter data for information relevant to obesity and health as an initial case study. This paper details the findings of our project and critiques the use of social networks for data mining purposes.
    Keywords: Cleaning; Clustering algorithms ;Data mining; Natural language processing; Semantics; Twitter; NLP; clustering; data mining; sentiment analysis; social network (ID#:14-2982)
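Density-based clustering of the kind referenced in the title can be sketched with a compact DBSCAN implementation. The 2-D points and the eps/min_pts values below are toy choices for illustration, not the study's actual tweet features or parameters.

```python
import math

def dbscan(points, eps=1.0, min_pts=3):
    """Minimal DBSCAN: returns one cluster label per point (-1 = noise)."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster          # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:           # core point: keep expanding
                queue.extend(jn)
    return labels

pts = [(0, 0), (0.5, 0), (0, 0.5), (10, 10), (10.5, 10), (10, 10.5), (50, 50)]
labels = dbscan(pts, eps=1.0, min_pts=3)
```

Unlike k-means, no cluster count is supplied in advance, and the isolated point is labeled noise rather than forced into a cluster, which suits the uneven density of social media data.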
  • Ozturk, S.; Sankur, B.; Gungor, T.; Yilmaz, M.B.; Koroglu, B.; Agin, O.; Isbilen, M.; Ulas, C.; Ahat, M., "Turkish Labeled Text Corpus," Signal Processing and Communications Applications Conference (SIU), 2014 22nd, pp. 1395-1398, 23-25 April 2014. doi: 10.1109/SIU.2014.6830499 A labeled text corpus made up of the titles, abstracts, and keywords of Turkish papers is collected. The corpus covers 35 different disciplines, with 200 documents per discipline. This study presents the collection and content of the text corpus. The classification performance of Term Frequency-Inverse Document Frequency (TF-IDF) features and of topic probabilities from Latent Dirichlet Allocation (LDA) is compared on the corpus. The text corpus is shared as open source so that it can be used for academic natural language processing applications.
    Keywords: natural language processing; pattern classification; probability; text analysis; LDA features; TF-IDF; Turkish labeled text corpus; Turkish paper abstracts; Turkish paper keywords; Turkish paper titles; academic purposes; classification performance; latent Dirichlet allocation features; natural language processing applications; term frequency-inverse document frequency; text corpus collection; text corpus content; topic probabilities; Abstracts; Conferences; Natural language processing; Resource management; Signal processing; Support vector machines; XML; Classification; Corpus; Inverse Document Frequency; Latent Dirichlet Allocation; NLP; Natural Language Processing; Paper; TF-IDF; Term Frequency ; Turkish (ID#:14-2983)
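The TF-IDF features compared in the study can be sketched as follows. The three miniature documents stand in for the corpus, and the exact weighting variant (raw term counts, logarithmic IDF) is an assumption for illustration, since the citation does not spell out the formula used.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    tf = raw count in the document; idf = log(N / df), where df is
    the number of documents containing the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency: one vote per doc
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weights

w = tfidf(["dil işleme çalışması",     # "language processing study"
           "dil modeli çalışması",     # "language model study"
           "görüntü işleme deneyi"])   # "image processing experiment"
```

Terms appearing in only one document (such as "modeli") get the highest weights, while terms shared across documents are discounted; this is the discriminative signal the classification comparison relies on.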
  • Ucan, S.; Huanying Gu, "A Platform for Developing Privacy Preserving Diagnosis Mobile Applications," Biomedical and Health Informatics (BHI), 2014 IEEE-EMBS International Conference on, pp. 509-512, 1-4 June 2014. doi: 10.1109/BHI.2014.6864414 Healthcare information technology is in great demand today because of the need for computational intelligence in the processing, retrieval, and use of health care information. This paper presents a platform system for developing self-diagnosis mobile applications. Mobile application developers can use this platform to develop applications that give possible diagnoses according to users' symptoms without revealing any sensitive information about the users. The system consists of stop word removal, natural language processing, privacy preserving information retrieval, and decision support.
    Keywords: data privacy; health care; information retrieval; information use; mobile computing; computational intelligence; decision support; health care information processing; health care information retrieval; health care information use; healthcare information technology; natural language processing; privacy preserving diagnosis mobile application development; privacy preserving information retrieval; stop word removal; Databases; Diseases; Medical diagnostic imaging; Mobile communication; Natural language processing; Servers; Decision Support; Healthcare; Natural Language Processing; Privacy (ID#:14-2984)
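The stop-word-removal stage of the pipeline is straightforward to sketch. The stop list below is a tiny illustrative subset, not the platform's actual list; production systems typically use lists of a few hundred words.

```python
# Hypothetical, abbreviated English stop list for illustration only.
STOP_WORDS = {"i", "have", "a", "the", "and", "my", "in", "of"}

def remove_stop_words(symptom_text):
    """Keep only the content-bearing tokens of a symptom description."""
    tokens = symptom_text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

terms = remove_stop_words("I have a headache and a sore throat")
```

The surviving terms are what the later NLP and privacy-preserving retrieval stages operate on.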
  • Kats, Yefim, "Semantic Search and NLP-Based Diagnostics," Computer-Based Medical Systems (CBMS), 2014 IEEE 27th International Symposium on, pp. 277-280, 27-29 May 2014. doi: 10.1109/CBMS.2014.68 This study considers issues in the semantic representation of written texts, especially in the context of an entropy-based approach to natural language processing in biomedical applications. These issues lie at the intersection of web search methodologies, ontology studies, lexicon studies, and natural language processing. The entropy-based methodology presented in the article aims to enhance search techniques and diagnostics by capturing semantic properties of written texts. Possible applications range from forensic linguistics to psychological diagnostics and evaluation. The presented case study assumes that, for texts written under atypical mental conditions, the level of relative text entropy may fall below a certain threshold and the distribution of entropy across the text may show unusual patterns, thus contributing to the semantic assessment of a subject's mental state. Further processing methods potentially contributing to psychological evaluation, diagnosis, and ontology-based search are discussed.
    Keywords: Cultural differences; Entropy; Medical diagnostic imaging; Natural language processing; Ontologies; Psychology; Semantics; Semantic Web; diagnostics; lexicon; natural language processing; ontology; psychological evaluation; text entropy (ID#:14-2985)
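The relative text entropy invoked in the case study can be sketched as the Shannon entropy of a text's word distribution normalized by its maximum possible value (the log of the vocabulary size). The normalization choice is an assumption for illustration; the paper does not give its exact formula in this abstract.

```python
import math
from collections import Counter

def relative_entropy(text):
    """Shannon entropy of the word distribution, scaled to [0, 1].

    1.0 means all words are equally frequent; lower values mean a
    few words dominate the text.
    """
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    h_max = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return h / h_max

varied = relative_entropy("one two three four five six")
repetitive = relative_entropy("sad sad sad sad very sad")
```

A text dominated by a few repeated words scores markedly lower, which is the kind of drop below a threshold the case study associates with atypical mental states.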
  • Heimerl, F.; Lohmann, S.; Lange, S.; Ertl, T., "Word Cloud Explorer: Text Analytics Based on Word Clouds," System Sciences (HICSS), 2014 47th Hawaii International Conference on, pp. 1833-1842, 6-9 Jan. 2014. doi: 10.1109/HICSS.2014.231 Word clouds have emerged as a straightforward and visually appealing visualization method for text. They are used in various contexts as a means to provide an overview by distilling text down to those words that appear with highest frequency. Typically, this is done in a static way as pure text summarization. We think, however, that there is a larger potential to this simple yet powerful visualization paradigm in text analytics. In this work, we explore the usefulness of word clouds for general text analysis tasks. We developed a prototypical system called the Word Cloud Explorer that relies entirely on word clouds as a visualization method. It equips them with advanced natural language processing, sophisticated interaction techniques, and context information. We show how this approach can be effectively used to solve text analysis tasks and evaluate it in a qualitative user study.
    Keywords: data visualisation; natural language processing; text analysis; context information; natural language processing; sophisticated interaction techniques; text analysis tasks; text analytics; text summarization; visualization method; visualization paradigm; word cloud explorer; word clouds; Context; Layout; Pragmatics; Tag clouds; Text analysis; User interfaces; Visualization ;interaction; natural language processing; tag clouds; text analytics; visualization; word cloud explorer; word clouds (ID#:14-2986)
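The frequency distillation underlying a word cloud is simple to sketch: count words, drop stop words, and map counts to font sizes. The linear scaling below is a common choice, not necessarily the Word Cloud Explorer's, and the stop list is an illustrative subset.

```python
from collections import Counter

STOP = {"the", "a", "of", "to", "and", "in", "is"}  # illustrative subset

def cloud_weights(text, top=5, min_size=10, max_size=40):
    """Map the most frequent content words to font sizes."""
    counts = Counter(w for w in text.lower().split() if w not in STOP)
    common = counts.most_common(top)
    if not common:
        return {}
    hi, lo = common[0][1], common[-1][1]
    span = (hi - lo) or 1  # avoid division by zero when counts tie
    return {w: min_size + (c - lo) * (max_size - min_size) // span
            for w, c in common}

sizes = cloud_weights(
    "word clouds distill text to frequent words and frequent words "
    "dominate the cloud")
```

The Word Cloud Explorer layers NLP (lemmatization, part-of-speech filtering) and interaction on top of exactly this kind of frequency-to-size mapping.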
  • Jain, A.; Lobiyal, D.K., "A New Method for Updating Word Senses in Hindi WordNet," Issues and Challenges in Intelligent Computing Techniques (ICICT), 2014 International Conference on, pp. 666-671, 7-8 Feb. 2014. doi: 10.1109/ICICICT.2014.6781359 Hindi WordNet, a rich computational lexicon, is widely used for many Hindi natural language processing (NLP) applications. However, it does not presently provide an exhaustive list of senses for every word, which degrades the performance of such NLP applications. In this paper, we propose a graph-based model and associated techniques to automatically acquire word senses; no method available in the literature is capable of automatically identifying the senses of Hindi words. We use a Hindi part-of-speech tagged corpus for building the graph model. The linkage between noun-noun concepts is extracted on the basis of syntactic and semantic relationships. All senses of a word, including those not present in Hindi WordNet, are extracted, and the method also finds categories of similar words. Using this model, NLP applications can operate at a higher level.
    Keywords: graph theory; natural language processing; Hindi WordNet; Hindi natural language processing; Hindi part of speech tagged corpus; NLP applications; computational lexicon; graph based model; noun-noun concepts; semantic relationships; syntactic relationships; word sense updating; Speech; Hindi WordNet; Natural Language processing; Word sense Disambiguation (ID#:14-2987)
  • Rabbani, M.; Alam, K.M.R.; Islam, M., "A New Verb Based Approach for English to Bangla Machine Translation," Informatics, Electronics & Vision (ICIEV), 2014 International Conference on, pp. 1-6, 23-24 May 2014. doi: 10.1109/ICIEV.2014.6850684 This paper proposes verb based machine translation (VBMT), a new approach to machine translation (MT) from English to Bangla (EtoB). For translation, it simplifies any form of English sentence (i.e., simple, complex, compound, active, or passive) into the simplest form, i.e., subject plus verb plus object. Compared with existing rule-based EtoB MT schemes, VBMT does not employ exclusive or individual structural rules for various English sentence types; it only detects the main verb of any form of English sentence and then transforms the sentence into the simplest form. Thus VBMT can translate from English to Bangla simply, correctly, and efficiently. Rule-based EtoB MT is difficult because it requires matching sentences against stored rules; moreover, many existing rule-based EtoB MT schemes are largely unable to translate complex or complicated sentences, because such sentences are difficult to match with well-established rules of English grammar. VBMT is efficient because, after identifying the main verb of any form of English sentence, it binds the remaining parts of speech (POS) as subject and object. VBMT has been successfully implemented for the MT of assertive, interrogative, imperative, exclamatory, active-passive, simple, complex, and compound English sentences, applicable in both desktop and mobile applications.
    Keywords: language translation; natural language processing; English to Bangla machine translation; EtoB MT schemes; VBMT; human language technology; natural language processing; verb based machine translation; Compounds; Conferences; Databases; Informatics; Knowledge based systems; Natural languages; Tagging; English to Bangla; Human Language Technology; Natural Language Processing; Rule based Machine Translation (ID#:14-2988)
  • Sen, M.U.; Erdogan, H., "Learning Word Representations for Turkish," Signal Processing and Communications Applications Conference (SIU), 2014 22nd, pp. 1742-1745, 23-25 April 2014. doi: 10.1109/SIU.2014.6830586 High-quality word representations have been very successful in recent years at improving performance across a variety of NLP tasks. These word representations are mappings of each word in the vocabulary to a real vector in Euclidean space. Besides high performance on specific tasks, learned word representations have been shown to perform well at capturing linear relationships among words. The recently introduced skip-gram model improved performance on unsupervised learning of word embeddings that contain rich syntactic and semantic word relations, in terms of both accuracy and speed. Word embeddings, which have been used frequently for English, had not yet been applied to Turkish. In this paper, we apply the skip-gram model to a large Turkish text corpus and measure its performance quantitatively with "question" sets that we generated. The learned word embeddings and the question sets are publicly available at our website.
    Keywords: learning (artificial intelligence); natural language processing; text analysis; English language; Euclidean space; NLP tasks; Turkish text corpus; high-quality word representations; learned word embeddings; learned word representations; linear relationships; question sets; skip-gram model; unsupervised learning; word embeddings; Conferences; Natural language processing; Probabilistic logic; Recurrent neural networks; Signal processing; Vectors; Deep Learning; Natural Language Processing; Word embeddings (ID#:14-2989)
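The "question" sets mentioned above are typically analogy questions of the form a : b :: c : ?, answered with vector arithmetic over the learned embeddings (the 3CosAdd method associated with the skip-gram model). The sketch below uses tiny hand-made 3-d vectors with ASCII-transliterated Turkish words purely for illustration; real skip-gram embeddings are learned from a corpus and are typically 100-300 dimensional.

```python
import math

# Toy 3-d "embeddings"; real skip-gram vectors are learned, not hand-set.
emb = {
    "kral":    [1.0, 1.0, 0.0],  # king
    "kralice": [1.0, 0.0, 1.0],  # queen
    "erkek":   [0.0, 1.0, 0.0],  # man
    "kadin":   [0.0, 0.0, 1.0],  # woman
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Answer a : b :: c : ? by maximizing cos(x, b - a + c)."""
    target = [emb[b][i] - emb[a][i] + emb[c][i] for i in range(3)]
    candidates = set(emb) - {a, b, c}
    return max(candidates, key=lambda w: cosine(emb[w], target))

answer = analogy("erkek", "kadin", "kral")  # man : woman :: king : ?
```

Scoring an embedding is then just the fraction of such questions whose top-ranked candidate matches the expected word.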
  • Anastasakos, T.; Young-Bum Kim; Deoras, A., "Task Specific Continuous Word Representations for Mono and Multi-Lingual Spoken Language Understanding," Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 3246-3250, 4-9 May 2014. doi: 10.1109/ICASSP.2014.6854200 Models for statistical spoken language understanding (SLU) systems are conventionally trained using supervised discriminative training methods. In many cases, however, the labeled data necessary for these supervised techniques is not readily available, necessitating a laborious data collection and annotation effort. This often results in data sets that are not expansive enough to adequately cover all patterns of natural language phrases that occur in the target applications. Word embedding features alleviate data and feature sparsity issues by learning mathematical representations of words and word associations in a continuous space. In this work, we present techniques to obtain task- and domain-specific word embeddings and show their usefulness over those obtained from generic unsupervised data. We also show how we transfer these embeddings from one language to another, enabling the training of a multilingual spoken language understanding system.
    Keywords: learning (artificial intelligence); natural language processing; SLU system; data annotation; data collection; domain specific word embeddings; monolingual spoken language understanding; multilingual spoken language understanding; natural language phrases; supervised discriminative training methods; task specific continuous word representation; Context; Encyclopedias; Games; Motion pictures; Semantics; Training; Vocabulary; named entity recognition; natural language processing; spoken language understanding; vector space models; word embedding (ID#:14-2990)


Articles listed on these pages have been found on publicly available internet pages and are cited with links to those pages. Some of the information included herein has been reprinted with permission from the authors or data repositories. Direct any requests for removal of links or modifications to specific citations via email to SoS.Project (at), and please include the ID# of the specific citation in your correspondence.