Text preprocessing using annotated suffix tree with matching keyphrase

Ionia Veritawati, Ito Wasito, T. Basaruddin

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

Text documents are an important source of information and knowledge. Most of the knowledge needed in various domains for different purposes is in the form of implicit content. The content of a text is represented by keyphrases, each consisting of one or more meaningful words. Keyphrases can be extracted from text through several processing steps, including text preprocessing. An Annotated Suffix Tree (AST) built from the document collection itself is used to extract keyphrases after basic text preprocessing, namely stop-word removal and stemming, has been applied. A combination of four preprocessing variations is used. The extracted two-word (bi-word) and three-word phrases form a list of candidate keyphrases, which can help users who need keyphrase information understand the content of the documents. The candidate keyphrases can be processed further by a learning process, with manual validation, to label each one as a keyphrase or non-keyphrase for the text domain. Experiments on a simulation corpus, in which the keyphrases are determined from the corpus itself, show that more than 90% of two-word and three-word keyphrases can be extracted. On a real corpus from the economy domain, about 70% of keyphrases or meaningful phrases can be extracted. The proposed method can be an effective way to find candidate keyphrases in a collection of text documents: it reduces the number of non-keyphrases and non-meaningful phrases in the candidate list, and it can detect keyphrases whose words are separated by stop words.
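The extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a word-level suffix tree in which each node is annotated with the frequency of the phrase leading to it, a tiny illustrative stop-word list, and no stemming. The names `preprocess`, `build_ast`, `candidates`, and the `min_count` threshold are hypothetical.

```python
# Illustrative stop-word list only; a real system would use a fuller list
# and would also apply stemming, which this sketch omits.
STOP_WORDS = {"the", "of", "a", "an", "is", "in", "and", "to", "for"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip simple punctuation, and drop stop words."""
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    return [t for t in tokens if t and t not in stop_words]


class Node:
    """Tree node annotated with the frequency of the phrase ending here."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_ast(docs, max_depth=3):
    """Insert every word-level suffix of every document, truncated to
    max_depth words, and increment the count of each node on the path."""
    root = Node()
    for tokens in docs:
        for start in range(len(tokens)):
            node = root
            for word in tokens[start:start + max_depth]:
                node = node.children.setdefault(word, Node())
                node.count += 1
    return root


def candidates(root, min_len=2, max_len=3, min_count=2):
    """Collect phrases of min_len..max_len words seen at least min_count times."""
    found = {}

    def walk(node, path):
        if min_len <= len(path) <= max_len and node.count >= min_count:
            found[" ".join(path)] = node.count
        if len(path) < max_len:
            for word, child in node.children.items():
                walk(child, path + [word])

    walk(root, [])
    return found


if __name__ == "__main__":
    texts = [
        "The annotated suffix tree is built from the text.",
        "An annotated suffix tree stores frequency of the text.",
    ]
    tree = build_ast([preprocess(t) for t in texts])
    for phrase, count in sorted(candidates(tree).items()):
        print(phrase, count)
```

Because stop words are removed before the tree is built, words that were separated by a stop word in the original text become adjacent and can surface as a single candidate phrase. The frequency threshold is one simple way to filter candidates; the paper's own selection criteria may differ.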

Original language: English
Pages (from-to): 409-420
Number of pages: 12
Journal: International Journal of Electrical and Computer Engineering
Volume: 5
Issue number: 3
Publication status: Published - 1 Jun 2015

Keywords

  • 2-means clustering
  • Annotated suffix tree
  • Keyphrase
  • Preprocessing
  • TF-IDF
