Text documents are an important source of information and knowledge. Much of the knowledge needed in various domains and for different purposes exists only as implicit content. The content of a text can be represented by keyphrases, each consisting of one or more meaningful words. Keyphrases can be extracted from text through several processing steps, including text preprocessing. Here, an Annotated Suffix Tree (AST) built from the document collection itself is used to extract keyphrases after basic text preprocessing, namely stop word removal and stemming, has been applied. A combination of four preprocessing variations is used. The extracted two-word (bi-word) and three-word phrases form a list of candidate keyphrases, which can help users who need keyphrase information to understand the content of the documents. The candidates can be processed further by a learning process, with manual validation, to classify each one as a keyphrase or non-keyphrase for the text domain. Experiments on a simulation corpus with predetermined keyphrases show that more than 90% of the two- and three-word keyphrases can be extracted. On a real corpus from the economics domain, about 70% of the keyphrases or meaningful phrases can be extracted. The proposed method can be an effective way to find candidate keyphrases in a collection of text documents: it reduces the number of non-keyphrases or non-meaningful phrases in the candidate list and can detect keyphrases whose words are separated by stop words.
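The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a word-level tree truncated at depth three, with each node annotated by a raw occurrence count, and it omits stemming. The stop-word list and the names `preprocess`, `build_tree`, `candidates`, and `min_count` are hypothetical choices made for this example.

```python
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are far larger).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


class Node:
    """Tree node annotated with how often its word sequence occurs."""

    def __init__(self):
        self.count = 0
        self.children = defaultdict(Node)


def preprocess(text):
    """Lowercase, strip punctuation, and drop stop words (stemming omitted)."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_tree(documents, max_depth=3):
    """Insert every word-level suffix of every document, truncated to max_depth."""
    root = Node()
    for doc in documents:
        words = preprocess(doc)
        for start in range(len(words)):
            node = root
            for word in words[start:start + max_depth]:
                node = node.children[word]
                node.count += 1
    return root


def candidates(root, min_count=2):
    """Return {phrase: count} for two- and three-word paths seen >= min_count times."""
    found = {}

    def walk(node, path):
        for word, child in node.children.items():
            new_path = path + [word]
            if len(new_path) >= 2 and child.count >= min_count:
                found[" ".join(new_path)] = child.count
            if len(new_path) < 3:
                walk(child, new_path)

    walk(root, [])
    return found


docs = [
    "The balance of payments is in deficit.",
    "A deficit in the balance of payments.",
    "Balance of trade and the balance of payments.",
]
print(candidates(build_tree(docs)))  # {'balance payments': 3}
```

Because stop words are removed before the tree is built, "balance of payments" is counted as the two-word candidate "balance payments", which is one simple way a phrase separated by a stop word can still be detected.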
|Number of pages||12|
|Journal||International Journal of Electrical and Computer Engineering|
|Publication status||Published - 1 Jun 2015|
- 2-means clustering
- Annotated suffix tree