Performance comparison of clustering algorithms on scientific publications

Anne Parlina, Kalamullah Ramli

Research output: Contribution to journal › Article

Abstract

The rapid growth in the number of scientific papers in digital form has made document management more complex. Developing effective and efficient methods to sort and organize these documents is therefore crucial. Clustering, a data mining technique widely applied in various fields, may be used to address this issue. This paper compares the performance of partitioning-based clustering algorithms, namely random clustering, k-means, x-means, and k-medoids, in an unsupervised classification of scientific publications based on topic similarity. Rapidminer is used to preprocess and analyze the data. The purity value and processing time of each algorithm are then examined. The results show that k-means achieves the best purity value, although its run time is not the fastest. Random clustering, meanwhile, offers the fastest processing time at the cost of the lowest purity value. None of the observed algorithms achieves both the best purity and the best processing time. This may be due to the complex set of parameters that affect the clustering results, including the type of data, the selected algorithm, the distance measure, and the preprocessing methods.
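The purity value mentioned in the abstract can be illustrated with a short sketch. It assumes the standard definition of purity (each cluster is credited with its most frequent true topic, and the credited counts are summed and divided by the number of documents); the abstract does not state the exact formula the authors used, and the study itself was run in Rapidminer, not Python. The function name `purity` and the sample labels are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Return the purity of a clustering under the standard definition.

    Assumption: purity = (1 / N) * sum over clusters of the size of the
    largest true-label group inside that cluster.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Count how often each true label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts[cluster][label] += 1

    # Credit each cluster with its majority label only.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical example: six documents, two clusters, three topics.
clusters = [0, 0, 0, 1, 1, 1]
topics = ["ml", "ml", "networks", "networks", "networks", "security"]
print(purity(clusters, topics))
```

In this example each cluster contains two documents of its majority topic, so four of the six documents are credited. A purity of 1.0 would mean every cluster contains documents from a single topic only.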

Original language: English
Pages (from-to): 3730-3732
Number of pages: 3
Journal: Advanced Science Letters
Volume: 23
Issue number: 4
Publication status: Published - 1 Jan 2017

Keywords

  • Clustering
  • Rapidminer
  • Text Mining
