The rapid growth in the number of scientific papers available in digital form has made document management more complex. Effective and efficient methods for sorting and organizing these documents are therefore crucial. Clustering, a data mining technique widely applied in various fields, can be used to address this issue. This paper compares the performance of partitioning-based clustering algorithms, namely random clustering, k-means, x-means, and k-medoids, in the unsupervised classification of scientific publications based on topic similarity. RapidMiner is used to preprocess and analyze the data, after which the purity value and processing time of each algorithm are examined. The results show that k-means achieves the best purity value, although its run time is not the fastest. Random clustering, by contrast, offers the fastest processing time at the cost of the lowest purity value. None of the observed algorithms achieves both the best purity and the best processing time. This may be due to the complex set of parameters that affect clustering results, including the type of data, the selected algorithm, the distance measure, and the preprocessing methods.
- Text Mining
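
The abstract reports a purity value for each algorithm but does not state how it is calculated. A minimal sketch follows, assuming the commonly used definition: each cluster is credited with the count of its most frequent true topic, and the sum over clusters is divided by the total number of documents. The function name `purity` and the toy labels are hypothetical and are not taken from the paper or from RapidMiner.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Return the clustering purity in the range (0, 1].

    Assumed definition: sum over clusters of the size of the
    majority true label in that cluster, divided by the number
    of documents.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if len(cluster_ids) == 0:
        raise ValueError("at least one document is required")

    # Count how often each true label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts[cluster][label] += 1

    # Credit each cluster with the count of its most frequent label.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: 8 documents, 3 clusters, 3 topics.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
topics = ["ml", "ml", "db", "db", "db", "net", "net", "net"]
score = purity(clusters, topics)  # majority counts 2 + 2 + 2 over 8 documents
```

Under this definition a higher value means that clusters align more closely with the labeled topics. The measure tends to rise as the number of clusters grows, so it is most informative when the compared algorithms use the same number of clusters.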