TY - JOUR
T1 - Performance comparison of clustering algorithms on scientific publications
AU - Parlina, Anne
AU - Ramli, Kalamullah
N1 - Publisher Copyright: © 2017 American Scientific Publishers. All rights reserved.
PY - 2017
Y1 - 2017
N2 - The rapid growth in the number of scientific papers available in digital form has made document management more complex. Developing effective and efficient methods to sort and organize these documents is therefore crucial. Clustering, a data mining technique widely applied in various fields, may be used to address this issue. This paper compares the performance of partitioning-based clustering algorithms, namely random clustering, k-means, x-means, and k-medoids, in the unsupervised classification of scientific publications by topic similarity. RapidMiner is used to preprocess and analyze the data, after which the purity value and processing time of each algorithm are examined. The results show that k-means achieves the best purity value, although its run time is not the fastest. Random clustering, by contrast, offers the fastest processing time but the lowest purity value. None of the observed algorithms achieves both the best purity and the best processing time. This may be due to the complex set of parameters that affect clustering results, including the type of data, the selected algorithm, the distance measure, and the preprocessing methods.
AB - The rapid growth in the number of scientific papers available in digital form has made document management more complex. Developing effective and efficient methods to sort and organize these documents is therefore crucial. Clustering, a data mining technique widely applied in various fields, may be used to address this issue. This paper compares the performance of partitioning-based clustering algorithms, namely random clustering, k-means, x-means, and k-medoids, in the unsupervised classification of scientific publications by topic similarity. RapidMiner is used to preprocess and analyze the data, after which the purity value and processing time of each algorithm are examined. The results show that k-means achieves the best purity value, although its run time is not the fastest. Random clustering, by contrast, offers the fastest processing time but the lowest purity value. None of the observed algorithms achieves both the best purity and the best processing time. This may be due to the complex set of parameters that affect clustering results, including the type of data, the selected algorithm, the distance measure, and the preprocessing methods.
KW - Clustering
KW - RapidMiner
KW - Text Mining
UR - http://www.scopus.com/inward/record.url?scp=85021144860&partnerID=8YFLogxK
U2 - 10.1166/asl.2017.9003
DO - 10.1166/asl.2017.9003
M3 - Article
AN - SCOPUS:85021144860
SN - 1936-6612
VL - 23
SP - 3730
EP - 3732
JO - Advanced Science Letters
JF - Advanced Science Letters
IS - 4
ER -