The distance function approach on the MiniBatchKMeans algorithm for the DPP-4 inhibitors on the discovery of type 2 diabetes drugs

Sarah Syarofina, Alhadi Bustamam, Arry Yanuar, Devvi Sarwinda, Herley S. Al-Ash, Abdul Hayat

Research output: Contribution to journal › Conference article › peer-review

4 Citations (Scopus)


Several DPP-4 inhibitors used to treat type 2 diabetes mellitus (T2DM) still cause unsafe side effects in long-term use, so new DPP-4 inhibitors are needed to minimize them. Quantitative structure-activity relationship (QSAR) modeling can support the development of DPP-4 inhibitor drugs, and selecting a subset of DPP-4 inhibitor molecules by clustering can improve the accuracy of the QSAR model. This study aims to select suitable DPP-4 inhibitor molecules for QSAR modeling by using the MiniBatchKMeans algorithm with Levenshtein distance, together with the logP criterion of Lipinski's Rule of 5. The research began by collecting DPP-4 inhibitor molecule data in CSV format from the ChEMBL database. The molecular structures were represented by the SMILES strings in the data. Before clustering, the SMILES strings were converted into molecular fingerprints using three fingerprint generators: MACCS, ECFP, and FCFP. Clustering was carried out on five fingerprint datasets: ECFP (diameters 4 and 6), FCFP (diameters 4 and 6), and MACCS (167 structural keys). The clustering process began by determining the optimal number of clusters, evaluated with the Davies-Bouldin index (DBI), the Silhouette coefficient (SCO), and the Calinski-Harabasz score (CHS). The clustering produced 1540 clusters on the MACCS fingerprint dataset, which gave a minimum DBI of 0.545311, a maximum SCO of 0.302842, and a maximum CHS of 331.3942. Applying the logP criterion of Lipinski's Rule of 5 in the molecular selection step then yielded 1532 molecules, with logP values between -0.205 and 4.95.
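The abstract does not say how the Levenshtein distance is computed or how it is wired into MiniBatchKMeans, so the following is only a minimal sketch of the two ingredients it names: an edit distance between fingerprint bit strings and a logP check in the spirit of Lipinski's Rule of 5. The function names, the toy 8-bit fingerprints, and the `max_logp=5.0` threshold are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def passes_logp(logp: float, max_logp: float = 5.0) -> bool:
    """Assumed form of the logP criterion: keep molecules with logP <= 5."""
    return logp <= max_logp


# Hypothetical toy fingerprints (real MACCS keys have 167 bits).
fp_a = "10110010"
fp_b = "10011010"

if __name__ == "__main__":
    print(levenshtein(fp_a, fp_b))
    print([x for x in (-0.205, 4.95, 5.6) if passes_logp(x)])
```

For equal-length bit strings the edit distance is never larger than the Hamming distance, because insertions and deletions can realign shifted patterns that a position-by-position comparison would count as mismatches.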

Original language: English
Pages (from-to): 127-134
Number of pages: 8
Journal: Procedia Computer Science
Publication status: Published - 2021
Event: 5th International Conference on Computer Science and Computational Intelligence, ICCSCI 2020 - Virtual, Online, Indonesia
Duration: 19 Nov 2020 to 20 Nov 2020


Keywords:
  • Clustering
  • DPP-4 inhibitors
  • Levenshtein distance
  • MiniBatchKMeans algorithm
  • Molecular fingerprint
  • Type 2 diabetes

