TY - JOUR
T1 - The distance function approach on the MiniBatchKMeans algorithm for the DPP-4 inhibitors on the discovery of type 2 diabetes drugs
AU - Syarofina, Sarah
AU - Bustamam, Alhadi
AU - Yanuar, Arry
AU - Sarwinda, Devvi
AU - Al-Ash, Herley S.
AU - Hayat, Abdul
N1 - Funding Information:
This research was supported by the Tesis Magister Grant 2020 from Kementerian Riset dan Teknologi/Badan Riset dan Inovasi Nasional, Indonesia, No. NKB-477/UN2.RST/HKP.05.00/2020.
Publisher Copyright:
© 2021 Elsevier B.V. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Several DPP-4 inhibitors used in the treatment of type 2 diabetes mellitus (T2DM) still have unsafe side effects in long-term use, so new DPP-4 inhibitors are needed to minimize these side effects. Quantitative structure-activity relationship (QSAR) modeling can be used in the development of DPP-4 inhibitor drugs, and selecting a subset of DPP-4 inhibitor molecules by clustering can improve the accuracy of the QSAR model. This study aims to select DPP-4 inhibitor molecules for QSAR modeling by using the MiniBatchKMeans algorithm with Levenshtein distance and the logP criterion of Lipinski's Rule of 5. The research began with the collection of DPP-4 inhibitor molecule data from the ChEMBL database (https://www.ebi.ac.uk/chembl/) in CSV format. The molecular structures are represented by their SMILES strings. Before clustering, the SMILES data are converted into molecular fingerprints using several fingerprint generators, namely MACCS, ECFP, and FCFP. Clustering was carried out on five fingerprint datasets: ECFP (diameters 4 and 6), FCFP (diameters 4 and 6), and MACCS (167 structural keys). The clustering process begins by determining the optimal number of clusters, evaluated with the Davies-Bouldin index (DBI), the Silhouette coefficient (SCO), and the Calinski-Harabasz score (CHS). The clustering yielded 1540 clusters on the MACCS fingerprint dataset, with a minimum DBI of 0.545311, a maximum SCO of 0.302842, and a maximum CHS of 331.3942. Applying the logP criterion of Lipinski's Rule of 5, the molecular selection process yielded 1532 molecules with logP values between -0.205 and 4.95.
AB - Several DPP-4 inhibitors used in the treatment of type 2 diabetes mellitus (T2DM) still have unsafe side effects in long-term use, so new DPP-4 inhibitors are needed to minimize these side effects. Quantitative structure-activity relationship (QSAR) modeling can be used in the development of DPP-4 inhibitor drugs, and selecting a subset of DPP-4 inhibitor molecules by clustering can improve the accuracy of the QSAR model. This study aims to select DPP-4 inhibitor molecules for QSAR modeling by using the MiniBatchKMeans algorithm with Levenshtein distance and the logP criterion of Lipinski's Rule of 5. The research began with the collection of DPP-4 inhibitor molecule data from the ChEMBL database (https://www.ebi.ac.uk/chembl/) in CSV format. The molecular structures are represented by their SMILES strings. Before clustering, the SMILES data are converted into molecular fingerprints using several fingerprint generators, namely MACCS, ECFP, and FCFP. Clustering was carried out on five fingerprint datasets: ECFP (diameters 4 and 6), FCFP (diameters 4 and 6), and MACCS (167 structural keys). The clustering process begins by determining the optimal number of clusters, evaluated with the Davies-Bouldin index (DBI), the Silhouette coefficient (SCO), and the Calinski-Harabasz score (CHS). The clustering yielded 1540 clusters on the MACCS fingerprint dataset, with a minimum DBI of 0.545311, a maximum SCO of 0.302842, and a maximum CHS of 331.3942. Applying the logP criterion of Lipinski's Rule of 5, the molecular selection process yielded 1532 molecules with logP values between -0.205 and 4.95.
KW - Clustering
KW - DPP-4 inhibitors
KW - Levenshtein distance
KW - MiniBatchKMeans algorithm
KW - Molecular fingerprint
KW - Type 2 diabetes
UR - http://www.scopus.com/inward/record.url?scp=85101767740&partnerID=8YFLogxK
U2 - 10.1016/j.procs.2020.12.017
DO - 10.1016/j.procs.2020.12.017
M3 - Conference article
AN - SCOPUS:85101767740
SN - 1877-0509
VL - 179
SP - 127
EP - 134
JO - Procedia Computer Science
JF - Procedia Computer Science
T2 - 5th International Conference on Computer Science and Computational Intelligence, ICCSCI 2020
Y2 - 19 November 2020 through 20 November 2020
ER -