TY - JOUR
T1 - Artificial intelligence paradigm for ligand-based virtual screening on the drug discovery of type 2 diabetes mellitus
AU - Bustamam, Alhadi
AU - Hamzah, Haris
AU - Husna, Nadya A.
AU - Syarofina, Sarah
AU - Dwimantara, Nalendra
AU - Yanuar, Arry
AU - Sarwinda, Devvi
N1 - Funding Information:
This work was fully funded by the PUTI Q1 2020 grant from Universitas Indonesia with contract number NKB-1381/UN2.RST/HKP.05.00/2020. The funding body played no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.
Funding Information:
This research was supported by the PUTI Q1 2020 grant from Universitas Indonesia with contract number NKB-1381/UN2.RST/HKP.05.00/2020. The authors appreciate colleagues from the Directorate General of Higher Education (BRIN/DIKTI), the Directorate of Research and Community Engagement Universitas Indonesia, and Data Science Center Universitas Indonesia who contributed insights and expertise to advance this research in innumerable ways. We would also like to thank all anonymous reviewers for their constructive advice.
Publisher Copyright:
© 2021, The Author(s).
PY - 2021/12
Y1 - 2021/12
AB - Background: New dipeptidyl peptidase-4 (DPP-4) inhibitors need to be developed as agents with low adverse effects for the treatment of type 2 diabetes mellitus. This study aims to build quantitative structure-activity relationship (QSAR) models using the artificial intelligence paradigm. Rotation Forest and Deep Neural Network (DNN) are used to build the QSAR models. We compared principal component analysis (PCA) with sparse PCA (SPCA) as the transformation method in Rotation Forest. K-modes clustering with Levenshtein distance was used to select molecules, and CatBoost was used for feature selection. Results: The molecule selection process using the K-modes clustering algorithm yielded 1020 DPP-4 inhibitor molecules with logP values ranging from -1.6693 to 4.99044. Extended connectivity fingerprints and functional class fingerprints with diameters of 4 and 6 were used to construct four fingerprint datasets: ECFP_4, ECFP_6, FCFP_4, and FCFP_6. There are 1024 features from the four fingerprint datasets, which were then selected using the CatBoost method. With CatBoost feature selection, the QSAR models performed well for both the machine learning and deep learning methods, with evaluation metrics such as sensitivity, specificity, accuracy, and Matthews correlation coefficient all above 70% at feature importance levels of 60%, 70%, 80%, and 90%. Conclusion: The K-modes clustering algorithm can produce a representative subset of DPP-4 inhibitor molecules. Feature selection on the fingerprint datasets using CatBoost is best applied before building QSAR classification and QSAR regression models. QSAR classification using machine learning and QSAR classification using deep learning each achieve an accuracy above 70%. The QSAR RFC-PCA and QSAR RFR-PCA models performed better than the QSAR RFC-SPCA and QSAR RFR-SPCA models because they are more time-efficient.
KW - CatBoost
KW - Deep neural network
KW - Fingerprint
KW - K-modes clustering
KW - principal component analysis
KW - Quantitative structure-activity relationship
KW - Rotation Forest
KW - Sparse principal component analysis
UR - http://www.scopus.com/inward/record.url?scp=85106869524&partnerID=8YFLogxK
U2 - 10.1186/s40537-021-00465-3
DO - 10.1186/s40537-021-00465-3
M3 - Article
AN - SCOPUS:85106869524
SN - 2196-1115
VL - 8
JO - Journal of Big Data
JF - Journal of Big Data
IS - 1
M1 - 74
ER -