TY - GEN
T1 - Performance Evaluation XGBoost in Handling Missing Value on Classification of Hepatocellular Carcinoma Gene Expression Data
AU - Latief, Moh Abdul
AU - Bustamam, Alhadi
AU - Siswantining, Titin
N1 - Publisher Copyright:
© 2020 IEEE.
Copyright:
Copyright 2021 Elsevier B.V., All rights reserved.
PY - 2020/11/10
Y1 - 2020/11/10
AB - Missing values occur when an observation has no recorded value, which results in a loss of information. One way to deal with missing values is to delete the observations that contain them. However, even when only a small share of values is missing, deletion can discard important information from the data. This study evaluated the performance of the XGBoost method in dealing with missing values for a classification problem on Hepatocellular Carcinoma gene expression data. The data set has 40 observations and 54675 features and was obtained from the National Center for Biotechnology Information website. The researchers randomly removed 5%, 10%, 15%, and 20% of the total data to compare the model's performance with and without imputation. The imputation methods used were mean imputation and k-nearest neighbor (KNN) imputation. Model performance was measured using cross-validation and confusion-matrix evaluation procedures, and hyperparameters were tuned with grid search. In general, handling missing values with mean imputation gave better performance than KNN imputation or no imputation when classifying Hepatocellular Carcinoma gene expression data. With 20% missing values, mean imputation produced the highest performance, with 100% specificity, 100% sensitivity, 100% accuracy, 100% precision, and 100% MCC in training and testing data, and 88% sensitivity, 100% specificity, 100% precision, 94% accuracy, and 89% MCC. The XGBoost machine learning model can handle missing values in a dataset without imputation, but imputation can improve classification performance on Hepatocellular Carcinoma gene expression data.
KW - cross-validation
KW - grid search
KW - KNN imputation
KW - mean imputation
KW - XGBoost
UR - http://www.scopus.com/inward/record.url?scp=85099471842&partnerID=8YFLogxK
U2 - 10.1109/ICICoS51170.2020.9299012
DO - 10.1109/ICICoS51170.2020.9299012
M3 - Conference contribution
AN - SCOPUS:85099471842
T3 - ICICoS 2020 - Proceeding: 4th International Conference on Informatics and Computational Sciences
BT - ICICoS 2020 - Proceeding
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 4th International Conference on Informatics and Computational Sciences, ICICoS 2020
Y2 - 10 November 2020 through 11 November 2020
ER -