TY - JOUR
T1 - Iterative bicluster-based Bayesian principal component analysis and least squares for missing-value imputation in microarray and RNA-sequencing data
AU - Soemartojo, Saskya Mary
AU - Siswantining, Titin
AU - Fernando, Yoel
AU - Sarwinda, Devvi
AU - Al-Ash, Herley Shaori
AU - Syarofina, Sarah
AU - Saputra, Noval
N1 - Funding Information:
Universitas Indonesia funded this research with grant number NKB-030/UN2.F3/HKP.05.00/2021.
Publisher Copyright:
© 2022 the Author(s), licensee AIMS Press.
PY - 2022
Y1 - 2022
N2 - Microarray and RNA-sequencing (RNA-seq) techniques each produce gene expression data that can be expressed as a matrix that often contains missing values. Thus, a process of missing-value imputation that uses coherence information of the dataset is necessary. Existing imputation methods, such as iterative bicluster-based least squares (bi-iLS), use biclustering to estimate the missing values because genes are only similar under correlative experimental conditions. Also, they use the row average to obtain a temporary complete matrix, but the use of the row average is considered to be a flaw. The row average cannot reflect the real structure of the dataset because the row average only uses the information of an individual row. Therefore, we propose the use of Bayesian principal component analysis (BPCA) to obtain the temporary complete matrix instead of using the row average in bi-iLS. This alteration produces a new missing-value imputation method called iterative bicluster-based Bayesian principal component analysis and least squares (bi-BPCA-iLS). Several experiments have been conducted on two-dimensional independent gene expression datasets, which are microarray (e.g., cell-cycle expression dataset of the yeast Saccharomyces cerevisiae) and RNA-seq (gene expression data from Schizosaccharomyces pombe) datasets. In the case of the microarray dataset, our proposed bi-BPCA-iLS method showed a significant overall improvement in the normalized root mean square error (NRMSE) values of 10.6% over local least squares (LLS) and 0.6% over bi-iLS. In the case of the RNA-seq dataset, our proposed bi-BPCA-iLS method showed an overall improvement in the NRMSE values of 8.2% over LLS and 3.1% over bi-iLS. The additional computational time of bi-BPCA-iLS is not significant compared to bi-iLS.
AB - Microarray and RNA-sequencing (RNA-seq) techniques each produce gene expression data that can be expressed as a matrix that often contains missing values. Thus, a process of missing-value imputation that uses coherence information of the dataset is necessary. Existing imputation methods, such as iterative bicluster-based least squares (bi-iLS), use biclustering to estimate the missing values because genes are only similar under correlative experimental conditions. Also, they use the row average to obtain a temporary complete matrix, but the use of the row average is considered to be a flaw. The row average cannot reflect the real structure of the dataset because the row average only uses the information of an individual row. Therefore, we propose the use of Bayesian principal component analysis (BPCA) to obtain the temporary complete matrix instead of using the row average in bi-iLS. This alteration produces a new missing-value imputation method called iterative bicluster-based Bayesian principal component analysis and least squares (bi-BPCA-iLS). Several experiments have been conducted on two-dimensional independent gene expression datasets, which are microarray (e.g., cell-cycle expression dataset of the yeast Saccharomyces cerevisiae) and RNA-seq (gene expression data from Schizosaccharomyces pombe) datasets. In the case of the microarray dataset, our proposed bi-BPCA-iLS method showed a significant overall improvement in the normalized root mean square error (NRMSE) values of 10.6% over local least squares (LLS) and 0.6% over bi-iLS. In the case of the RNA-seq dataset, our proposed bi-BPCA-iLS method showed an overall improvement in the NRMSE values of 8.2% over LLS and 3.1% over bi-iLS. The additional computational time of bi-BPCA-iLS is not significant compared to bi-iLS.
KW - biclustering
KW - microarray
KW - normalized root mean square error
KW - RNA sequencing
UR - http://www.scopus.com/inward/record.url?scp=85132239703&partnerID=8YFLogxK
U2 - 10.3934/mbe.2022405
DO - 10.3934/mbe.2022405
M3 - Article
AN - SCOPUS:85132239703
SN - 1547-1063
VL - 19
SP - 8741
EP - 8759
JO - Mathematical Biosciences and Engineering
JF - Mathematical Biosciences and Engineering
IS - 9
ER -