TY - GEN
T1 - Visual-only word boundary detection
AU - Maulana, Muhammad Rizki Aulia Rahman
AU - Larasati, Retno
AU - Fanany, Mohamad Ivan
N1 - Publisher Copyright: © Springer International Publishing AG 2017.
PY - 2017
Y1 - 2017
AB - Word boundary detection is one of the primary components of a speech recognition system. It can be learned jointly as part of the speech model or independently as an extra preprocessing step, which reduces the problem to conditionally independent word prediction. It can also be used to separate out-of-vocabulary (OOV) words in a sentence, thereby avoiding unnecessary computation. On its own, word boundary detection is essential in multimodal corpus collection, where it allows automated and detailed labeling of the dataset at either the sentence or the word level. In this research, we propose a novel approach to word boundary detection that uses only visual information, based on a 3-Dimensional Convolutional Neural Network (3D-CNN) and a Bidirectional Gated Recurrent Unit (Bi-GRU). This research helps pave the way for better lip reading systems and for multimodal speech recognition, as it allows easier creation of novel datasets and enables conventional word-level visual or multimodal speech recognition systems to work on continuous speech. Training was done on the GRID video corpus for 118 epochs. The proposed model performed well compared to the baseline method, with a considerably lower error rate.
KW - 3-Dimensional convolutional neural network
KW - Speech recognition
KW - Word boundary detection
KW - Word segmentation
UR - http://www.scopus.com/inward/record.url?scp=85034230538&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-69456-6_13
DO - 10.1007/978-3-319-69456-6_13
M3 - Conference contribution
AN - SCOPUS:85034230538
SN - 9783319694559
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 150
EP - 161
BT - Multi-disciplinary Trends in Artificial Intelligence - 11th International Workshop, MIWAI 2017, Proceedings
A2 - Phon-Amnuaisuk, Somnuk
A2 - Ang, Swee-Peng
A2 - Lee, Soo-Young
PB - Springer Verlag
T2 - 11th Multi-disciplinary International Workshop on Artificial Intelligence, MIWAI 2017
Y2 - 20 November 2017 through 22 November 2017
ER -