TY - GEN
T1 - Sentence-level Indonesian lip reading with spatiotemporal CNN and gated RNN
AU - Maulana, Muhammad Rizki Aulia Rahman
AU - Fanany, Mohamad Ivan
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/7/2
Y1 - 2017/7/2
N2 - It is widely known that visual cues play an important role in speech, especially in disambiguating confusable phonemes or as a means of 'hearing' visually. Interpreting speech through the visual signal alone is called lip reading. Lip reading has several potential applications, either as a complementary modality to speech recognition or as purely visual speech recognition, which gives rise to a silent speech interface, which in turn has numerous practical applications. Despite the overwhelming potential of such a system, research on lip reading for the Indonesian language has been extremely limited, with settings still very distant from the real world. This research is an attempt to build a lip reading model with the potential to be applicable in the real world, specifically a lip reading model that supports variable-length sentences as its input. We build the model using deep learning, specifically a spatiotemporal Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU), which respectively form the spatiotemporal feature extractor and the character-level sentence decoder. In the process, we also investigate whether knowledge of lip reading in one language affects the acquisition of a different language. To the best of our knowledge, our model is the first sentence-level Indonesian lip reading model that supports variable-length input. Our model achieved superhuman performance on all metrics, with almost 20× better word accuracy.
AB - It is widely known that visual cues play an important role in speech, especially in disambiguating confusable phonemes or as a means of 'hearing' visually. Interpreting speech through the visual signal alone is called lip reading. Lip reading has several potential applications, either as a complementary modality to speech recognition or as purely visual speech recognition, which gives rise to a silent speech interface, which in turn has numerous practical applications. Despite the overwhelming potential of such a system, research on lip reading for the Indonesian language has been extremely limited, with settings still very distant from the real world. This research is an attempt to build a lip reading model with the potential to be applicable in the real world, specifically a lip reading model that supports variable-length sentences as its input. We build the model using deep learning, specifically a spatiotemporal Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU), which respectively form the spatiotemporal feature extractor and the character-level sentence decoder. In the process, we also investigate whether knowledge of lip reading in one language affects the acquisition of a different language. To the best of our knowledge, our model is the first sentence-level Indonesian lip reading model that supports variable-length input. Our model achieved superhuman performance on all metrics, with almost 20× better word accuracy.
UR - http://www.scopus.com/inward/record.url?scp=85051143790&partnerID=8YFLogxK
U2 - 10.1109/ICACSIS.2017.8355061
DO - 10.1109/ICACSIS.2017.8355061
M3 - Conference contribution
AN - SCOPUS:85051143790
T3 - 2017 International Conference on Advanced Computer Science and Information Systems, ICACSIS 2017
SP - 375
EP - 380
BT - 2017 International Conference on Advanced Computer Science and Information Systems, ICACSIS 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 9th International Conference on Advanced Computer Science and Information Systems, ICACSIS 2017
Y2 - 28 October 2017 through 29 October 2017
ER -