It is widely known that visual cues play an important role in speech, especially in disambiguating confusable phonemes or as a means of 'hearing' visually. Interpreting speech through the visual signal alone is called lip reading. Lip reading has several potential applications, either as a complementary modality to speech recognition or as purely visual speech recognition, which gives rise to the silent speech interface, itself having numerous practical applications. Despite the overwhelming potential of such systems, research on lip reading for the Indonesian language has been extremely limited, with experimental settings still very distant from the real world. This research is an attempt to build a lip reading model with the potential to be applicable in the real world, specifically a lip reading model that supports variable-length sentences as input. We build the model using deep learning, specifically a spatiotemporal Convolutional Neural Network (CNN) and a Gated Recurrent Unit (GRU), which respectively form the spatiotemporal feature extractor and the character-level sentence decoder. During the process, we also investigate whether knowledge of lip reading in one language affects its acquisition in a different language. To the best of our knowledge, our model is the first sentence-level Indonesian lip reading model that supports variable-length input. Our model achieved superhuman performance on all metrics, with almost 20× better word accuracy.