Word boundary detection is one of the primary components of a speech recognition system. It can be learned jointly as part of the speech model or independently as an extra preprocessing step, which reduces the problem to conditionally independent word prediction. It can also be used to separate out-of-vocabulary (OOV) words in a sentence, thereby avoiding unnecessary computation. On its own, word boundary detection is essential in multimodal corpus collection, where it allows automated and detailed labeling of the dataset at either the sentence or the word level. In this research, we propose a novel approach to word boundary detection that uses only visual information, based on a 3-Dimensional Convolutional Neural Network (3D-CNN) and a Bidirectional Gated Recurrent Unit (Bi-GRU). This research helps pave the way for better lip reading systems, as well as multimodal speech recognition, as it allows easier creation of novel datasets and enables conventional word-level visual or multimodal speech recognition systems to work on continuous speech. Training was done on the GRID video corpus for 118 epochs. The proposed model performed well compared to the baseline method, with a considerably lower error rate.
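
To illustrate how boundary detection as a preprocessing step can reduce continuous speech to independent word-level prediction, the following is a minimal sketch, not the method used in this work. It assumes a detector that emits one boundary probability per video frame; the function name `split_at_boundaries`, the 0.5 threshold, and the convention that a boundary frame starts a new word are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def split_at_boundaries(
    boundary_probs: Sequence[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Split a frame sequence into word segments.

    boundary_probs[i] is an assumed per-frame probability that a new word
    starts at frame i. Returns half-open (start, end) frame ranges.
    """
    segments: List[Tuple[int, int]] = []
    start = 0
    for i, p in enumerate(boundary_probs):
        # Frame 0 always opens the first segment, so it never closes one.
        if i > 0 and p >= threshold:
            segments.append((start, i))
            start = i
    # Close the final segment if there were any frames at all.
    if len(boundary_probs) > 0:
        segments.append((start, len(boundary_probs)))
    return segments


# Hypothetical per-frame probabilities for a 7-frame clip.
probs = [0.9, 0.1, 0.2, 0.8, 0.1, 0.7, 0.3]
segments = split_at_boundaries(probs)
```

Each resulting frame range could then be passed separately to a word-level recognizer, which is the sense in which the segments are treated as conditionally independent.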