The tourism sector has become one of the most promising sources of income for some countries. One way to increase tourism income is to apply information technology that attracts more visitors, and smart tourism is one such technology. One smart tourism implementation for Indonesian tourism, specifically for the Monas tourist destination, is a mobile-based Visual Question Answering (VQA) application that provides detailed information about a tourist object captured by a mobile phone camera. The focus of this thesis is to produce a trained model with good detection accuracy; the resulting model will serve as the object detection model used for VQA. The dataset consists of 600 images containing Monas and 25 surrounding objects, with each object treated as a class. Two object detection methods, YOLO and RetinaNet, are compared using the accuracy reported by their evaluation metrics. On the original dataset, YOLO achieves a mean average precision (mAP) between 60.77% and 71.99% across confidence thresholds from 0.1 to 0.9, while RetinaNet achieves an mAP between 72.18% and 92.98%. On the augmented dataset, YOLO achieves an mAP between 52.51% and 93.72%, while RetinaNet achieves an mAP between 23.8% and 56.19%. With YOLO, the Area Under Curve (AUC) score is 0.99 for the original dataset and 0.96 for the augmented dataset. Based on the evaluation results, YOLO detects objects better than RetinaNet, and the augmented dataset produces better detection than the original dataset.
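As an illustration of how an mAP score at a given confidence threshold might be computed, the following is a minimal sketch and not the evaluation code used in this thesis. It assumes that each detection has already been matched against the ground truth (so it carries a true-positive flag), and it uses all-point interpolation of the precision-recall curve, which is only one of several common conventions. The function names, class names, and sample values are hypothetical.

```python
from typing import Dict, List, Tuple

# A detection is (confidence score, is_true_positive).
# Assumption: matching to ground truth (e.g. by IoU) was done beforehand.
Detection = Tuple[float, bool]


def average_precision(detections: List[Detection], num_gt: int,
                      conf_threshold: float) -> float:
    """AP for one class, using only detections at or above conf_threshold."""
    if num_gt == 0:
        return 0.0
    kept = sorted((d for d in detections if d[0] >= conf_threshold),
                  key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions: List[float] = []
    recalls: List[float] = []
    for _, is_tp in kept:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Make precision non-increasing from left to right (interpolation envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the interpolated precision-recall curve.
    ap = 0.0
    prev_recall = 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class: Dict[str, Tuple[List[Detection], int]],
                           conf_threshold: float) -> float:
    """Mean of the per-class AP values at one confidence threshold."""
    if not per_class:
        return 0.0
    aps = [average_precision(dets, num_gt, conf_threshold)
           for dets, num_gt in per_class.values()]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    # Hypothetical toy data: class name -> (detections, ground-truth count).
    toy = {
        "monas": ([(0.9, True), (0.8, False), (0.6, True), (0.3, True)], 4),
        "flag": ([(0.7, True)], 1),
    }
    for threshold in (0.1, 0.5, 0.9):
        print(threshold, mean_average_precision(toy, threshold))
```

Sweeping the threshold from 0.1 to 0.9 in this way yields one mAP value per threshold, which is how a range of mAP scores such as those reported above can arise from a single trained model.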