The growth of information over the last two decades has been dominated by multimedia data such as text, image, audio, and video. The low-level features of multimedia data should be mapped to high-level concepts that humans can easily understand. Multi-format object classification is a technique for representing multi-source objects such as text, image, audio, and video together. The object's features are extracted and then categorized into several specified classes or concepts. This paper adopts deep learning techniques: (1) Convolutional Neural Networks (CNN) for classifying image, audio, and video, and (2) Recurrent Neural Networks (RNN) for classifying text. The experiment uses a small dataset from the Indonesian cultural heritage domain. As a form of supervised learning, the model output is grouped into five classes based on Indonesian ethnic groups (Toraja, Bali, Batak, Dayak, and Betawi). As a result, this classification model can be implemented in a Multimedia Information Retrieval System and a Recommender System for Indonesian cultural heritage.
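
The routing of each modality to a model family, and the mapping of network outputs to the five classes, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `softmax` and `classify`, the `MODEL_FAMILY` table, and the raw `scores` list (a stand-in for the output of a trained CNN or RNN) are assumptions introduced only for illustration. Only the class labels and the modality-to-model pairing come from the text above.

```python
import math

# Class labels as listed in the abstract; everything else here is illustrative.
CLASSES = ["Toraja", "Bali", "Batak", "Dayak", "Betawi"]

# Modality-to-model-family pairing described in the abstract.
MODEL_FAMILY = {"image": "CNN", "audio": "CNN", "video": "CNN", "text": "RNN"}


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(modality, scores):
    """Return (model family, predicted class) for one object.

    `scores` is a hypothetical stand-in for the raw outputs of a trained
    network, one value per class.
    """
    if modality not in MODEL_FAMILY:
        raise ValueError(f"unsupported modality: {modality}")
    if len(scores) != len(CLASSES):
        raise ValueError("expected one score per class")
    probs = softmax(scores)
    best = max(range(len(CLASSES)), key=probs.__getitem__)
    return MODEL_FAMILY[modality], CLASSES[best]
```

For example, `classify("text", [0.1, 2.0, 0.3, 0.0, -1.0])` selects the RNN family and the class with the highest score. In a real system the scores would come from the trained networks rather than being supplied by hand.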