We can recognize a person by voice alone; in principle, each person's voice has a distinct tone (pitch). This study measures the performance of a Deep Neural Network (DNN) on static and dynamic prosodic features. Prosody is information about speech related to the tone, intonation, stress, duration, and rhythm of a person's pronunciation. The data consist of dictated and spontaneous speech taken from YouTube, comprising three male voices and one female voice. The recordings are segmented into durations of 3, 5, and 10 seconds. After segmentation, static prosodic features with 103 dimensions and dynamic prosodic features with 13 dimensions are extracted. Each feature set and their combination are trained and tested with a DNN using a 90:10 split. The results show that the 10-second segments yield higher accuracy than the shorter ones, and that static prosodic features outperform dynamic ones. The average DNN accuracy is 87.02% for static prosodic features, 72.97% for dynamic prosodic features, and 87.72% for the combined static and dynamic prosodic features.
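The pipeline described above (feature concatenation, 90:10 split, DNN classification) can be sketched minimally in NumPy. This is an illustrative sketch only: the sample count, hidden-layer width, and random features are hypothetical stand-ins, since the abstract specifies only the feature dimensions (103 static, 13 dynamic), the number of speakers (4), and the split ratio.

```python
import numpy as np

# Dimensions taken from the abstract: 103 static + 13 dynamic prosodic
# features, 4 speakers (three male, one female).
STATIC_DIM, DYNAMIC_DIM, N_SPEAKERS = 103, 13, 4

rng = np.random.default_rng(0)
n_samples = 200  # hypothetical number of audio segments
static_feats = rng.normal(size=(n_samples, STATIC_DIM))
dynamic_feats = rng.normal(size=(n_samples, DYNAMIC_DIM))
X = np.hstack([static_feats, dynamic_feats])      # combined feature set
y = rng.integers(0, N_SPEAKERS, size=n_samples)   # speaker labels

# 90:10 train/test split, as in the abstract.
split = int(0.9 * n_samples)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

def forward(x, w1, b1, w2, b2):
    """One hidden layer with ReLU, softmax output over speakers."""
    h = np.maximum(0.0, x @ w1 + b1)
    logits = h @ w2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

hidden = 64  # hypothetical hidden width; the abstract does not state one
w1 = rng.normal(scale=0.1, size=(X.shape[1], hidden)); b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.1, size=(hidden, N_SPEAKERS)); b2 = np.zeros(N_SPEAKERS)

# Untrained forward pass: each row of `probs` is a distribution over speakers.
probs = forward(X_test, w1, b1, w2, b2)
```

In practice the weights would be trained with a cross-entropy loss; the sketch only shows the data layout and the forward computation.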