The left ventricular of ejection fraction is one of the most important metric of cardiac function. It is used by cardiologist to identify patients who are eligible for life-prolonging therapies. However, the assessment of ejection fraction suffers from inter-observer variability. To overcome this challenge, we propose a deep learning approach, based on hierarchical vision Transformers, to estimate the ejection fraction from echocardiogram videos. The proposed method can estimate ejection fraction without the need for left ventrice segmentation first, make it more efficient than other methods. We evaluated our method on EchoNet-Dynamic dataset resulting 5.59, 7.59 and 0.59 for MAE, RMSE and R2 respectivelly. This results are better compared to the state-of-the-art method, Ultrasound Video Transformer (UVT). The source code is available on https://github.com/lhfazry/UltraSwin.