Open Access

| Field | Value |
|---|---|
| Issue | MATEC Web Conf., Volume 413 (2025): International Conference on Measurement, AI, Quality and Sustainability (MAIQS 2025) |
| Article Number | 06003 |
| Number of page(s) | 5 |
| Section | Artificial Intelligence in Societies |
| DOI | https://doi.org/10.1051/matecconf/202541306003 |
| Published online | 01 October 2025 |

