Enhancing Image Classification with Vision Transformers: A Comparative Study with CNN Models
DOI: https://doi.org/10.7492/3a3y2090

Abstract
Image classification is a fundamental task in computer vision, traditionally dominated by Convolutional Neural Networks (CNNs) owing to their ability to efficiently learn spatial hierarchies through convolutional layers. The emergence of Vision Transformers (ViTs), however, has introduced a new paradigm that leverages self-attention mechanisms to capture global contextual information more effectively than traditional convolutions. This research presents a comprehensive comparative study of CNNs and ViTs on three standard datasets: CIFAR-10, CIFAR-100, and ImageNet. We evaluate performance in terms of accuracy, inference time, memory consumption, and interpretability. The results indicate that ViTs outperform CNNs in classification accuracy, particularly on high-resolution datasets such as ImageNet, where global attention mechanisms are advantageous. Conversely, CNNs demonstrate superior inference speed and lower memory usage, making them better suited to low-resolution tasks and resource-constrained environments. The Swin Transformer emerged as the top performer among ViTs, combining hierarchical attention with competitive efficiency. In addition, ViTs offered more transparent interpretability through attention maps that highlight global feature dependencies. The study identifies critical research gaps, including data efficiency, memory optimization, and hybrid architectures, and proposes future directions to bridge the gap between CNN efficiency and ViT global context modeling. This analysis serves as a foundation for optimizing image classification models and advancing the development of more robust and efficient deep learning architectures.
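To make the efficiency criteria concrete, the following is a minimal, illustrative sketch (not the study's benchmarking code) of how a CNN and two ViT variants might be compared on parameter count and single-image inference latency. The torchvision ResNet-50 and the timm model identifiers are assumptions chosen for demonstration; accuracy evaluation on CIFAR-10, CIFAR-100, and ImageNet is omitted.

```python
# Illustrative sketch only: compares a CNN (ResNet-50) and two ViT variants
# (ViT-B/16, Swin-T) on parameter count and forward-pass latency with random input.
import time

import torch
import timm  # assumption: timm is used here for the ViT and Swin model definitions
from torchvision import models


def benchmark(model: torch.nn.Module, name: str, resolution: int = 224) -> None:
    """Report parameter count and average inference latency for one image."""
    model.eval()
    x = torch.randn(1, 3, resolution, resolution)
    with torch.no_grad():
        model(x)  # warm-up pass
        start = time.perf_counter()
        for _ in range(10):
            model(x)
        latency_ms = (time.perf_counter() - start) / 10 * 1000
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name:<20} {params_m:6.1f}M params  {latency_ms:7.1f} ms/image")


if __name__ == "__main__":
    benchmark(models.resnet50(weights=None), "ResNet-50 (CNN)")
    benchmark(timm.create_model("vit_base_patch16_224", pretrained=False), "ViT-B/16")
    benchmark(timm.create_model("swin_tiny_patch4_window7_224", pretrained=False), "Swin-T")
```

A sketch along these lines illustrates the accuracy-versus-efficiency trade-off the abstract describes: ViT-style models typically carry more parameters and higher per-image latency than a comparable CNN, which is why the study weighs inference speed and memory alongside accuracy.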