# Vision Transformer
## Who / What
The **Vision Transformer** (ViT) is a deep learning architecture for computer vision tasks. It processes an image by dividing it into fixed-size patches, linearly projecting each patch into a vector embedding, and then applying a standard transformer encoder to these embeddings as if they were a sequence of tokens.
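The patch-to-token step above can be sketched as follows. This is a minimal illustration, not the reference implementation: the projection matrix here is random, standing in for the learned linear projection (and omitting the class token and position embeddings) of a real ViT.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, embed_dim=64, rng=None):
    """Split an image of shape (H, W, C) into non-overlapping patches and
    linearly project each flattened patch to an embedding vector.
    The projection is random here, a stand-in for learned ViT weights."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Rearrange (H, W, C) into (num_patches, patch_size * patch_size * C)
    patches = (image
               .reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch_size * patch_size * c))
    projection = rng.standard_normal((patches.shape[1], embed_dim))
    return patches @ projection  # (num_patches, embed_dim): the token sequence

tokens = image_to_patch_embeddings(np.zeros((224, 224, 3)), patch_size=16)
print(tokens.shape)  # (196, 64): 14 x 14 patches, each a 64-dim token
```

For a 224x224 image with 16x16 patches this yields 196 tokens, which is why ViT paper titles describe an image as "16x16 words."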
---
## Background & History
The Vision Transformer was introduced in 2020 as an adaptation of the Transformer architecture originally developed for natural language processing (NLP). Inspired by how transformers process text sequences, researchers handled visual data by treating images as sequences of patch embeddings. Building on advances in attention mechanisms and large-scale pretraining, ViTs showed that, given enough pretraining data, a pure transformer can match or exceed traditional convolutional neural networks (CNNs) on tasks like image classification.
---
## Why Notable
ViT has reshaped computer vision by leveraging a key strength of transformers: their ability to capture long-range dependencies in data. Unlike CNNs, which build up representations through local spatial hierarchies, ViTs process images as sequences of patches, so every patch can attend to every other patch and global relationships are modeled directly. This approach reached state-of-the-art accuracy on benchmarks like ImageNet when pretrained at scale and has spurred further research into hybrid architectures (e.g., combining CNNs with transformers).
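The global interaction described above is what self-attention provides: in a single layer, each patch token mixes information from all other patches, whereas a convolution only sees a local neighborhood. A minimal single-head sketch, with random matrices standing in for learned query/key/value weights:

```python
import numpy as np

def global_self_attention(tokens, rng=None):
    """Single-head scaled dot-product attention over patch tokens.
    Every patch attends to every other patch in one step; weights are
    random stand-ins for learned ViT parameters."""
    rng = np.random.default_rng(1) if rng is None else rng
    n, d = tokens.shape
    wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(d)  # (n, n): similarity of every patch pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v  # each output token is a mixture of all patches

out = global_self_attention(np.random.default_rng(2).standard_normal((196, 64)))
print(out.shape)  # (196, 64)
```

The (n, n) score matrix is the source of both the strength (global receptive field in one layer) and the cost (quadratic scaling in the number of patches) of the approach.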
---
## In the News
As a foundational model in modern vision AI, ViT remains influential in both academic and industrial settings. Recent advancements include larger-scale variants (e.g., Vision Transformers with billions of parameters) that push the state of the art in tasks like object detection, segmentation, and zero-shot learning. Its adaptability has also led to applications in medical imaging, autonomous systems, and generative AI.
---
## Key Facts
- Introduced in 2020 by Dosovitskiy et al. ("An Image Is Worth 16x16 Words"), published at ICLR 2021.
- Splits an image into fixed-size patches (commonly 16x16 pixels) and processes them as a token sequence with a standard transformer encoder.
- Matches or exceeds CNN accuracy on image classification benchmarks when pretrained on sufficiently large datasets.
- Has spawned many variants and hybrid CNN-transformer architectures used in detection, segmentation, and multimodal models.
---