What is a Vision Transformer?

Vision Transformer

A Vision Transformer (ViT) applies the Transformer architecture to image recognition by treating image patches as tokens. ViTs have matched or exceeded CNNs on many computer vision benchmarks.

Understanding Vision Transformer

The Vision Transformer (ViT) adapts the transformer architecture from natural language processing to computer vision by treating an image as a sequence of fixed-size patches, each linearly embedded as a token. This approach demonstrated that pure self-attention models, without any convolutional layers, can achieve state-of-the-art results on image classification benchmarks when pre-trained on sufficiently large datasets. ViT divides an image into a grid of non-overlapping patches (typically 16x16 pixels), flattens each patch into a vector, projects it linearly to the model dimension, and adds a position embedding so that the spatial layout is not lost. The resulting token sequence is processed by standard transformer encoder layers with multi-head self-attention. The architecture scales well: larger models and larger training datasets have continued to yield performance improvements. Vision transformers have since expanded beyond classification to object detection, semantic segmentation, and image generation, challenging the long-standing dominance of CNNs in computer vision tasks.
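The patch-to-token step and a single self-attention pass can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names (`patchify`, `embed_patches`, `self_attention`), the 224x224 input size, the embedding width of 64, and the random weights are all assumptions chosen for the example. It also uses one attention head and omits the class token, layer normalization, and MLP blocks that a full encoder would include.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(patches, weight, bias, pos_embed):
    """Linearly project each flattened patch and add a position embedding."""
    return patches @ weight + bias + pos_embed

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Illustrative sizes and random parameters (assumptions, not trained values).
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
embed_dim = 64

patches = patchify(image)  # 14 x 14 grid of 16x16x3 patches
weight = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
bias = np.zeros(embed_dim)
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0], embed_dim))
tokens = embed_patches(patches, weight, bias, pos_embed)

w_q, w_k, w_v = (rng.normal(scale=0.02, size=(embed_dim, embed_dim)) for _ in range(3))
attended, attn_weights = self_attention(tokens, w_q, w_k, w_v)
```

With these sizes, each patch flattens to 16 * 16 * 3 = 768 values and the image yields 14 * 14 = 196 tokens, so every token can attend to every other patch in the image within a single layer.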

Related Computer Vision Terms

Bounding Box

Computer Vision

Face Recognition

Image Captioning

Image Classification

Image Segmentation

Instance Segmentation

Masked Autoencoder