BERT for Image Transformers (BEiT): A Definitive Guide to a Computer Vision Breakthrough

Convolutional Neural Networks (CNNs) have long dominated the computer vision landscape. The simplicity of capturing visual feature patterns with learned kernels, and the capability to downsample complex images, are key reasons for their success in vision tasks.

CNNs apply learnable kernels locally to feature map patches to detect feature patterns. While this is an effective strategy for capturing complicated, nuanced patterns in images, it struggles to capture global context. Even at a fundamental level, the kernels used by CNNs are limited by their receptive fields.
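To make the receptive-field limit concrete, the sketch below computes how far a stack of convolutions can "see". It assumes plain stride-1 convolutions with no dilation or pooling, and the function name `receptive_field` is illustrative rather than taken from any library.

```python
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field (pixels per side) of stacked stride-1 convolutions.

    Assumes no dilation, pooling, or striding: each layer widens the
    field by (kernel_size - 1) pixels.
    """
    return 1 + num_layers * (kernel_size - 1)


if __name__ == "__main__":
    for depth in (1, 5, 10):
        print(depth, receptive_field(depth))
```

Under these assumptions, even ten stacked 3×3 layers cover only a 21×21 window, so distant regions of a large image interact only in very deep layers or after downsampling.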

Transformers in Computer Vision

Transformers have become the dominant architecture for Natural Language Processing (NLP) tasks in recent years. They work by splitting the input sequence into smaller tokens or feature patches and drawing correspondences between these tokens. This helps capture global context, as each feature patch is correlated with every other feature patch, and the model learns to pay more attention to the more distinctive features.

This mechanism is called attention, and these attention units are the building blocks for transformers.
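As a rough illustration, single-head scaled dot-product self-attention can be sketched in a few lines of NumPy. The shapes, the random projection matrices, and the function names are assumptions made for this example; in a real transformer the projections are learned and several heads run in parallel.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (N, D) array of token embeddings.
    w_q, w_k, w_v: (D, D_head) projections (random here, learned in practice).
    Returns the attended outputs (N, D_head) and the (N, N) attention weights.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # every token scored against every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))              # 4 tokens, 8-dimensional embeddings
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(tokens, w_q, w_k, w_v)
```

Row i of `attn` holds how strongly token i attends to every token in the sequence, which is what gives each token access to global context.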

The ability to capture global context has attracted researchers to apply transformers to vision tasks. The pioneering research in this direction produced the Vision Transformer (ViT).

ViT processes its inputs the same way transformer-based networks process NLP data. The image is split into patches of equal size; these patches are processed simultaneously, and correlations are drawn between them. ViT uses self-attention to learn these correlations and to focus on distinctive features.

Limitations of ViT

Transformers have proved to be a revolutionary step forward in NLP research. Excited by the results in NLP, researchers in computer vision have been experimenting with ways to train transformers on vision data.

The Vision Transformer was introduced in a paper titled An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale, and is the pioneering step in training transformer networks for vision tasks. Training Vision Transformers directly on mid-sized image datasets does not produce results comparable to the ResNet architecture and its variants.

However, ViT shows excellent results when trained on images from large datasets.

Illustration of self-supervised learning

This exposes the main limitation of Vision Transformers: they require huge amounts of data to learn a representation good enough to perform on par with state-of-the-art architectures.

Architectures like ResNet exhibit similar levels of performance with significantly smaller datasets. This is a major roadblock for computer vision, as acquiring such large amounts of labeled image data consumes significant resources. The burden grows substantially for tasks like object detection and image segmentation, because labeling image data for these tasks is far more demanding than for image classification.

For transformers to challenge the dominance of CNNs in computer vision tasks, it is important to deploy mechanisms that allow them to learn from large amounts of unlabeled image data. To resolve this issue, researchers have explored training transformers beyond the paradigms laid down by supervised learning.

Self-Supervised Training

Self-supervised learning is a developing machine-learning paradigm where a model learns to represent and understand input data without explicit labels. The learning task is designed in such a way that the model can learn the features and representation of the input data from the data itself.

For example, in natural language processing, a model might be trained to predict missing words in a sentence, and in computer vision, it could be trained to predict the relative spatial position of image patches.

Another such task is predicting the rotation of an image, as depicted in the figure below. The idea is to pre-train the model on a self-supervised task, which allows the model to extract feature representations from the data; this helps with knowledge transfer when training on downstream vision tasks.
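A minimal sketch of how a rotation pretext task can manufacture its own labels is shown here. The helper name `make_rotation_batch`, the image shapes, and the restriction to square images are assumptions for illustration only.

```python
import numpy as np


def make_rotation_batch(images: np.ndarray, rng: np.random.Generator):
    """Turn unlabeled square images into (rotated image, rotation label) pairs.

    images: (B, H, W, C) array with H == W.
    A label k in {0, 1, 2, 3} means the image was rotated by k * 90 degrees,
    so the supervision signal comes from the data itself.
    """
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack([
        np.rot90(img, k=int(k), axes=(0, 1))
        for img, k in zip(images, labels)
    ])
    return rotated, labels


rng = np.random.default_rng(0)
unlabeled = rng.random((6, 32, 32, 3))        # stand-in for unlabeled images
rotated, labels = make_rotation_batch(unlabeled, rng)
```

A classifier trained to recover `labels` from `rotated` never sees a human annotation, yet it has to learn about object shape and orientation to succeed.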

A visual breakdown of self-supervised representation learning

Because neither uses labeled data, self-supervised learning is often conflated with unsupervised learning. Although both paradigms remove the need for labels, there are important differences between them.

Self-supervised learning tasks, although they do not need labels, still use a feedback mechanism to drive learning. The feedback signal is derived from the input data itself. Unsupervised learning methods, on the other hand, use no such feedback; they rely heavily on the structure of the model and are well suited to tasks like clustering and dimensionality reduction.

Self-supervised learning was explored because of multiple shortcomings of supervised learning methods:

  1. Resource burden: Supervised learning uses labeled data to train a model, and labeling data is a difficult and time-consuming process. The financial cost of obtaining such labeled datasets is also very high because of the manpower spent on labeling.
  2. Data pipeline burden: State-of-the-art machine learning models need extensive data pipelines that clean, filter, generalize, annotate, and restructure data to make it suitable for training.
  3. General AI: Self-supervised models can train on the large amount of unlabeled data available on the internet without requiring any labels.

BEiT Motivation

As discussed before, the downside of Vision Transformers and BERT models is their inability to learn meaningful representations from small and mid-sized datasets. They need large datasets to produce results comparable to state-of-the-art CNNs. This typically happens because transformers, by design, lack inductive biases such as translation equivariance and locality.

As Vision Transformers grow bigger, they also need bigger datasets. The lack of such large annotated datasets in computer vision severely limits the capability of Vision Transformers.

To solve this, BEiT (Bidirectional Encoder representation from Image Transformers) was introduced. The idea is to pre-train a vision transformer in a self-supervised, BERT-like manner to alleviate the need for large labeled datasets. The pre-trained model can then be fine-tuned for downstream vision tasks, such as object detection and segmentation, using readily available smaller datasets.

The pre-training step allows the model to learn a deep representation of vision data from unlabeled datasets of millions of images available on the internet. This training helps the model generalize over a broad range of image contexts, after which it can be easily fine-tuned to perform context-specific vision tasks.

The pre-trained model can then be treated as a general-purpose model and used off the shelf for fine-tuning, reducing the training time to a few hours and the required labeled data to a few thousand examples, all while still maintaining state-of-the-art performance.

BEiT Architecture and Pre-Training

BEiT primarily follows the ViT architecture depicted below for image sequences. The difference lies in the training methodology: the BEiT model first pre-trains on a huge image dataset in a self-supervised manner.

The pretext task for the self-supervised training is Masked Image Modeling. The model pre-trains to predict masked entities of an image sequence. This pre-training phase does not require labeled data, hence the model can train on a huge number of images.

This helps the model gain a reliable representation of image features and patterns; the model can then be easily trained to accomplish downstream tasks using smaller labeled datasets. This knowledge transfer is made possible by the pre-training step, which lets the model learn general representations from image data that is not tailored to any particular task.

An image is worth 16X16 words: Transformers for image recognition at scale.

The backbone architecture for the BEiT model follows the ViT backbone architecture.

The input is a sequence of image patches [latex]\{x^p_i\}^N_{i=1}[/latex]. The patches are flattened and mapped to patch embeddings by a linear projection [latex]Ex^p_i[/latex], where [latex]E \in R^{(P^2 C) \times D}[/latex], P is the patch size, C is the number of channels, and D is the embedding dimension. These patch embeddings are simply linear projections of the flattened image patches.
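The patch-splitting and projection step can be sketched as follows. The image size (224×224×3), patch size (16), and embedding dimension (64) are illustrative assumptions, the helper name `patchify` is made up for this example, and the projection matrix `E` is random here whereas it is learned in a real model.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into N = (H/P) * (W/P) flattened patches.

    Returns an (N, P*P*C) array; patches are ordered row by row.
    """
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    grid = image.reshape(h // p, p, w // p, p, c)    # (rows, P, cols, P, C)
    grid = grid.transpose(0, 2, 1, 3, 4)             # (rows, cols, P, P, C)
    return grid.reshape(-1, p * p * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                    # H x W x C (assumed size)
patches = patchify(image, patch_size=16)             # (196, 768): N x (P^2 C)

embed_dim = 64                                       # D (illustrative)
E = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
patch_embeddings = patches @ E                       # (N, D)
```

With these sizes the image yields N = 196 patches, each flattened to P²C = 768 values before projection.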

The embedding can be considered a feature encoding of the input image patch. As is usual in a transformer-based architecture, a special token [S] is prepended to the sequence of patch embeddings. A learnable 1D position embedding [latex]E_{pos} \in R^{N \times D}[/latex] is added to the patch embeddings. The encoder contains L layers of Transformer blocks, [latex]H_l = \mathrm{Transformer}(H_{l-1})[/latex], where l = 1, . . . , L. The transformer layers process these inputs and produce output vectors.
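The sketch below assembles the encoder input and threads it through L layers. All sizes are illustrative, `transformer_block` is a shape-preserving placeholder rather than a real attention block, and giving the [S] slot its own position embedding (N + 1 rows) is an assumption made so the shapes line up.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim, num_layers = 196, 64, 2      # N, D, L (illustrative)

patch_embeddings = rng.normal(size=(num_patches, embed_dim))   # stand-in for E x_i^p
special_token = rng.normal(size=(1, embed_dim))                # e_[S], learned in practice
# Assumption: one position embedding per sequence slot, including [S].
pos_embedding = rng.normal(scale=0.02, size=(num_patches + 1, embed_dim))

# H_0 = [e_[S], E x_1^p, ..., E x_N^p] + E_pos
h = np.concatenate([special_token, patch_embeddings], axis=0) + pos_embedding


def transformer_block(x: np.ndarray) -> np.ndarray:
    """Placeholder for a real Transformer block (attention + MLP).

    It is only a shape-preserving residual map so the loop below runs.
    """
    return x + np.tanh(x)


for _ in range(num_layers):                          # H_l = Transformer(H_{l-1})
    h = transformer_block(h)

patch_outputs = h[1:]                                # encoded patches h_i^L, shape (N, D)
```

The point of the sketch is the bookkeeping: the sequence has N + 1 slots going in, and the N patch outputs are read off after the last layer.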

BEiT architecture

These output vectors are the encoded representation of the input patches and can be used as feature representation for the downstream tasks.

During training, these encodings are tuned as the weights of the transformer layers are adjusted, so that they represent the features of the image data accurately and compactly. The pre-training step refines this feature representation and helps the model generalize better.

During pre-training, the encodings are used to train on the pretext task of masked image modeling.

Image Tokenization

As in natural language, images are represented as a sequence of discrete tokens, which are obtained from an image tokenizer.

The tokens represent the encoding of the image patches and are easier for the model to learn than raw pixels. For BEiT, images are tokenized from [latex]x \in R^{H \times W \times C}[/latex] to [latex]z = [z_1, \ldots, z_N] \in V^{h \times w}[/latex], where the vocabulary V = {1, . . . , |V|}. Here the vocabulary is a visual codebook.

An example of the visual codebook

The tokenizer is learned by training a discrete variational autoencoder (dVAE). The dVAE is trained to reconstruct an image and has two components: a tokenizer and a decoder. The tokenizer is responsible for converting the image pixels into tokens, and the decoder reconstructs the image from them. For BEiT, each image is tokenized into a 14×14 grid of tokens.

These tokens derived from images are called visual tokens. It is worth noting that the size of the token grid directly matches the number of image patches in the input sequence. The tokenizer used for BEiT is the publicly available tokenizer proposed in the paper Zero-Shot Text-to-Image Generation.
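The arithmetic behind the 14×14 grid, and the shape of the tokenizer's output, can be sketched as below. The 224×224 input size, 16×16 patch size, and codebook size of 8192 are assumptions for this example, and the random `logits` array merely stands in for what a real dVAE tokenizer would compute from the pixels.

```python
import numpy as np

image_size, patch_size = 224, 16        # assumed input resolution and patch size
vocab_size = 8192                       # |V|, assumed codebook size

grid = image_size // patch_size         # 14 patches per side
num_patches = grid * grid               # N = 196

# Stand-in for the tokenizer output: one score per codebook entry at each
# grid position. A real tokenizer derives these scores from the image.
rng = np.random.default_rng(0)
logits = rng.normal(size=(grid, grid, vocab_size))

visual_tokens = logits.argmax(axis=-1)   # (14, 14) integer ids into the codebook
flat_tokens = visual_tokens.reshape(-1)  # z_1, ..., z_N, aligned with the N patches
```

Because the token grid and the patch grid have the same layout, token z_i can serve directly as the prediction target for patch i.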

The visual tokens generated by the tokenizer are used to pre-train the BEiT architecture for masked image modeling tasks.

Masked Image Modeling

Self-supervised learning needs a pretext task to train a model. This task is a capability that the model tries to achieve after training and learning from the data.

An important thing to note is that this task has to be feasible without labels. This allows vast amounts of unlabeled data available publicly to be utilized by the model to learn the feature representation of images. The flexibility to use unlabeled data also removes the restriction to collect data in task-specific contexts. The data collected can be entirely general in nature, which also is a step closer to achieving general artificial intelligence.

In BEiT, the pretext task that the model trains on before any vision-specific task is masked image modeling. The idea is to mask out random patches of the image in the sequence presented as input to the transformer; this is also known as corrupting the input image.

The transformer is then trained to predict, at its output layer, the tokens for these patches. The tokens predicted by the transformer are compared to the visual tokens produced by the image tokenizer. The error between the two is then back-propagated through the network to adjust the weights, as in any machine learning pipeline.

Once trained, the model is capable of producing the correct visual tokens, including those for the masked-out patches. In other words, the model learns to reconstruct a corrupted image by generating the correct tokens for the masked-out patches.

In the process of this pre-training, the model gains a rich understanding of feature representation of huge image datasets presented to it. The knowledge about this feature representation is stored in the encoder module of the BEiT model, and can then be transferred to downstream tasks such as image classification, image segmentation and object detection.

Because of this prior knowledge, it is easier to fine-tune the model on these downstream tasks with much smaller datasets.

As described above, an input image x is split into N image patches ([latex]\{x^p_i\}^N_{i=1}[/latex]) and also tokenized into N visual tokens ([latex]\{z_i\}^N_{i=1}[/latex]). About 40% of these image patches are masked before being fed into the model; the masked positions are denoted [latex]M \in \{1, \ldots, N\}^{0.4N}[/latex].
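Choosing the masked positions M can be sketched as follows. This version draws positions uniformly at random for brevity, whereas the BEiT paper samples them in contiguous blocks; the helper name `sample_mask` and N = 196 are assumptions for this example.

```python
import numpy as np


def sample_mask(num_patches: int, mask_ratio: float,
                rng: np.random.Generator) -> np.ndarray:
    """Return a boolean vector with round(mask_ratio * num_patches) True entries.

    Positions are drawn uniformly at random without replacement.
    """
    num_masked = int(round(mask_ratio * num_patches))
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked_idx] = True
    return mask


rng = np.random.default_rng(0)
mask = sample_mask(num_patches=196, mask_ratio=0.4, rng=rng)
```

With N = 196 and a ratio of 0.4, this masks 78 of the patches.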

The masked, or corrupted, sequence of image patches is [latex]x_M = \{x^p_i : i \notin M\}^N_{i=1} \cup \{e_{[M]} : i \in M\}^N_{i=1}[/latex], where [latex]e_{[M]}[/latex] is a learnable mask embedding that replaces each masked patch. This sequence is fed into the transformer layers; with L layers, the model produces the encoded representations [latex]\{h^L_i\}^N_{i=1}[/latex]. A softmax classifier on top of each encoding then predicts the visual token, [latex]p_{MIM}(z' | x_M) = \mathrm{softmax}_{z'}(W_c h^L_i + b_c)[/latex], where [latex]x_M[/latex] is the masked image and [latex]W_c[/latex] and [latex]b_c[/latex] are the weights and bias of the classifier.
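A rough NumPy sketch of the corruption step and the softmax prediction head follows. The sizes are illustrative, all parameters are random rather than learned, and the `np.tanh` call is only a stand-in for the L Transformer layers. `W_c` is stored as a (D, |V|) matrix here because the encodings are kept as row vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim, vocab_size = 196, 64, 8192   # N, D, |V| (illustrative)

patch_embeddings = rng.normal(size=(num_patches, embed_dim))
mask_embedding = rng.normal(size=(embed_dim,))       # e_[M], learned in practice

mask = np.zeros(num_patches, dtype=bool)
mask[rng.choice(num_patches, size=78, replace=False)] = True   # about 0.4 * N

# x_M: keep unmasked patch embeddings, swap masked ones for e_[M].
corrupted = np.where(mask[:, None], mask_embedding, patch_embeddings)

# Stand-in for the L Transformer layers. A real encoder mixes information
# across positions so that masked slots can be inferred from context.
hidden = np.tanh(corrupted)                          # h_i^L, shape (N, D)

W_c = rng.normal(scale=0.02, size=(embed_dim, vocab_size))
b_c = np.zeros(vocab_size)
logits = hidden @ W_c + b_c                          # (N, |V|)
logits -= logits.max(axis=-1, keepdims=True)         # stabilize before exponentiating
exp = np.exp(logits)
probs = exp / exp.sum(axis=-1, keepdims=True)        # p_MIM(z' | x_M) per position
```

Each row of `probs` is a distribution over the visual codebook for one patch position.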

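The pre-training objective is to maximize the log-likelihood of the correct visual tokens [latex]z_i[/latex] at the masked positions, given the corrupted image: [latex]\max \sum_{x \in D} E_M \left[ \sum_{i \in M} \log p_{MIM}(z_i | x_M) \right][/latex], where D is the training corpus and M is the set of randomly masked positions.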
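In practice, maximizing this log-likelihood is the same as minimizing a cross-entropy loss restricted to the masked positions. The sketch below is one way to write that down; the function name `mim_loss` and the random inputs are assumptions for illustration.

```python
import numpy as np


def mim_loss(logits: np.ndarray, visual_tokens: np.ndarray,
             mask: np.ndarray) -> float:
    """Average negative log-likelihood of the target tokens at masked positions.

    logits: (N, |V|) scores from the prediction head.
    visual_tokens: (N,) integer targets from the image tokenizer.
    mask: (N,) boolean, True where the patch was masked.
    Minimizing this value maximizes the masked-token log-likelihood.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_log_probs = log_probs[np.arange(len(visual_tokens)), visual_tokens]
    return float(-token_log_probs[mask].mean())


rng = np.random.default_rng(0)
num_patches, vocab_size = 196, 8192
logits = rng.normal(size=(num_patches, vocab_size))
visual_tokens = rng.integers(0, vocab_size, size=num_patches)
mask = rng.random(num_patches) < 0.4

loss = mim_loss(logits, visual_tokens, mask)
```

Unmasked positions contribute nothing to the loss, so the model is rewarded only for inferring what it could not see.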

Downstream Tasks and Results

Once pre-trained, the model can be trained further on specialized problems such as image segmentation and object detection. This process is called fine-tuning, and it leverages the knowledge gained during the pre-training phase to converge quickly on the downstream tasks.
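For image classification, one common way to fine-tune is to pool the encoder's patch outputs and attach a small task-specific head. The sketch below shows only the forward pass; the sizes, the random stand-in for the encoder output, and the parameter names `W_head` and `b_head` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim, num_classes = 196, 64, 10    # illustrative sizes

# Stand-in for the pre-trained encoder's output on one image.
patch_outputs = rng.normal(size=(num_patches, embed_dim))

# New task-specific parameters, trained during fine-tuning.
W_head = rng.normal(scale=0.02, size=(embed_dim, num_classes))
b_head = np.zeros(num_classes)

pooled = patch_outputs.mean(axis=0)                  # average over patches -> (D,)
logits = pooled @ W_head + b_head                    # (num_classes,)
exp = np.exp(logits - logits.max())                  # stable softmax
class_probs = exp / exp.sum()
predicted_class = int(class_probs.argmax())
```

Only the head starts from scratch; the encoder weights begin from their pre-trained values, which is why far fewer labeled examples are needed.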

BEiT uses the knowledge from its pre-training task to converge faster on image classification, segmentation, and object detection. Compared to vision transformers that are randomly initialized and trained on the ImageNet-1K dataset, BEiT performs better on the image classification task. BEiT is also seen to perform better than other self-supervised models.

For semantic segmentation, BEiT is compared to supervised pre-training, which relies on labeled data. The pre-training dataset considered here is ImageNet as well.

BEiT once again performs better than supervised pre-trained networks, even though it does not need any labeled data during the pre-training phase.


Vision Transformers, though capable of replicating the success of Transformers on NLP tasks, need to be trained on gigantic datasets. In computer vision, curating labeled datasets of such size is a resource burden that requires significant investment.

BEiT aims to resolve this bottleneck by pre-training vision transformers with unlabeled data using self-supervised techniques.

Self-supervised training uses a pretext task that does not require labels, such as masked image modeling, to help a model learn the underlying feature representation of images. Since no labels are needed, publicly available images can be used directly, resulting in huge datasets.

BEiT uses the Vision Transformer architecture and splits the input image into a sequence of patches, as in ViT. Some patches in the sequence are masked, and the transformer's task is to predict a token for each masked patch. A softmax layer produces the predicted token distribution, which is compared with the visual tokens generated by a publicly available image tokenizer.

BEiT outperforms existing state-of-the-art vision transformers on image classification and semantic segmentation. This is due to the pre-training step, which helps the model gain an extensive representation of the features in the images of huge datasets. Once this underlying feature representation is learned, the model can be further fine-tuned on downstream computer vision tasks, such as image classification, semantic segmentation, and object detection.
