Learn Before
Concept

Patch Embedding in Vision Transformers

To implement a vision Transformer, the input image must be divided into smaller regions called patches. The process of splitting an image into patches and linearly projecting these flattened patches is known as patch embedding. This entire operation can be simplified and implemented as a single two-dimensional convolution operation, where both the kernel size and the stride size are set strictly equal to the patch size.

0

1

Updated 2026-05-15

Contributors are:

Who are from:

Tags

D2L

Dive into Deep Learning @ D2L

Related