Attention mechanism is achieved  through kernelization.

- Performers
- Linear Transformers 
- Random Feature Attention


San Jose State University

Primary goal of an efficient transformer model is to improve the memory complexity of the self attention mechanism.The different methods or patterns that significantly improves the efficiency can be classified as shown below

- Fixed Patterns (FP) 
	 - Blockwise Patterns
         - Strided Patterns
         - Compressed Patterns
- Combination of Patterns (CP)
- Learnable Patterns (LP)
- Neural Memory
- Low-Rank Methods
- Kernel
- Recurrence
- Downsampling
- Sparse Models and Conditional Computation 


Taxonomy of Efficient Transformers

Efficient Transformers: A Survey


Sparsification of attention matrix by limiting the field of view.
Field of view can be fixed , predefined patterns such as local windows and block patterns of fixed strides. 

 - Blockwise Patterns
 - Strided Patterns
 - Compressed Patterns 


Transformer models using Fixed Patterns

Combination of different access patterns also improves efficiency of the model.

Sparse Transformer
Axial Transformer


Transformer models using Combination of Patterns (CP)

Notion of token relevance  is determined in a data driven fashion and are then assigned to buckets or chunks.

 - Routing Transformer
 - Reformer


Transformer patterns using Learnable patterns

Leverage a learnable side memory module that can access multiple tokens at once. 

- Set Transformers
- ETC
- Longformer


Transformer models using  Neural Memory

n this method efficiency is improved using low rank approximation of self attention matrix.

- Linformer


Transformer models using Low-Rank Methods

Transformer models using Kernels

A natural extension to the blockwise method is to connect these blocks via recurrence.

- Transformer-XL (Dai et al., 2019) 


Transformer models using Recurrence

Reduction of resolution of sequence

- Perceiver
- Funnel Transformers (Dai et al., 2020)
- Swin Transformer (Liu et al., 2021b)
- Charformer (Tay et al., 2021c)


Transformer models using Downsampling

Sparse models sparsely activate a subset of the parameters which generally improves the parameter to FLOPs ratio

Switch Transformers (Fedus et al., 2021), 
ST-MoE (Zoph et al., 2022), 
GShard (Lepikhin et al., 2020), 
Product-Key Memory Layers (Lample et al., 2019)


Learn Before

Related