Strided patterns attend to the input sequence only at fixed intervals.

Examples of models using this technique are Sparse Transformer (Child et al., 2019) and Longformer (Beltagy et al., 2020), which employ strided or “dilated” windows.
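A strided pattern can be pictured as a boolean attention mask. The sketch below is only illustrative: the function name `strided_mask`, the non-causal layout, and the rule "attend when the position difference is a multiple of the stride" are assumptions made here, not the exact pattern of Sparse Transformer or Longformer.

```python
import numpy as np

def strided_mask(seq_len: int, stride: int) -> np.ndarray:
    """Return a (seq_len, seq_len) boolean mask.

    Entry [i, j] is True when query i may attend to key j, which here
    means that (i - j) is a multiple of `stride` (assumed rule).
    """
    idx = np.arange(seq_len)
    diff = idx[:, None] - idx[None, :]
    return diff % stride == 0

# Each of the 8 queries keeps only 2 of the 8 keys.
mask = strided_mask(seq_len=8, stride=4)
print(mask.astype(int))
```

With a stride of $$s$$, each query keeps roughly $$N / s$$ keys instead of all $$N$$.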


Transformer models using Strided Patterns

Compressed patterns use a pooling operator to down-sample the sequence length, which is another form of fixed pattern.

Compressed Attention (Liu et al., 2018) uses strided convolution to effectively reduce the sequence length.
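The idea of shortening the keys and values before attention can be sketched as follows. This is a minimal sketch under stated assumptions: it uses non-overlapping mean pooling (a strided convolution with fixed, uniform weights) rather than the learned convolution of Liu et al. (2018), and the names `strided_pool` and `compressed_attention` are hypothetical.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def strided_pool(x: np.ndarray, stride: int) -> np.ndarray:
    """Average non-overlapping windows: (N, d) -> (N // stride, d)."""
    n, d = x.shape
    n_trim = (n // stride) * stride  # drop a ragged tail, for simplicity
    return x[:n_trim].reshape(n_trim // stride, stride, d).mean(axis=1)

def compressed_attention(q, k, v, stride):
    """Full-length queries attend to down-sampled keys and values."""
    k_c, v_c = strided_pool(k, stride), strided_pool(v, stride)
    scores = q @ k_c.T / np.sqrt(q.shape[-1])  # shape (N, N // stride)
    return softmax(scores) @ v_c

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(12, 4)) for _ in range(3))
out = compressed_attention(q, k, v, stride=3)
print(out.shape)
```

The score matrix has $$N \times (N / s)$$ entries for stride $$s$$, rather than $$N \times N$$.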


Transformer models using Compressed Patterns

Input sequences are chunked into fixed blocks of tokens, which form local receptive fields. Chunking reduces the attention cost from $$N^2$$ to $$B^2$$ per block (where $$B$$ is the block size and $$B \ll N$$), significantly reducing the computational cost. Examples of models using this technique are Blockwise (Qiu et al., 2019) and Local Attention (Parmar et al., 2018).
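The chunking step can be sketched in a few lines. The sketch assumes a single head, no learned projections, and a sequence length divisible by the block size; `blockwise_attention` is a hypothetical name, not the API of any of the cited models. With $$N / B$$ blocks each costing $$B^2$$, the total number of scores is $$N \cdot B$$.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def blockwise_attention(q, k, v, block_size):
    """Each token attends only to tokens inside its own block."""
    n, d = q.shape
    assert n % block_size == 0, "sketch assumes N is a multiple of B"
    nb = n // block_size
    qb = q.reshape(nb, block_size, d)
    kb = k.reshape(nb, block_size, d)
    vb = v.reshape(nb, block_size, d)
    # One (B, B) score matrix per block instead of one (N, N) matrix.
    scores = qb @ kb.transpose(0, 2, 1) / np.sqrt(d)
    return (softmax(scores) @ vb).reshape(n, d)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
out = blockwise_attention(q, k, v, block_size=4)
print(out.shape)
```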

Transformer models using Blockwise Patterns

Fixed patterns sparsify the attention matrix by limiting the field of view to fixed, predefined patterns such as local windows and block patterns of fixed strides.

 - Blockwise Patterns
 - Strided Patterns
 - Compressed Patterns 


San Jose State University

The primary goal of an efficient transformer model is to improve the memory complexity of the self-attention mechanism. The methods or patterns that significantly improve efficiency can be classified as follows:

- Fixed Patterns (FP)
    - Blockwise Patterns
    - Strided Patterns
    - Compressed Patterns
- Combination of Patterns (CP)
- Learnable Patterns (LP)
- Neural Memory
- Low-Rank Methods
- Kernel
- Recurrence
- Downsampling
- Sparse Models and Conditional Computation 


Taxonomy of Efficient Transformers

Efficient Transformers: A Survey


Transformer models using Fixed Patterns

Combining different access patterns also improves the efficiency of the model.

- Sparse Transformer
- Axial Transformer


Transformer models using Combination of Patterns (CP)

The notion of token relevance is determined in a data-driven fashion, and tokens are then assigned to buckets or chunks.

 - Routing Transformer
 - Reformer
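Bucket assignment can be illustrated with a hash over random projections, in the spirit of the locality-sensitive hashing used by Reformer. This is a rough sketch under assumptions: `lsh_buckets` is a hypothetical helper, the projection is a single random matrix, and multi-round hashing, sorting, and chunking are left out. Routing Transformer uses clustering instead, which is not shown.

```python
import numpy as np

def lsh_buckets(x: np.ndarray, n_buckets: int, rng) -> np.ndarray:
    """Assign each row of x (N, d) to one of n_buckets buckets."""
    assert n_buckets % 2 == 0, "sketch assumes an even bucket count"
    r = rng.normal(size=(x.shape[-1], n_buckets // 2))
    proj = x @ r
    # The bucket is the index of the largest entry of [proj, -proj].
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))
x[1] = x[0]  # identical vectors always share a bucket
buckets = lsh_buckets(x, n_buckets=4, rng=rng)

# Tokens may attend only to tokens that landed in the same bucket.
mask = buckets[:, None] == buckets[None, :]
print(buckets)
```

Unlike a fixed pattern, the resulting mask depends on the token representations themselves.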


Transformer models using Learnable Patterns

These models leverage a learnable side memory module that can access multiple tokens at once.

- Set Transformers
- ETC
- Longformer
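One common form of side memory is a small set of global tokens that see, and are seen by, the entire sequence. The mask below is a sketch of that idea only: treating the first `n_global` positions as memory and pairing them with a local window are assumptions made for illustration, and `global_local_mask` is a hypothetical name. Set Transformers use inducing points instead, which are not shown.

```python
import numpy as np

def global_local_mask(seq_len: int, n_global: int, window: int) -> np.ndarray:
    """Boolean mask: local window plus global (memory) positions."""
    idx = np.arange(seq_len)
    local = np.abs(idx[:, None] - idx[None, :]) <= window
    is_global = idx < n_global  # assumed: memory sits at the front
    # A global row attends everywhere; a global column is visible to all.
    return local | is_global[:, None] | is_global[None, :]

mask = global_local_mask(seq_len=8, n_global=1, window=1)
print(mask.astype(int))
```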


Transformer models using Neural Memory

In this method, efficiency is improved using a low-rank approximation of the self-attention matrix.

- Linformer
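A Linformer-style projection can be sketched by shrinking the length dimension of the keys and values from $$N$$ to a smaller $$r$$ before attention. In this sketch the projection matrices `e` and `f` are random stand-ins for parameters that would be learned, and `linformer_attention` is a hypothetical name.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(q, k, v, e, f):
    """q, k, v: (N, d); e, f: (r, N) projections over the length axis."""
    k_proj = e @ k  # (r, d)
    v_proj = f @ v  # (r, d)
    scores = q @ k_proj.T / np.sqrt(q.shape[-1])  # (N, r), not (N, N)
    return softmax(scores) @ v_proj

rng = np.random.default_rng(0)
n, d, r = 16, 4, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
e = rng.normal(size=(r, n)) / np.sqrt(n)  # stand-in for a learned matrix
f = rng.normal(size=(r, n)) / np.sqrt(n)  # stand-in for a learned matrix
out = linformer_attention(q, k, v, e, f)
print(out.shape)
```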


Transformer models using Low-Rank Methods

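The attention mechanism is viewed through kernelization, which allows self-attention to be rewritten without explicitly computing the $$N \times N$$ attention matrix.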

- Performers
- Linear Transformers 
- Random Feature Attention
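The rewriting can be sketched with a simple feature map: once the similarity is expressed as $$\phi(q)^\top \phi(k)$$, the products can be regrouped so that keys and values are summarized first. The sketch assumes the feature map $$\text{elu}(x) + 1$$ and a non-causal setting; the function names are hypothetical, and Performers and Random Feature Attention use random-feature maps that are not shown here.

```python
import numpy as np

def feature_map(x: np.ndarray) -> np.ndarray:
    """elu(x) + 1, which is strictly positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """Linear-time form: never builds the (N, N) matrix."""
    qf, kf = feature_map(q), feature_map(k)
    kv = kf.T @ v        # (d, d_v) summary of keys and values
    z = kf.sum(axis=0)   # (d,) normalizer summary
    return (qf @ kv) / (qf @ z)[:, None]

def quadratic_reference(q, k, v):
    """Same computation with the (N, N) matrix built explicitly."""
    a = feature_map(q) @ feature_map(k).T
    return (a / a.sum(axis=1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 4)) for _ in range(3))
out = linear_attention(q, k, v)
print(np.allclose(out, quadratic_reference(q, k, v)))
```

Both functions compute the same result; only the order of the matrix products differs.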


Transformer models using Kernels

A natural extension to the blockwise method is to connect these blocks via recurrence.

- Transformer-XL (Dai et al., 2019) 
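Connecting blocks via recurrence can be sketched by letting each segment attend to a cached copy of the previous segment as well as to itself. This sketch omits much of what Transformer-XL actually does (learned projections, causal masking, relative positional encodings, multiple layers); `segment_recurrent_attention` is a hypothetical name, and keeping exactly one previous segment as memory is an assumption.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def segment_recurrent_attention(x: np.ndarray, segment_len: int) -> np.ndarray:
    """Process x (N, d) segment by segment with a one-segment memory."""
    n, d = x.shape
    memory = np.zeros((0, d))  # empty memory before the first segment
    outputs = []
    for start in range(0, n, segment_len):
        seg = x[start:start + segment_len]
        # Keys and values cover the cached segment plus the current one.
        context = np.concatenate([memory, seg], axis=0)
        scores = seg @ context.T / np.sqrt(d)
        outputs.append(softmax(scores) @ context)
        memory = seg  # cache the current segment for the next step
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
out = segment_recurrent_attention(x, segment_len=4)
print(out.shape)
```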


Transformer models using Recurrence

These models reduce the resolution of the sequence, which reduces computation cost by a commensurate factor.

- Perceiver
- Funnel Transformers (Dai et al., 2020)
- Swin Transformer (Liu et al., 2021b)
- Charformer (Tay et al., 2021c)


Transformer models using Downsampling

Sparse models activate only a subset of their parameters for each input, which generally improves the parameter-to-FLOPs ratio.

- Switch Transformers (Fedus et al., 2021)
- ST-MoE (Zoph et al., 2022)
- GShard (Lepikhin et al., 2020)
- Product-Key Memory Layers (Lample et al., 2019)
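Sparse activation can be sketched with top-1 routing in the style of a mixture-of-experts layer: a router picks one expert per token, so only that expert's parameters are used for the token. The sketch makes several simplifying assumptions: each expert is a single linear map rather than a feed-forward network, there is no capacity limit or load-balancing loss, and `switch_layer` is a hypothetical name.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def switch_layer(x, router_w, expert_ws):
    """x: (N, d); router_w: (d, E); expert_ws: list of E (d, d) matrices."""
    probs = softmax(x @ router_w)   # (N, E) routing probabilities
    choice = probs.argmax(axis=-1)  # top-1 expert per token
    out = np.zeros_like(x)
    for e, w in enumerate(expert_ws):
        sel = choice == e
        # Only tokens routed to expert e touch its parameters.
        out[sel] = (x[sel] @ w) * probs[sel, e][:, None]
    return out, choice

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
router_w = rng.normal(size=(4, 3))
expert_ws = [rng.normal(size=(4, 4)) for _ in range(3)]
out, choice = switch_layer(x, router_w, expert_ws)
print(choice)
```

Adding experts increases the parameter count, while the work done per token stays roughly that of a single expert.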

