Input sequences are chunked into fixed-size blocks of tokens, which form the local receptive fields. Attention is computed within each block, so the cost per block drops from $N^2$ to $B^2$, where $B$ is the block size and $B \ll N$; over all $N/B$ blocks the total is $N \cdot B$ rather than $N^2$.

Examples of models using this technique are Blockwise (Qiu et al., 2019) and Local Attention (Parmar et al., 2018).
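
The blockwise idea can be sketched as an attention mask. This is a minimal illustration of the simplest non-overlapping variant, not the exact scheme of either paper cited above; the names `blockwise_mask` and `count_scores` are hypothetical helpers introduced here.

```python
def blockwise_mask(seq_len: int, block_size: int):
    """Return mask[i][j] == True when query i may attend to key j.

    Assumption: non-overlapping blocks, and a token attends only
    to tokens that fall in the same block.
    """
    return [
        [i // block_size == j // block_size for j in range(seq_len)]
        for i in range(seq_len)
    ]


def count_scores(mask):
    """Count how many attention scores the mask keeps."""
    return sum(sum(row) for row in mask)


mask = blockwise_mask(seq_len=8, block_size=2)
kept = count_scores(mask)  # 4 blocks of 2 x 2 scores, versus 8 x 8 for full attention
```

With $N = 8$ and $B = 2$ the mask keeps $N \cdot B = 16$ scores instead of $N^2 = 64$.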


San Jose State University

The attention matrix is sparsified by limiting the field of view.
The field of view can follow fixed, predefined patterns such as local windows and block patterns of fixed strides.

 - Blockwise Patterns
 - Strided Patterns
 - Compressed Patterns 
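
A local window is the most direct way to limit the field of view. The sketch below is only an illustration under the assumption of a symmetric window; `local_window_mask` and its `window` parameter are hypothetical names, not taken from the survey.

```python
def local_window_mask(seq_len: int, window: int):
    """Return mask[i][j] == True when key j lies within `window`
    positions of query i (symmetric, non-causal window)."""
    return [
        [abs(i - j) <= window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = local_window_mask(seq_len=6, window=1)
kept = sum(sum(row) for row in mask)  # number of attention scores kept
```

Each row keeps at most `2 * window + 1` entries, so the number of scores grows linearly with the sequence length.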


Transformer models using Fixed Patterns

Efficient Transformers: A Survey


Transformer models using Blockwise Patterns

These models attend to the input sequence only at fixed intervals.

Examples of models using this technique are Sparse Transformer (Child et al., 2019) and Longformer (Beltagy et al., 2020), which employ strided or “dilated” windows.
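
Attending at fixed intervals can also be written as a mask. This is a simplified sketch and not the exact pattern of either model above; `strided_mask` and `stride` are hypothetical names used for illustration.

```python
def strided_mask(seq_len: int, stride: int):
    """Return mask[i][j] == True when positions i and j are a whole
    number of strides apart, so each query sees every stride-th key."""
    return [
        [(i - j) % stride == 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = strided_mask(seq_len=6, stride=2)
kept = sum(sum(row) for row in mask)  # number of attention scores kept
```

Each query keeps roughly `seq_len / stride` keys, which still spans the whole sequence, unlike a local window.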


Transformer models using Strided Patterns

Compressed patterns use a pooling operator to down-sample the sequence length, which is another form of fixed pattern.

Compressed Attention (Liu et al., 2018) uses strided convolution to effectively reduce the sequence length.
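
The down-sampling step can be sketched with mean pooling. Note the substitution: this uses plain averaging rather than the learned strided convolution of Liu et al. (2018); it matches a strided convolution only in the special case where the kernel size equals the stride, the weights are uniform, and there is no bias. `pool_sequence` is a hypothetical helper.

```python
def pool_sequence(vectors, stride: int):
    """Average each run of `stride` consecutive vectors into one vector."""
    pooled = []
    for start in range(0, len(vectors), stride):
        chunk = vectors[start:start + stride]
        dim = len(chunk[0])
        pooled.append(
            [sum(v[d] for v in chunk) / len(chunk) for d in range(dim)]
        )
    return pooled


keys = [[float(i), 1.0] for i in range(8)]  # 8 toy key vectors of dimension 2
compressed = pool_sequence(keys, stride=4)  # 2 key vectors remain
```

If the keys and values are compressed this way, $N$ queries attend over $N / \text{stride}$ positions, so the score matrix shrinks from $N \times N$ to $N \times (N / \text{stride})$.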

