Concept

Subsequent Masking (Deep Attentive Study Session Dropout Prediction in Mobile Learning Environment)

In this model, unlike the original transformer, masking is applied to all multi-head attention layers. At each step the prediction may only use information from previous interactions, i.e., $e_1, \cdots, e_n$ and $l_1, \cdots, l_{n-1}$, which is why the subsequent mask is used in every layer rather than only in the decoder.
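The paper does not publish code, but the mechanism is the standard causal mask. A minimal sketch in PyTorch (the function names here are illustrative, not from the paper): positions above the diagonal are blocked before the softmax, so position $i$ can only attend to positions $j \le i$, and the same mask would be passed to every multi-head attention layer.

```python
import math
import torch
import torch.nn.functional as F

def subsequent_mask(n: int) -> torch.Tensor:
    """Boolean (n, n) mask: True where position i may attend to position j <= i."""
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def masked_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with a subsequent (causal) mask.

    q, k, v: tensors of shape (batch, seq_len, d_k).
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (batch, n, n)
    mask = subsequent_mask(q.size(-2)).to(scores.device)
    scores = scores.masked_fill(~mask, float("-inf"))        # hide future positions
    return F.softmax(scores, dim=-1) @ v
```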

Updated 2021-01-15

Tags

Data Science