Concept

Self-Attention Layer (A Self-Attentive model for Knowledge Tracing)

In their model, the authors use a scaled dot-product attention mechanism. The layer assigns a weight to each past exercise in order to predict whether the student will answer the next question correctly. They use multiple attention heads to gather information from different representation subspaces, and only the first t interactions are used to predict the (t+1)-th interaction, i.e. attention is causally masked.
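Below is a minimal sketch of this kind of causally masked, multi-head, scaled dot-product self-attention in PyTorch. The names (CausalSelfAttention, d_model, num_heads) are illustrative, and for simplicity queries, keys, and values are all derived from the same interaction sequence, whereas the paper builds queries from the exercise embeddings; this is not the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model=64, num_heads=4):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model) -- embedded past interactions
        b, t, d = x.shape
        # Project and split into heads: (batch, heads, seq_len, d_head)
        q = self.q_proj(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention scores
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        # Causal mask: each position may only attend to itself and earlier
        # interactions, so predicting step t+1 uses only the first t steps
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)  # per-exercise attention weights
        out = (weights @ v).transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out)
```

For a sequence of embedded interactions x of shape (batch, t, d_model), the output at position t summarizes the first t interactions and can be fed to a prediction head for the correctness of interaction t+1.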

Updated 2020-12-05

Tags

Data Science