**Truncated Natural Policy Gradient** (TNPG) is an algorithm for computing the natural gradient direction approximately. Instead of forming or inverting the Fisher information matrix, it only needs to compute the product $I(\theta)v$, where:

1. $I(\theta)$ is the Fisher information matrix.
2. $v$ is an arbitrary vector.

TNPG improves on the natural policy gradient, whose exact computation has a high computational cost. It is therefore most useful when the parameter space is high-dimensional.
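The idea of needing only products $I(\theta)v$ can be illustrated with a conjugate gradient solver, which is how TNPG is usually described. The sketch below is a minimal illustration, not the reference implementation: the names `conjugate_gradient`, `toy_fvp`, and the 2x2 matrix `F` are hypothetical, and the small explicit matrix only stands in for a real Fisher-vector product.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve I(theta) x = g using only products I(theta) v.

    fvp   -- function returning I(theta) v for a vector v (assumed given)
    g     -- the ordinary policy gradient
    iters -- truncation: the solver stops after this many iterations
    """
    x = [0.0] * len(g)
    r = list(g)  # residual g - I x, with x = 0 at the start
    p = list(g)  # current search direction
    rs_old = dot(r, r)
    for _ in range(iters):
        if rs_old < tol:
            break
        Ip = fvp(p)
        alpha = rs_old / dot(p, Ip)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ipi for ri, ipi in zip(r, Ip)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Toy stand-in for the Fisher information matrix (symmetric positive definite).
F = [[2.0, 0.5], [0.5, 1.0]]


def toy_fvp(v):
    """Hypothetical Fisher-vector product; a real one never builds F."""
    return [sum(F[i][j] * v[j] for j in range(2)) for i in range(2)]


g = [1.0, 1.0]
direction = conjugate_gradient(toy_fvp, g)
```

The solver never inverts `F`; it only calls `toy_fvp`, so the cost per iteration is one matrix-vector product.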

San Diego State University

1. Models of Human Memory
2. Spaced Repetition
3. Leitner System
4. Supermemo System
5. Intelligent Tutoring System
6. Relation between Tutoring Systems and Student Learning
7. Reinforcement Learning
8. Trust Region Policy Optimization
9. Truncated Natural Policy Gradient
10. Recurrent Neural Networks

Related Theory (Using deep reinforcement learning for personalizing review sessions on e-learning platforms with spaced repetition)

Here the authors describe three earlier probabilistic models of human memory:

**Exponential Forgetting Curve** - one of the oldest models of human memory.
The authors use a parameterization in which the user either forgets an item completely or retains it, so the recall outcome $Z$ is distributed as:

$$Z \sim \mathrm{Bernoulli}\left(\exp\left(-\theta \frac{D}{S}\right)\right)$$

"where θ ∈ R+ is the item difficulty, D ∈ R+ is the time elapsed since the item was last reviewed by the student, and S ∈ R+ is the student’s memory strength for the item." $S$ is set to the number of trials.

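As a small worked example of the exponential forgetting curve, the recall probability $\exp(-\theta D / S)$ can be evaluated directly. This is only a sketch: the function name and the numeric values are made up for illustration.

```python
import math


def efc_recall_probability(theta, delay, strength):
    """Recall probability exp(-theta * D / S) of the exponential forgetting curve."""
    return math.exp(-theta * delay / strength)


# Same item difficulty and delay, different memory strength (illustrative values).
p_weak = efc_recall_probability(theta=0.5, delay=2.0, strength=1.0)    # exp(-1.0)
p_strong = efc_recall_probability(theta=0.5, delay=2.0, strength=4.0)  # exp(-0.25)
```

With everything else fixed, a larger memory strength $S$ gives a higher recall probability, and a longer delay $D$ gives a lower one.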

**Half-Life Regression** - unlike the **exponential forgetting curve**, this model has no separate difficulty parameter. Instead it uses a feature vector $\vec{x} \in \mathcal{X}$, which contains information about the student's study history, and model parameters $\vec{\theta} \in \Theta$. The memory strength is:

$$S = \exp(\vec{\theta} \cdot \vec{x})$$

To encode the number of attempts, the numbers of correct and incorrect answers, and the item identity, the authors set $\mathcal{X} = \mathbb{N}^{3} \times \{0, 1\}^{n}$. Dropping the item difficulty loses no information, because the difficulty "is absorbed into the memory strength via the coefficients of the item identity indicator features."
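The feature encoding and the memory strength formula can be sketched as follows. Two things here are assumptions rather than statements from the text: the ordering of the features inside the vector, and the recall probability $\exp(-D/S)$, which is taken to be the exponential forgetting curve with the difficulty term dropped. All names and numbers are hypothetical.

```python
import math


def hlr_features(attempts, correct, incorrect, item_index, n_items):
    """Build x in N^3 x {0,1}^n: three counts plus a one-hot item identity."""
    one_hot = [0] * n_items
    one_hot[item_index] = 1
    return [attempts, correct, incorrect] + one_hot


def hlr_memory_strength(theta, x):
    """Memory strength S = exp(theta . x)."""
    return math.exp(sum(t * xi for t, xi in zip(theta, x)))


def hlr_recall_probability(theta, x, delay):
    """Assumed recall probability exp(-D / S), with no separate difficulty."""
    return math.exp(-delay / hlr_memory_strength(theta, x))


# Illustrative parameters: three count weights, then one weight per item.
theta = [0.1, 0.3, -0.2, 0.5, -0.5]
x = hlr_features(attempts=3, correct=2, incorrect=1, item_index=0, n_items=2)
strength = hlr_memory_strength(theta, x)
p_recall = hlr_recall_probability(theta, x, delay=2.0)
```

The last two entries of `theta` play the role of per-item difficulty, which is what the quoted sentence means by difficulty being absorbed into the item identity coefficients.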

**Generalized Power Law** (the paper describes this part precisely, and it could not be broken down into smaller parts) - here the likelihood of recall takes a different form:


In the last formula from the picture, $a$ and $d$ are the parameters for student ability and item difficulty.


Background
(Accelerating Human Learning With Deep Reinforcement Learning)

Reviewing or practicing spread out over time can give better results than concentrating the same review into a short period.

Spaced Repetition

A method that applies spaced repetition to flashcards.

**The Goal:** focus more on the items that students have difficulty recalling and spend less time on the ones they recall easily.

Leitner System

In reinforcement learning, the machine learns by trial and error: it is rewarded or penalized for the actions it performs. Its goal is to maximize the total reward by using the outcomes of previous attempts to make the next decision. The model starts from completely random trials and can eventually arrive at sophisticated tactics and skills. Unlike supervised and unsupervised learning, which fit a fixed dataset during a training phase, a reinforcement learning model generates its own experience by acting and can keep learning for as long as it keeps interacting with its environment.
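The trial-and-error loop can be illustrated with a very small example. The sketch below uses an epsilon-greedy agent on a two-armed bandit, which is a deliberately simple stand-in chosen for illustration; it is not the method used in the papers discussed here, and every name and number in it is hypothetical.

```python
import random


def run_bandit(true_reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error on a multi-armed bandit.

    Returns the estimated value of each action, how often each action
    was tried, and the total reward collected.
    """
    rng = random.Random(seed)
    n_actions = len(true_reward_probs)
    counts = [0] * n_actions
    values = [0.0] * n_actions  # running mean reward per action
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: random trial
        else:
            action = max(range(n_actions), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < true_reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental mean: leverage previous attempts for the next decision.
        values[action] += (reward - values[action]) / counts[action]
        total_reward += reward
    return values, counts, total_reward


values, counts, total_reward = run_bandit([0.2, 0.8])
```

The agent is never told which action is better; it only sees rewards, and its value estimates move toward the true reward probabilities of the actions it tries.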

Reinforcement Learning

In the 21st century, MOOCs (Massive Open Online Courses) such as Coursera, Udemy, and edX have proliferated, and applications like Duolingo and Quizlet have become quite popular. Researchers are keen to equip these systems with more intelligence so that the learning process can be overseen.

As described in the paper, computer-assisted instruction can raise student performance by 0.3 standard deviations, while human tutors raise it by 2.

Intelligent Tutoring Systems combine AI, cognitive science, and educational theory, which makes research in this field arduous.

In earlier studies, three main modules were defined in an ITS:

1. **The expert knowledge module** - this module can be thought of as the source of knowledge. Its purpose is to generate questions, answers, and solutions for a particular problem.

2. **The student model module** - the student's skills and knowledge, which change constantly. It is very important for the **tutor module** to know the student's current level and skills.

3. **The tutor module** - the teaching strategy. This module decides which lesson to present to the student.

Recently, a fourth module was added: **the user interface module**.



Intelligent Tutoring Systems (Using deep reinforcement learning for personalizing review sessions on e-learning platforms with spaced repetition)

It is extremely important for a tutoring system to estimate the student's current knowledge level, predict the student's ability to answer or solve a particular task or question correctly, and update its estimate of the student's knowledge accordingly. The tutoring system is also responsible for providing questions, and it should take many factors into account, such as:

1. Duration (to learn material)
2. Manner (to learn material)

Relation between Tutoring Systems and Student learning

TRPO is a policy optimization algorithm in reinforcement learning that takes gradient-based steps restricted to a trust region. It is designed to be stable, with a theoretical guarantee of monotonic improvement. "This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks."

TRPO typically performs better than vanilla policy gradients because the step size does not have to be hand-tuned: it is bounded by a constraint on how far the new policy may move from the old one. Additionally, it reuses samples drawn from the old policy to optimize the new one.
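For reference, the constrained problem that TRPO solves at each update is commonly written as follows. The notation is an assumption, since the text above does not fix any symbols: $\pi_\theta$ is the new policy, $\pi_{\theta_{old}}$ the old policy that generated the samples, $A_{\theta_{old}}$ its advantage function, and $\delta$ the trust-region size.

$$\max_{\theta} \; \mathbb{E}_{s, a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} A_{\theta_{old}}(s, a) \right] \quad \text{subject to} \quad \mathbb{E}_{s} \left[ D_{KL}\left( \pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta$$

The probability ratio is what lets samples from the old policy be reused, and the KL constraint is what limits the step size.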



Trust Region Policy Optimization

Truncated Natural Policy Gradient

Recurrent neural networks, or RNNs, are a family of neural networks for processing sequential data. A recurrent neural network is specialized for processing a sequence of inputs $x^{(1)}, \dots, x^{(\tau)}$: at each step it combines the current input with a hidden state that summarizes the previous inputs.
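How an RNN carries information across a sequence can be sketched with a plain forward pass. This is a minimal illustration assuming the common update $h^{(t)} = \tanh(W_{xh} x^{(t)} + W_{hh} h^{(t-1)} + b_h)$; the weight names and values are hypothetical and untrained.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    h = [0.0] * len(b_h)  # initial hidden state
    states = []
    for x in inputs:
        pre = [a + c + d
               for a, c, d in zip(matvec(W_xh, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(p) for p in pre]  # new state depends on x and old h
        states.append(h)
    return states


# Illustrative weights: input size 1, hidden size 2.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
b_h = [0.0, 0.1]

sequence = [[1.0], [0.0], [-1.0]]
states = rnn_forward(sequence, W_xh, W_hh, b_h)
states_reversed = rnn_forward(sequence[::-1], W_xh, W_hh, b_h)
```

Feeding the same inputs in reverse order produces a different final hidden state, which is the sense in which the network is sensitive to sequence order.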


Recurrent Neural Network (RNN)

The SuperMemo 2 (SM-2) algorithm calculates intervals for spaced repetition learning. The steps are:

1. Split knowledge into distinct items.
2. Initialize E-Factor (easiness of memorizing) to 2.5 for all items.
3. Repeat items using the following intervals in days, where $I(n)$ is the interval after the $n$-th repetition:
   - $I(1) := 1$
   - $I(2) := 6$
   - For $n > 2$: $I(n) := I(n-1) \times EF$
4. Assess the quality of the response ($q$) on a 0-5 scale: 5 (perfect), 4 (correct after hesitation), 3 (correct but difficult), 2 (incorrect, but correct answer seemed easy to recall), 1 (incorrect, but correct answer remembered), 0 (complete blackout).
5. Update the E-Factor using the formula: $EF_{new} := EF + (0.1 - (5-q) \times (0.08 + (5-q) \times 0.02))$. If $EF_{new} < 1.3$, set $EF_{new} = 1.3$.
6. If $q < 3$, restart the repetitions for the item from the beginning (intervals $I(1)$, $I(2)$, and so on) without changing the E-Factor.
7. After each session, repeat the items that scored below $q = 4$ until every item reaches a grade of at least $q = 4$.
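The interval and E-Factor rules above can be sketched as a single update function. Several details are assumptions rather than part of the steps as listed: the interval for $n > 2$ uses the already updated E-Factor, fractional intervals are rounded up to whole days, and a failed item is scheduled again after one day. The function name and state layout are hypothetical.

```python
import math


def sm2_review(quality, n, interval, ef):
    """Apply one SM-2 review and return the new (n, interval_days, ef).

    quality  -- response grade q from 0 to 5
    n        -- number of successful repetitions in a row so far
    interval -- current interval in days
    ef       -- current E-Factor
    """
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality < 3:
        # Restart the repetitions; the E-Factor is left unchanged.
        return 0, 1, ef
    # E-Factor update, clamped at 1.3.
    ef = ef + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    ef = max(1.3, ef)
    n += 1
    if n == 1:
        interval = 1
    elif n == 2:
        interval = 6
    else:
        interval = math.ceil(interval * ef)  # rounding up is an assumption
    return n, interval, ef


# One item reviewed four times, starting from the initial E-Factor of 2.5.
state = (0, 0, 2.5)
history = []
for q in [5, 4, 3, 2]:
    state = sm2_review(q, *state)
    history.append(state)
```

In this run the intervals grow from 1 to 6 to 15 days while the grades are at least 3, and the final grade of 2 resets the item to a one-day interval with the E-Factor unchanged.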
