1. TRPO performs better than the other baselines on **GPL**.
2. TRPO sometimes outperforms the other baselines on **EFC** and **HLR**, and vice versa.
3. TRPO performs worse than the threshold policy (described as unsurprising, since the threshold policy has access to the latent parameters).

The authors find it interesting that TRPO sometimes performs worse than the other baselines on **EFC** and **HLR**. They state that tuning additional hyperparameters and trying other policies could improve TRPO's performance.

San Diego State University

1. Research Question
2. Environment
3. Baselines
4. Implementation details
5. Analysis

Experiments
(Accelerating Human Learning With Deep Reinforcement Learning)

The **EFC** (Exponential Forgetting Curve), **HLR** (Half-Life Regression), and **GPL** (Generalized Power Law) student models are implemented as OpenAI Gym environments.
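
To make the idea of a student model behind a Gym-style interface more concrete, here is a minimal sketch. It deliberately does not import `gym`, so it stays self-contained, and it mimics only the classic `reset`/`step` convention. The class name `EFCStudentEnv`, the recall formula $$ \exp(-\theta \cdot \Delta t / s) $$, the strength update (plus one per review), and the reward (the review outcome) are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


class EFCStudentEnv:
    """Gym-style sketch of an exponential-forgetting-curve student (hypothetical)."""

    def __init__(self, difficulties, n_steps=200, delay=5.0, seed=0):
        self.difficulties = list(difficulties)  # one difficulty theta per item
        self.n_steps = n_steps                  # episode length T
        self.delay = delay                      # seconds between steps D
        self.rng = random.Random(seed)
        self.reset()

    def recall_prob(self, item):
        """Assumed form: exp(-theta * elapsed / strength)."""
        elapsed = self.now - self.last_review[item]
        return math.exp(-self.difficulties[item] * elapsed / self.strength[item])

    def reset(self):
        n = len(self.difficulties)
        self.now = 0.0
        self.t = 0
        self.strength = [1.0] * n      # assumed initial memory strength
        self.last_review = [0.0] * n
        return (0, 0, 0.0)             # dummy first observation: (item, outcome, elapsed)

    def step(self, action):
        """Show item `action`; return (observation, reward, done, info)."""
        self.now += self.delay
        self.t += 1
        elapsed = self.now - self.last_review[action]
        recalled = int(self.rng.random() < self.recall_prob(action))
        self.strength[action] += 1.0   # assumed: every review strengthens memory
        self.last_review[action] = self.now
        done = self.t >= self.n_steps
        return (action, recalled, elapsed), float(recalled), done, {}
```

A scheduler interacts with this sketch by calling `env.step(item_index)` once per time step until `done` is true. Swapping in a different student model would only change `recall_prob` and the strength update.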

Environments
(Accelerating Human Learning With Deep Reinforcement Learning)

Analysis
(Accelerating Human Learning With Deep Reinforcement Learning)

The authors compare TRPO to four baseline schedulers:
1. Random Policy
2. Leitner System
3. Variant of SuperMemo
4. Threshold-based Policy

As described in the paper, the threshold-based policy has direct access to the student simulator's parameters when it calculates recall likelihoods, so the authors used it as an upper bound in their experiments.
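
The sketch below contrasts the simplest baseline with the threshold-based one. It assumes that the threshold policy picks the item whose recall likelihood is closest to a fixed threshold; that selection rule, the default value `0.5`, and both function names are assumptions made for illustration and are not taken from the notes above.

```python
import random


def random_policy(n_items, rng=random):
    """Pick the next item uniformly at random."""
    return rng.randrange(n_items)


def threshold_policy(recall_probs, threshold=0.5):
    """Pick the item whose recall likelihood is closest to `threshold`.

    `recall_probs` must come from the student simulator itself, which is
    why this baseline needs direct access to the simulator's parameters.
    """
    return min(range(len(recall_probs)),
               key=lambda i: abs(recall_probs[i] - threshold))


# Item 1 (likelihood 0.55) is closest to the 0.5 threshold.
print(threshold_policy([0.9, 0.55, 0.1]))  # → 1
```

Because the recall likelihoods are exact rather than estimated from observed answers, a policy of this kind has an information advantage over any scheduler that only sees review outcomes.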

Baselines
(Accelerating Human Learning With Deep Reinforcement Learning)

Can the described deep reinforcement learning (DRL) scheduling algorithm help students achieve their educational objectives better than baseline methods, under various assumptions about student learning?

Research Question (Accelerating Human Learning With Deep Reinforcement Learning)

These are the parameters the authors used in their experiments:
1. $$ n = 30 $$ (number of items)
2. $$ T = 200 $$ (number of steps)
3. $$ D = 5 $$ (delay between steps, in seconds)
4. For the **EFC** (Exponential Forgetting Curve) student model, the item difficulty $$ \theta $$ is sampled from the distribution $$ \log\theta \sim \mathcal{N}(\log(0.077), 1) $$.
5. For the **HLR** (Half-Life Regression) student model: $$ \overrightarrow{\theta} = (1, 1, 0, \theta_3) $$ where $$ \theta_3 \sim \mathcal{N}(0, 1) $$; and $$ \overrightarrow{x_i} = $$ (number of attempts, number correct, number incorrect, one-hot encoding of item $$ i $$ out of $$ n $$ items).
6. For the **GPL** (Generalized Power-Law) student model: $$ a = \overrightarrow{\alpha} = 0 $$; $$ d \sim \mathcal{N}(1, 1) $$; $$ \log \overrightarrow{d} \sim \mathcal{N}(0, 1) $$; $$ \log r \sim \mathcal{N}(0, 0.01) $$; $$ W = 5 $$; $$ \theta_{2w} = \theta_{2w - 1} = \frac{1}{\sqrt{W - w + 1}} $$
7. For the TRPO algorithm, the batch size is 4000, the discount rate is $$ \gamma = 0.99 $$, and the step size is 0.01.
8. For the recurrent network policy, the hidden layer has 32 units.
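
As a rough illustration of how some of these settings might be instantiated, the sketch below samples EFC item difficulties and builds an HLR feature vector using only the Python standard library. The function names and the `TRPO_CONFIG` dictionary keys are hypothetical and do not correspond to any particular library's API.

```python
import math
import random

N_ITEMS = 30    # n
N_STEPS = 200   # T
DELAY = 5.0     # D, in seconds

# Hypothetical container for the reported TRPO settings.
TRPO_CONFIG = {"batch_size": 4000, "discount": 0.99, "step_size": 0.01}


def sample_efc_difficulties(n, rng):
    """Draw theta with log(theta) ~ N(log 0.077, 1).

    random.lognormvariate(mu, sigma) takes the mean and standard
    deviation of the underlying normal distribution.
    """
    return [rng.lognormvariate(math.log(0.077), 1.0) for _ in range(n)]


def hlr_features(item, attempts, correct, incorrect, n):
    """Build x_i = (attempts, correct, incorrect, one-hot of item i)."""
    one_hot = [1.0 if j == item else 0.0 for j in range(n)]
    return [float(attempts), float(correct), float(incorrect)] + one_hot


rng = random.Random(0)
difficulties = sample_efc_difficulties(N_ITEMS, rng)
features = hlr_features(item=2, attempts=3, correct=2, incorrect=1, n=N_ITEMS)
print(len(difficulties), len(features))  # → 30 33
```

Since the median of this log-normal distribution is 0.077, roughly half of the sampled items will be harder than that value and half easier.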
