Let's start with a quick recap of the basic RNN equations:
$$h^{(t)} = g_1\left(W h^{(t-1)} + U x^{(t)} + b_h\right)$$
$$\hat{y}^{(t)} = g_2\left(V h^{(t)} + b_y\right)$$
$$J^{(t)} = \mathcal{L}\left(\hat{y}^{(t)}, y^{(t)}\right) \quad \text{and} \quad J = \frac{1}{T}\sum_{t=1}^{T} J^{(t)}$$
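To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass (a hedged illustration, assuming $g_1 = \tanh$, $g_2 = \mathrm{softmax}$, and cross-entropy for $\mathcal{L}$; none of these choices are fixed by the equations above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(xs, ys, W, U, V, b_h, b_y):
    """Run the RNN for T steps; xs[t] is x(t), ys[t] an integer class label y(t)."""
    T = len(xs)
    h = {-1: np.zeros(W.shape[0])}                      # zero initial hidden state
    y_hat, J = {}, 0.0
    for t in range(T):
        h[t] = np.tanh(W @ h[t - 1] + U @ xs[t] + b_h)  # h(t) = g1(W h(t-1) + U x(t) + b_h)
        y_hat[t] = softmax(V @ h[t] + b_y)              # y^(t) = g2(V h(t) + b_y)
        J += -np.log(y_hat[t][ys[t]])                   # J(t) = cross-entropy at step t
    return h, y_hat, J / T                              # J = (1/T) * sum_t J(t)
```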
We want to calculate the gradients of the loss function with respect to the parameters $U$, $V$, and $W$. Since $\frac{\partial J}{\partial W} = \frac{1}{T}\sum_{t=1}^{T} \frac{\partial J^{(t)}}{\partial W}$, we can sum up the per-step gradients, so we only need to find $\frac{\partial J^{(t)}}{\partial W}$, $\frac{\partial J^{(t)}}{\partial V}$, and $\frac{\partial J^{(t)}}{\partial U}$.
The derivative with respect to $V$ depends only on the current step:
$$\frac{\partial J^{(t)}}{\partial V} = \frac{\partial J^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial V}$$
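Under the same softmax-plus-cross-entropy assumption as the sketch above, this product collapses nicely: writing $z^{(t)} = V h^{(t)} + b_y$, we get $\frac{\partial J^{(t)}}{\partial z^{(t)}} = \hat{y}^{(t)} - y^{(t)}$, so the per-step gradient for $V$ is a single outer product with the current hidden state:

```python
def grad_V_step(y_hat_t, y_t, h_t):
    """Per-step gradient dJ(t)/dV, assuming softmax output + cross-entropy loss."""
    dz = y_hat_t.copy()
    dz[y_t] -= 1.0                 # dJ(t)/dz(t) = y_hat(t) - y(t), y(t) one-hot
    return np.outer(dz, h_t)       # dJ(t)/dV = dz(t) h(t)^T
```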
But this is not the case for the derivatives with respect to $U$ and $W$: the hidden state $h^{(t)}$ depends on them both directly and through every earlier hidden state, so the chain rule produces a sum over all previous steps. Take $t=3$ as an example:
$$\frac{\partial J^{(3)}}{\partial W} = \sum_{k=1}^{3} \frac{\partial J^{(3)}}{\partial \hat{y}^{(3)}} \frac{\partial \hat{y}^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial h^{(k)}} \frac{\partial h^{(k)}}{\partial W}$$
$$= \sum_{k=1}^{3} \frac{\partial J^{(3)}}{\partial \hat{y}^{(3)}} \frac{\partial \hat{y}^{(3)}}{\partial h^{(3)}} \left( \prod_{j=k+1}^{3} \frac{\partial h^{(j)}}{\partial h^{(j-1)}} \right) \frac{\partial h^{(k)}}{\partial W}$$
and
$$\frac{\partial J^{(3)}}{\partial U} = \sum_{k=1}^{3} \frac{\partial J^{(3)}}{\partial \hat{y}^{(3)}} \frac{\partial \hat{y}^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial h^{(k)}} \frac{\partial h^{(k)}}{\partial U}$$
$$= \sum_{k=1}^{3} \frac{\partial J^{(3)}}{\partial \hat{y}^{(3)}} \frac{\partial \hat{y}^{(3)}}{\partial h^{(3)}} \left( \prod_{j=k+1}^{3} \frac{\partial h^{(j)}}{\partial h^{(j-1)}} \right) \frac{\partial h^{(k)}}{\partial U}$$
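Here $\frac{\partial h^{(k)}}{\partial W}$ (and likewise for $U$) denotes the "immediate" partial derivative, treating $h^{(k-1)}$ as a constant. As a hedged sketch of how this is computed in practice (again assuming $g_1 = \tanh$, so $\frac{\partial h^{(j)}}{\partial h^{(j-1)}} = \mathrm{diag}\big(1 - (h^{(j)})^2\big)\, W$), the sum over $k$ can be accumulated backwards with a running gradient instead of re-multiplying the whole product for each $k$:

```python
def grad_WU_step(t, xs, h, dz_t, W, U, V):
    """dJ(t)/dW and dJ(t)/dU for one step t, with dz_t = y_hat(t) - y(t)."""
    dW, dU = np.zeros_like(W), np.zeros_like(U)
    dh = V.T @ dz_t                        # dJ(t)/dh(t) = V^T dz(t)
    for k in range(t, -1, -1):             # k = t, t-1, ..., first step
        da = (1.0 - h[k] ** 2) * dh        # back through g1 = tanh at step k
        dW += np.outer(da, h[k - 1])       # the dh(k)/dW term of the sum over k
        dU += np.outer(da, xs[k])          # the dh(k)/dU term of the sum over k
        dh = W.T @ da                      # absorb one factor dh(k)/dh(k-1) of the product
    return dW, dU
```

Summing `grad_WU_step` over $t = 1, \dots, T$ and dividing by $T$ recovers $\frac{\partial J}{\partial W}$ and $\frac{\partial J}{\partial U}$ exactly as decomposed above; the running `dh` is what turns the nested product $\prod_j \frac{\partial h^{(j)}}{\partial h^{(j-1)}}$ into a sequence of cheap matrix-vector multiplications.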