SGD Optimizer From-Scratch Implementation
A minimal from-scratch implementation of the stochastic gradient descent optimizer defines a function sgd(params, states, hyperparams) that accepts three arguments: a list of model parameters, optimizer states (unused for vanilla SGD), and a dictionary of hyperparameters. For each parameter tensor, the function subtracts the product of the learning rate and the parameter's gradient using an in-place operation, then zeroes the gradient. In PyTorch:
```python
def sgd(params, states, hyperparams):
    for p in params:
        p.data.sub_(hyperparams['lr'] * p.grad)
        p.grad.data.zero_()
```
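As a quick sanity check, the update can be traced by hand on a tiny example. The sketch below repeats the function so it runs on its own; the tensor `w`, the quadratic loss, and the learning rate of 0.1 are illustrative assumptions, not part of the original implementation.

```python
import torch

def sgd(params, states, hyperparams):
    for p in params:
        # In-place update: p <- p - lr * grad
        p.data.sub_(hyperparams['lr'] * p.grad)
        # Reset the gradient so the next backward pass starts from zero
        p.grad.data.zero_()

# Illustrative parameter and loss (assumptions for this sketch)
w = torch.tensor([1.0, -2.0], requires_grad=True)
loss = (w ** 2).sum()   # d(loss)/dw = 2 * w = [2.0, -4.0]
loss.backward()

# states is unused by vanilla SGD, so None is passed here
sgd([w], None, {'lr': 0.1})

# Expected by hand: [1.0 - 0.1 * 2.0, -2.0 - 0.1 * (-4.0)] = [0.8, -1.6]
print(w.detach())
print(w.grad)
```

After the call, `w` should hold approximately `[0.8, -1.6]` and `w.grad` should be all zeros, matching the two steps described above.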
The signature, which takes params, states, and hyperparams, is deliberately general: more advanced optimizers introduced later (e.g., momentum, Adam) can share the same calling convention and use the states argument to hold their auxiliary variables.
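To illustrate how the states argument might be used, here is a minimal sketch of a momentum variant with the same calling convention. The function name `sgd_momentum`, the `'momentum'` hyperparameter key, and the layout of `states` as one velocity tensor per parameter are assumptions made for this example rather than details given above.

```python
import torch

def sgd_momentum(params, states, hyperparams):
    # states is assumed to hold one velocity tensor per parameter
    with torch.no_grad():
        for p, v in zip(params, states):
            # v <- momentum * v + grad (updated in place so the state persists)
            v.mul_(hyperparams['momentum']).add_(p.grad)
            # p <- p - lr * v
            p.sub_(hyperparams['lr'] * v)
            p.grad.zero_()

# Illustrative setup (assumed values)
w = torch.tensor([1.0], requires_grad=True)
states = [torch.zeros_like(w)]          # velocity starts at zero
hyperparams = {'lr': 0.1, 'momentum': 0.9}

for _ in range(2):
    loss = (w ** 2).sum()               # gradient is 2 * w
    loss.backward()
    sgd_momentum([w], states, hyperparams)

# Step 1: v = 2.0,               w = 1.0 - 0.1 * 2.0 = 0.8
# Step 2: v = 0.9 * 2.0 + 1.6 = 3.4, w = 0.8 - 0.1 * 3.4 = 0.46
print(w.item(), states[0].item())
```

The caller owns `states` and passes the same list on every step, so the velocity survives between calls without the optimizer function needing any global variables.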
Tags
D2L
Dive into Deep Learning @ D2L