As the number of examples in a training dataset increases, the computational cost of a single iteration of standard gradient descent grows linearly with it. Because of this per-iteration cost, stochastic gradient descent is generally preferred for large datasets.

Preference for Stochastic Gradient Descent with Large Datasets

When using standard gradient descent, the computational cost for each parameter update iteration is $$\mathcal{O}(n)$$, where $$n$$ is the number of examples in the training dataset. Because the full gradient computation requires evaluating the gradient of the loss function for every example, the cost grows linearly with the dataset size $$n$$. Consequently, standard gradient descent becomes highly expensive per iteration when applied to very large training datasets.
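The linear growth can be made concrete by counting per-example gradient evaluations. The sketch below is illustrative only: the squared-error loss and the helper names (`per_example_grad`, `full_gradient`, `stochastic_gradient`) are assumptions, not part of the original text. It compares the work needed for one full gradient with the work needed for a gradient estimate built from a single randomly chosen example.

```python
import numpy as np

def per_example_grad(x, a_i, b_i):
    """Gradient of the hypothetical loss f_i(x) = 0.5 * (a_i @ x - b_i)**2."""
    return (a_i @ x - b_i) * a_i

def full_gradient(x, A, b):
    """Average of all n per-example gradients; returns (gradient, evaluations used)."""
    n = A.shape[0]
    grad = np.zeros_like(x)
    for i in range(n):  # one gradient evaluation per training example
        grad += per_example_grad(x, A[i], b[i])
    return grad / n, n

def stochastic_gradient(x, A, b, rng):
    """Gradient of one uniformly sampled example; returns (gradient, evaluations used)."""
    i = rng.integers(A.shape[0])
    return per_example_grad(x, A[i], b[i]), 1

rng = np.random.default_rng(0)
x = np.zeros(3)
for n in (1_000, 10_000):
    A = rng.normal(size=(n, 3))  # synthetic inputs, one row per example
    b = rng.normal(size=n)       # synthetic targets
    _, full_evals = full_gradient(x, A, b)
    _, sgd_evals = stochastic_gradient(x, A, b, rng)
    print(n, full_evals, sgd_evals)
```

Multiplying the dataset size by ten multiplies the number of evaluations for the full gradient by ten, while the single-example estimate always costs one evaluation regardless of $$n$$.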

Claude

In deep learning, the objective function $$f(\mathbf{x})$$ is typically formulated as the average of the individual loss functions $$f_i(\mathbf{x})$$ across the $$n$$ examples in the training dataset, where $$\mathbf{x}$$ is the parameter vector. This formulation is given by: $$f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}).$$ Consequently, the full gradient of the objective function at $$\mathbf{x}$$ is the average of the gradients for each example: $$\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}).$$
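As a quick numerical check of this identity, the following sketch builds the full gradient as the mean of the per-example gradients and compares it with a finite-difference gradient of the averaged objective. The squared-error loss and the synthetic data (`A`, `b`) are assumptions chosen for illustration; any differentiable per-example loss would serve equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
A = rng.normal(size=(n, d))  # hypothetical inputs, one row per example
b = rng.normal(size=n)       # hypothetical targets
x = rng.normal(size=d)       # parameter vector

def f(x):
    """Objective: average of the per-example losses f_i(x) = 0.5 * (a_i @ x - b_i)**2."""
    return 0.5 * np.mean((A @ x - b) ** 2)

# Row i holds the gradient of f_i at x: (a_i @ x - b_i) * a_i
per_example_grads = (A @ x - b)[:, None] * A

# Full gradient as the average of the n per-example gradients
full_grad = per_example_grads.mean(axis=0)

# Independent check: central finite differences applied to f itself
eps = 1e-6
fd_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(d)])

print(np.allclose(full_grad, fd_grad, atol=1e-6))
```

Forming `per_example_grads` touches every one of the $$n$$ rows, which is where the per-iteration cost of the full gradient comes from.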
