In contrast to exhaustive search, the computational cost of greedy search in sequence-to-sequence generation grows linearly with the sequence length: it is $$\mathcal{O}(|\mathcal{Y}|T')$$, where $$|\mathcal{Y}|$$ is the vocabulary size and $$T'$$ is the sequence length. This makes greedy search very cheap, although its output can be far from optimal. For example, with a vocabulary of $$|\mathcal{Y}|=10000$$ and a sequence length of $$T'=10$$, greedy search only requires scoring $$10000 \times 10 = 10^5$$ candidate tokens.
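
To make the gap concrete, the short sketch below counts the work for both strategies. It assumes the standard count for exhaustive search, $$|\mathcal{Y}|^{T'}$$ candidate sequences, which the text above does not state explicitly; the function names `greedy_cost` and `exhaustive_cost` are illustrative only.

```python
def greedy_cost(vocab_size: int, seq_len: int) -> int:
    """Candidate tokens scored by greedy search: |Y| per step, T' steps."""
    return vocab_size * seq_len


def exhaustive_cost(vocab_size: int, seq_len: int) -> int:
    """Candidate sequences enumerated by exhaustive search: |Y| ** T'
    (assumed here; every token choice at every step is explored)."""
    return vocab_size ** seq_len


vocab_size, seq_len = 10_000, 10
print(greedy_cost(vocab_size, seq_len))       # 100000, i.e. 10**5
print(exhaustive_cost(vocab_size, seq_len))   # 10**40
```

Doubling the sequence length doubles the greedy cost, whereas it squares the exhaustive cost.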

Claude

In sequence-to-sequence models, the greedy search strategy is a straightforward decoding method where, at any time step $$t'$$, the model selects the single token from the vocabulary $$\mathcal{Y}$$ that has the highest conditional probability. This is mathematically expressed as: $$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$$ where $$\mathbf{c}$$ is the context vector representing the source input. The generation of the output sequence concludes once the model outputs the end-of-sequence token ("<eos>") or reaches a predefined maximum length $$T'$$.
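
The decoding loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `next_token_probs` is a hypothetical callable standing in for the model (the context vector $$\mathbf{c}$$ is assumed to be captured inside it), and the toy probability tables are made up for the example.

```python
from typing import Callable, Dict, List


def greedy_decode(
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    max_len: int,
    eos: str = "<eos>",
) -> List[str]:
    """Pick the argmax token at each step until <eos> or max_len is reached."""
    prefix: List[str] = []
    for _ in range(max_len):
        probs = next_token_probs(prefix)      # P(y | y_1..y_{t'-1}, c)
        token = max(probs, key=probs.get)     # argmax over the vocabulary
        prefix.append(token)
        if token == eos:
            break
    return prefix


def toy_next_token_probs(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical model: the distribution depends only on the step index."""
    by_step = [
        {"A": 0.5, "B": 0.2, "C": 0.2, "<eos>": 0.1},
        {"A": 0.1, "B": 0.4, "C": 0.2, "<eos>": 0.3},
        {"A": 0.2, "B": 0.2, "C": 0.4, "<eos>": 0.2},
        {"A": 0.2, "B": 0.1, "C": 0.1, "<eos>": 0.6},
    ]
    return by_step[min(len(prefix), len(by_step) - 1)]


print(greedy_decode(toy_next_token_probs, max_len=10))
```

With `max_len=10` the loop stops as soon as "<eos>" is selected; with a smaller `max_len` it stops at the length limit instead.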

Greedy Search Strategy in Sequence-to-Sequence Models

Dive into Deep Learning

Computational Cost of Greedy Search in Sequence-to-Sequence Models

Greedy search, which selects the single most probable token at each step, is equivalent to beam search with a beam size of $$k=1$$.

Greedy Search as a Special Case of Beam Search

Beam search is a sequence decoding strategy that strikes a compromise between the computational efficiency of greedy search and the optimality of exhaustive search. Instead of greedily picking the single most likely token or exploring all possible paths, beam search evaluates and retains a predetermined number of the most promising candidate sequences at each step of the generation process.
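
A simplified sketch of this idea follows. It makes several assumptions that are not part of the description above: hypotheses are ranked by their raw summed log-probability (no length normalization), finished hypotheses are set aside as soon as they end in "<eos>", and the model is represented by a hypothetical `next_token_probs` callable with made-up toy probabilities.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[float, List[str]]  # (log-probability, tokens so far)


def beam_search(
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    beam_size: int,
    max_len: int,
    eos: str = "<eos>",
) -> List[str]:
    """Keep the beam_size best partial sequences at every step."""
    beams: List[Hypothesis] = [(0.0, [])]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every surviving hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for log_p, tokens in beams:
            for tok, p in next_token_probs(tokens).items():
                if p > 0:
                    candidates.append((log_p + math.log(p), tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Retain only the top beam_size candidates.
        beams = []
        for log_p, tokens in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((log_p, tokens))
            else:
                beams.append((log_p, tokens))
        if not beams:
            break
    finished.extend(beams)  # include hypotheses cut off by max_len
    return max(finished, key=lambda c: c[0])[1]


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical next-token distributions, chosen so greedy is suboptimal."""
    table = {
        (): {"A": 0.5, "B": 0.4, "<eos>": 0.1},
        ("A",): {"A": 0.3, "B": 0.3, "<eos>": 0.4},
        ("B",): {"A": 0.05, "B": 0.05, "<eos>": 0.9},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})


print(beam_search(toy_model, beam_size=1, max_len=3))  # behaves like greedy search
print(beam_search(toy_model, beam_size=2, max_len=3))
```

In this toy setup a beam size of 1 commits to "A" at the first step and ends with a sequence of probability 0.5 × 0.4 = 0.2, while a beam size of 2 also keeps "B" alive and finds a sequence of probability 0.4 × 0.9 = 0.36.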

Beam Search Strategy in Sequence-to-Sequence Models

Consider a sequence generation model with an output dictionary consisting of the tokens "A", "B", "C", and "<eos>". A greedy search strategy selects the token with the highest conditional probability at each time step. Suppose that at step 1, the token with the highest probability is "A" (probability 0.5). At step 2, conditioned on generating "A", the most probable token is "B" (probability 0.4). At step 3, conditioned on "A" and "B", the most probable token is "C" (probability 0.4). Finally, at step 4, the model selects "<eos>" (probability 0.6). The greedy search algorithm therefore predicts the sequence "A", "B", "C", and "<eos>". The conditional probability of this entire output sequence is the product of the individual probabilities: $$ 0.5 \times 0.4 \times 0.4 \times 0.6 = 0.048 $$.
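
The product in this example can be checked directly. The snippet below is a small sketch using only the four step probabilities quoted above; it also computes the same quantity as a sum of log-probabilities, which is a common way to avoid numerical underflow on longer sequences (an aside, not something the example itself requires).

```python
import math

# Per-step probabilities of the greedily chosen tokens "A", "B", "C", "<eos>".
step_probs = [0.5, 0.4, 0.4, 0.6]

# Probability of the whole sequence: product of the conditional probabilities.
sequence_prob = math.prod(step_probs)

# Equivalent computation in log space.
sequence_log_prob = sum(math.log(p) for p in step_probs)

print(round(sequence_prob, 6))                 # 0.048
print(round(math.exp(sequence_log_prob), 6))   # 0.048
```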
