Based on the provided scenario, identify the most likely architectural reason for the model's sharp performance decline on longer documents and explain your reasoning.


A major disadvantage of assigning a unique learnable value to every possible sequence offset is that the model is tied to the distances it observed during training. If the model processes a sequence in which the offset $$i - j$$ exceeds the maximum distance seen during training, it has no learned parameter for that distance, so it cannot generalize to longer inputs.
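The limitation can be made concrete with a minimal sketch. It assumes a causal model trained on at most 1024 tokens and stores one bias per offset in a plain list. The names `bias_table`, `offset_bias`, and `count_uncovered_pairs` are hypothetical, and random values stand in for trained parameters.

```python
import random

# Assumed training limit for illustration: offsets 0..1023 were seen in training.
TRAIN_MAX_LEN = 1024

random.seed(0)
# One bias per relative offset; random values stand in for learned ones.
bias_table = [random.gauss(0.0, 0.02) for _ in range(TRAIN_MAX_LEN)]


def offset_bias(i, j):
    """Return the bias for query position i attending to key position j (i >= j)."""
    offset = i - j
    if offset >= len(bias_table):
        # No parameter was ever allocated or trained for this distance.
        raise IndexError(f"no learned bias for offset {offset}")
    return bias_table[offset]


def count_uncovered_pairs(seq_len):
    """Count causal (i, j) pairs whose offset i - j has no entry in the table."""
    excess = seq_len - len(bias_table)
    return excess * (excess + 1) // 2 if excess > 0 else 0


# Every pair in a 1024-token sequence is covered by the table.
print(count_uncovered_pairs(1024))
# A 1500-token sequence contains pairs with offsets 1024..1499, which are not.
print(count_uncovered_pairs(1500))
```

In this sketch the lookup fails outright for an unseen offset. A real implementation might avoid the error by some fallback, but the underlying problem is the same: no value was ever trained for those distances.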

Generalization Limit of Offset-Specific Biases

A language model is trained exclusively on text sequences with a maximum length of 1024 tokens. Its design includes a component where a unique, learnable numerical bias is assigned to every possible relative distance between token pairs (e.g., a specific bias for a distance of 1, another for a distance of 2, up to the maximum possible distance in the training data). What is the most likely outcome when this model is later tasked with processing a document of 1500 tokens?

A language model's architecture incorporates a mechanism where a unique, learnable value is assigned to every possible relative distance between any two tokens. For example, a specific value is learned for a distance of 5, another for a distance of 6, and so on, up to the maximum distance encountered in the training data. Evaluate the suitability of this design for a model intended to process documents of highly variable and potentially unbounded lengths. Justify your reasoning.
