Learn Before
Segment-Level Recurrence for Memory Models
To improve computational efficiency, recurrence can be applied at the segment level rather than to individual tokens. A simple approach is to divide the input sequence into segments and treat each segment's key-value sequence, {(k_i, v_i)}, as a single unit. Applying recurrent models to the memory update function, Mem_new = f(segment, Mem_old), results in memory models that operate directly on these larger chunks of the sequence.
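As a rough illustration (the helper names and the concrete update rule below are hypothetical, not from the source), segment-level recurrence can be sketched in a few lines: the key-value pairs are grouped into fixed-size segments, and the memory update f is applied once per segment rather than once per token.

```python
def segment_level_memory(kv_pairs, segment_len, mem0, f):
    """Apply the recurrent update Mem_new = f(segment, Mem_old)
    once per block of `segment_len` consecutive (k, v) pairs."""
    mem = mem0
    for start in range(0, len(kv_pairs), segment_len):
        segment = kv_pairs[start:start + segment_len]
        mem = f(segment, mem)  # one update per segment, not per token
    return mem

# Hypothetical update rule f: keep running sums of keys and values,
# plus a count of how many memory updates were performed.
def f(segment, mem_old):
    k_sum, v_sum, updates = mem_old
    k_sum += sum(k for k, v in segment)
    v_sum += sum(v for k, v in segment)
    return (k_sum, v_sum, updates + 1)

pairs = [(float(i), float(2 * i)) for i in range(10)]
mem = segment_level_memory(pairs, segment_len=4, mem0=(0.0, 0.0, 0), f=f)
# 10 pairs with segment_len=4 -> only 3 memory updates (segments of 4, 4, 2)
```

With token-level recurrence the same input would trigger 10 updates; grouping into segments of 4 reduces this to 3, which is the efficiency gain the text refers to.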
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Neural Network as a Memory Component
Segment-Level Recurrence for Memory Models
A memory-based attention mechanism updates its fixed-size memory state, Mem, at each time step i using a general recurrent formula: Mem_new = f((k_i, v_i), Mem_old), where (k_i, v_i) is the current key-value pair and Mem_old is the memory state from the previous step. Which of the following update procedures does NOT conform to this recurrent structure?

Calculating a Recurrent Memory State
Consider a memory update process defined by the recurrent function Mem_new = f((k_i, v_i), Mem_old), where (k_i, v_i) is the input at the current step and Mem_old is the memory state from the previous step. To compute the memory state for step 100, this process requires direct access to the individual key-value pairs from all 99 preceding steps (i.e., from step 1 to 99).

Formula for Memory as a Cumulative Average of Keys and Values
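The recurrent structure above can be made concrete with a short sketch (the cumulative-average choice of f below is a hypothetical example, not the source's definition): because each step consumes only the current pair and the previous memory state, the memory at step 100 is computed without revisiting the individual pairs from steps 1 to 99.

```python
# Token-level recurrence: Mem_new = f((k_i, v_i), Mem_old).
# Hypothetical f: maintain the cumulative average of all keys and values
# seen so far, using only the previous averages and the running count.
def f(kv, mem_old):
    k_i, v_i = kv
    k_avg, v_avg, n = mem_old
    # Incremental average update: no earlier (k, v) pairs are accessed.
    return ((k_avg * n + k_i) / (n + 1),
            (v_avg * n + v_i) / (n + 1),
            n + 1)

mem = (0.0, 0.0, 0)
for i in range(1, 101):                    # steps 1 .. 100
    mem = f((float(i), float(i * i)), mem)  # only (k_i, v_i) and Mem_old used
```

After step 100, mem holds the averages of k_1..k_100 and v_1..v_100 even though the loop never stored or re-read any past pair, which is exactly why the fixed-size memory state suffices.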
Learn After
FIFO Function as a Memory Update Example
Two-Segment Memory in Segment-Level Recurrence
Recurrent Memory Update using Segments
A language model is designed to process very long documents. Two memory update strategies are being considered. Strategy A updates the model's memory after processing each individual input unit. Strategy B updates the memory only after processing a block of 128 consecutive input units. What is the primary trade-off when choosing Strategy B over Strategy A?
A language model processes text by grouping it into non-overlapping blocks of 128 tokens. The model's memory is updated only after an entire block is processed. A developer observes that the model frequently fails to capture dependencies between the last word of one block and the first word of the very next block. What is the most direct cause of this specific issue?
Trade-offs in Memory Update Strategies
Optimizing a Language Model for Long Document Processing