Learn Before
A team is developing a language model for processing lengthy legal documents. They use a dual-memory architecture: a 'local memory' that stores the most recent 1024 tokens and a 'compressive memory' that stores a summarized representation of older text. To allow a query (representing a new token) to access information from both recent and long-term history, how should the attention mechanism be structured?
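One structure consistent with the related card "Functional Role of Memory Concatenation in Attention" is to concatenate the keys and values from the compressive memory with those from the local memory, so a single softmax lets the query distribute attention across both recent and long-term context. The sketch below is a minimal illustration of that idea, not the team's actual implementation; the function name dual_memory_attention, the memory sizes, and the NumPy toy data are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dual_memory_attention(query, local_k, local_v, comp_k, comp_v):
    """Single-query attention over local and compressive memory jointly.

    query:   (d,)    representation of the new token
    local_k: (L, d)  keys for the most recent tokens (local memory)
    local_v: (L, d)  values for the most recent tokens
    comp_k:  (M, d)  keys for the summarized older text (compressive memory)
    comp_v:  (M, d)  values for the summarized older text
    """
    d = query.shape[-1]
    # Concatenate both memories along the sequence axis so one softmax
    # spreads attention weight across recent and long-term history.
    keys = np.concatenate([comp_k, local_k], axis=0)    # (M + L, d)
    values = np.concatenate([comp_v, local_v], axis=0)  # (M + L, d)
    scores = keys @ query / np.sqrt(d)                  # (M + L,)
    weights = softmax(scores)
    return weights @ values                             # (d,)

# Toy example with assumed sizes: 1024 local slots, 64 compressed slots.
rng = np.random.default_rng(0)
d, L, M = 64, 1024, 64
out = dual_memory_attention(
    rng.normal(size=d),
    rng.normal(size=(L, d)), rng.normal(size=(L, d)),
    rng.normal(size=(M, d)), rng.normal(size=(M, d)),
)
print(out.shape)  # (64,)
```

Because both memories feed the same softmax, the model can trade off recent detail against compressed long-term context on a per-token basis rather than querying the two stores separately.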
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating a Dual-Memory Attention Mechanism
Functional Role of Memory Concatenation in Attention