Read the following scenario and answer the questions that follow.

Google

In the study of Large Language Models (LLMs), compositional reasoning tasks, such as the SCAN benchmark, are specifically designed to evaluate a model's capacity for compositional generalization. These tasks are crucial for testing the depth of an LLM's language understanding and reasoning, and they also help drive the development of more advanced problem decomposition methods.

Compositional Reasoning Tasks for LLMs

The SCAN (Simplified versions of the CommAI Navigation tasks) benchmark is a set of tasks created to test a Large Language Model's capacity for compositional generalization. These tasks require the model to translate natural language instructions into a corresponding sequence of actions.

SCAN Benchmark

A research team is designing a task to evaluate a language model's ability to understand and execute novel combinations of familiar instructions. The model will be trained on a set of commands and their corresponding action sequences. Which of the following training and testing splits would provide the most rigorous and direct assessment of the model's compositional reasoning capabilities?

Diagnosing LLM Generalization Failure

A new language model achieves 99% accuracy on a test set where the examples are structured very similarly to its training data. However, it performs poorly on a different set of tasks that require it to execute novel combinations of familiar commands (e.g., it was trained on 'walk twice' and 'jump', but tested on 'walk after jumping'). Critique the claim that the 99% accuracy score represents a comprehensive understanding of the language. What does the model's poor performance on the second task set reveal about the nature of its learning and the limitations of the initial evaluation method?

Learn Before

Related