
Model Size Comparison of LLMs

The size of a Large Language Model (LLM) is determined by its parameter count, which scales with its depth (denoted L), width (denoted d), and the number of attention heads (for queries, keys, and values). For instance, an early model like GPT-1 had 0.117 billion parameters with a depth of 12 and a width of 768. Modern models have scaled massively in contrast: the LLaMA series reaches up to 405 billion parameters, while DeepSeek-V3 reaches 671 billion parameters. Other prominent model families that illustrate this variation in structural setup include Gemma2, Qwen2.5, Falcon, and Mistral.
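The relationship between depth, width, and parameter count can be sketched with a back-of-the-envelope estimate. The sketch below uses the common approximation of about 12d² parameters per Transformer layer (attention projections plus a feed-forward block with hidden size 4d) and a tied token-embedding matrix; the function name and the ~40k vocabulary size assumed for GPT-1 are illustrative assumptions, not values from the text.

```python
def estimate_params(L: int, d: int, vocab_size: int) -> int:
    """Rough parameter estimate for a decoder-only Transformer.

    Per layer: ~4*d*d for the Q, K, V, and output projections of attention,
    plus ~8*d*d for a feed-forward block with hidden size 4*d, i.e. ~12*d*d.
    The embedding matrix adds vocab_size*d (often tied with the output head).
    Biases and layer norms are ignored; they are negligible at this scale.
    """
    per_layer = 12 * d * d
    embedding = vocab_size * d
    return L * per_layer + embedding

# GPT-1-like configuration: depth 12, width 768, ~40k BPE vocabulary (assumed).
print(estimate_params(12, 768, 40478))  # ~1.16e8, close to the 0.117B cited
```

The estimate recovers roughly the 0.117 billion parameters cited for GPT-1, showing that depth and width alone pin down most of a model's size once the vocabulary is fixed.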


Updated 2026-04-19


Tags

Foundations of Large Language Models

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences