Learn Before
Comparing Parameter Sharing Strategies
Consider two different approaches for building a deep, multi-layered neural network.
Approach A: The network is constructed by stacking the exact same computational block (with a single, shared set of weights) multiple times.
Approach B: Each layer in the network has its own unique weights for most of its operations, but for one specific, computationally expensive part of the block (e.g., the Key and Value projection matrices in an attention mechanism), it reuses the outputs already computed by the preceding layer instead of recomputing them.
Analyze these two approaches. Compare and contrast their likely effects on the final model's total parameter count, its ability to learn distinct features at different depths, and its memory usage during operation.
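To make the comparison concrete, here is a minimal sketch of the two approaches, assuming a PyTorch-style framework. The class names, dimensions, and the choice to drop the K/V projection weights in the later layers of Approach B are illustrative assumptions, not a reference implementation.

import torch
import torch.nn as nn

class Block(nn.Module):
    # One transformer-style block; has_kv=False omits the K/V projection weights.
    def __init__(self, d_model, has_kv=True):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.mlp = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model) if has_kv else None
        self.v_proj = nn.Linear(d_model, d_model) if has_kv else None

    def forward(self, x, shared_k=None, shared_v=None):
        q = self.q_proj(x)
        # If K/V outputs from a preceding layer are supplied, reuse them.
        k = self.k_proj(x) if shared_k is None else shared_k
        v = self.v_proj(x) if shared_v is None else shared_v
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
        return self.mlp(attn @ v), k, v

class ApproachA(nn.Module):
    # One block, one set of weights, applied n_layers times.
    def __init__(self, d_model, n_layers):
        super().__init__()
        self.block = Block(d_model)       # parameters exist only once
        self.n_layers = n_layers

    def forward(self, x):
        for _ in range(self.n_layers):
            x, _, _ = self.block(x)       # every depth reuses the same weights
        return x

class ApproachB(nn.Module):
    # Every layer owns its weights, but layers after the first reuse
    # the K/V outputs handed down from the preceding layer.
    def __init__(self, d_model, n_layers):
        super().__init__()
        self.blocks = nn.ModuleList(
            [Block(d_model)]
            + [Block(d_model, has_kv=False) for _ in range(n_layers - 1)]
        )

    def forward(self, x):
        x, k, v = self.blocks[0](x)                     # K/V computed once
        for block in self.blocks[1:]:
            x, k, v = block(x, shared_k=k, shared_v=v)  # reuse preceding layer's K/V
        return x

count = lambda m: sum(p.numel() for p in m.parameters())
x = torch.randn(2, 8, 64)
a, b = ApproachA(64, 6), ApproachB(64, 6)
print(count(a), count(b))        # A holds one block's weights; B holds per-layer weights
print(a(x).shape, b(x).shape)

Running count() on both models makes the parameter-count contrast explicit, while ApproachB's forward pass shows where the reuse happens: K and V are produced once and passed along rather than recomputed at every layer.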
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Cross-Layer Parameter Sharing in BERT
Cross-layer Multi-head Attention
A team of engineers is designing a deep neural network for a resource-constrained environment, such as a mobile device. To reduce the model's size, they implement a design where the same computational block, with its entire set of weights, is reused at every layer of the network. What is the most significant trade-off the engineers must consider with this approach?
Analyzing a Novel Transformer Architecture