Mixture of Depths (MoD): Dynamic Compute Allocation in Transformer Models

Introduction

Imagine reading a complex book. Do you spend equal time and effort on every word? Certainly not! Some sentences are simple and you can quickly move past them, while others are complex and require more focus. This is exactly the problem that transformer-based language models face.
In traditional transformer architectures, all tokens (the basic units of text a model processes) pass through every layer of the model and receive the same computation. What does this mean? It means the model spends the same computational effort on a simple word like "and" as it does on a complex technical phrase. This approach is not only inefficient but also leads to enormous computational costs during model training and inference.
But what if we could teach language models to allocate their computational resources intelligently, like humans do? What if the model could decide for itself which tokens need deep processing and which can pass through layers with minimal computation? This is exactly what Mixture of Depths (MoD) aims to achieve.

What is Mixture of Depths? A Look at the Intelligent Architecture

Mixture of Depths, or MoD for short, is an innovative technique that allows transformer models to dynamically allocate computations to different tokens at different sequence positions. Simply put, MoD gives the model the power to decide for each token whether it should go through the full processing path (self-attention and MLP) or use a residual connection and skip computations.

Base Architecture and MoD Structure

In MoD, a fixed computational budget is enforced by limiting the number of tokens that can participate in a given layer's self-attention and MLP computations. This means:
  1. Routing Mechanism: In each layer, a router network assigns a numerical weight to each token
  2. Top-k Selection: Tokens with the highest weights (top-k) are selected for full processing
  3. Residual Path: Remaining tokens pass through the residual connection and remain unchanged
Since k is predefined, this method uses a static computational graph with determined tensor sizes, which enables efficient implementation.
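To make the budget enforcement concrete, here is a minimal PyTorch-style sketch (the shapes, the 25% capacity, and the single linear router are illustrative assumptions, not the paper's configuration) of how a router scores tokens and a fixed top-k pick keeps every tensor size known in advance:

```python
import torch
import torch.nn as nn

# Toy shapes: batch of 2 sequences, 16 tokens each, model width 64 (all illustrative).
batch, seq_len, d_model = 2, 16, 64
capacity = 0.25                      # fraction of tokens given the full compute path
k = int(capacity * seq_len)          # k is fixed in advance -> static tensor sizes

x = torch.randn(batch, seq_len, d_model)

# The router is just a linear projection producing one scalar weight per token.
router = nn.Linear(d_model, 1)
scores = router(x).squeeze(-1)       # (batch, seq_len)

# Top-k selection: only these k tokens will enter self-attention and the MLP;
# everything else flows through the residual connection unchanged.
topk_scores, topk_idx = torch.topk(scores, k, dim=-1)
print(topk_idx.shape)                # torch.Size([2, 4]) -- known before any data arrives
```

Because k never changes at runtime, the shapes of the selected-token tensors are fixed, which is what makes the static computational graph and efficient implementation possible.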

Differences Between MoD and Mixture of Experts (MoE)

You may be familiar with the concept of Mixture of Experts (MoE). The MoD approach uses the routing logic of MoE transformers, but instead of having multiple experts, MoD employs a single expert that can be dynamically skipped.
While MoE focuses on expanding the "width" of the model (by adding multiple experts), the MoD routing mechanism can be thought of as a "depth sparsity" version of how MoE models scale.
Key differences:
  • MoE: Tokens are routed to different experts, all tokens are processed
  • MoD: Tokens are either processed or skipped, reducing computational cost
  • MoE: Applied only to MLP layers
  • MoD: Applied to both self-attention and MLP

Why MoD? Advantages and Practical Applications

Significant Reduction in Computational Cost

In experiments, MoD models were able to match the performance of standard transformers while using only 50% of computations per forward pass. This means:
  • 50% reduction in FLOPs: Half the computational operations for similar results
  • Higher speed: Faster processing during training and inference
  • Energy savings: Lower energy consumption for data centers

Improved Performance with Equal Budget

Additionally, when given the same training FLOPs budget, MoD models achieved up to 1.5% better final log-probability (the training objective) than standard transformers. This shows that MoD is not only more efficient but can also help the model learn better.

Intelligent Routing and Meaningful Learning

Learned routing is essential - models using random routing performed significantly worse. This shows the model truly learns which tokens are important and which can be skipped.
When examining the model's choices, one finds tokens that still engage in the full computations of late layers even though they skipped most of the earlier blocks. This distinguishes MoD from traditional early-exit methods, where a token that exits early never returns to later layers.

How MoD Works: From Theory to Practice

The Routing Mechanism in Detail

For each token, a "router" network generates a numerical weight. Then, the top-k tokens with the highest weights are selected for computation, while the rest are passed through a residual connection.
Exact steps (a complete code sketch follows this list):
  1. Input: Token sequence enters the MoD layer
  2. Weight Calculation: Router computes a score for each token
  3. Selection: k tokens with highest scores are selected
  4. Dual Processing:
    • Selected tokens: Pass through full self-attention and MLP
    • Other tokens: Pass directly through residual connection
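Putting these steps together, the following is a self-contained sketch of one MoD block in PyTorch. It is an illustrative reading of the steps above rather than the reference implementation: the pre-norm block layout, the sigmoid gating of the router score, and all sizes are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """A single transformer block with Mixture-of-Depths routing (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int, capacity: float):
        super().__init__()
        self.capacity = capacity
        self.router = nn.Linear(d_model, 1)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape
        k = max(1, int(self.capacity * seq_len))

        # 1) Weight calculation: the router produces one scalar score per token.
        scores = self.router(x).squeeze(-1)                      # (batch, seq_len)

        # 2) Selection: indices of the k highest-scoring tokens.
        topk_scores, topk_idx = torch.topk(scores, k, dim=-1)
        idx = topk_idx.unsqueeze(-1).expand(-1, -1, d_model)     # (batch, k, d_model)

        # 3) Full path for the selected tokens only: self-attention is computed
        #    among the k routed tokens.
        h = torch.gather(x, 1, idx)                              # (batch, k, d_model)
        a = self.norm1(h)
        attn_out, _ = self.attn(a, a, a, need_weights=False)
        full = h + attn_out
        full = full + self.mlp(self.norm2(full))

        # Scale the block's contribution by a gated router score so the router
        # receives gradients, then add it onto the selected tokens' residual stream.
        gate = torch.sigmoid(topk_scores).unsqueeze(-1)
        updated = h + gate * (full - h)

        # 4) Scatter the processed tokens back; all other tokens pass through unchanged.
        out = x.clone()
        out.scatter_(1, idx, updated)
        return out

# Usage: 12.5% of the 32 tokens (i.e. 4) take the full path.
block = MoDBlock(d_model=64, n_heads=4, capacity=0.125)
y = block(torch.randn(2, 32, 64))
print(y.shape)  # torch.Size([2, 32, 64])
```

Two details worth noting in the sketch: attention is computed only among the k routed tokens, which is where the quadratic savings come from, and the block's contribution is multiplied by the gated router score so that gradients flow back into the router.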

Optimal Settings

Routing at every other layer with 12.5% capacity (i.e., only 12.5% of tokens take the full path in MoD layers) achieved the best results; a back-of-envelope check of what this implies for compute follows the list below. This finding shows that:
  • Not all tokens need processing in all layers
  • Even processing a small percentage of tokens can be sufficient
  • Balance between efficiency and performance is critical
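As a rough sanity check on these settings, the calculation below estimates the average per-block compute; the split between attention and MLP FLOPs is an assumed value, not a measured one.

```python
# Back-of-envelope estimate under assumed numbers (not taken from the paper):
# a dense block spends roughly `attn_share` of its FLOPs on self-attention and the
# rest on the MLP; with MoD, attention cost shrinks quadratically with the routed
# fraction of tokens, while MLP cost shrinks linearly.
capacity = 0.125          # 12.5% of tokens take the full path in MoD layers
attn_share = 0.35         # assumed share of a block's FLOPs spent on attention

mod_layer_cost = attn_share * capacity**2 + (1 - attn_share) * capacity
dense_layer_cost = 1.0

# Routing at every other layer: half the blocks are MoD, half stay dense.
avg_cost = 0.5 * mod_layer_cost + 0.5 * dense_layer_cost
print(f"~{avg_cost:.0%} of a vanilla transformer's per-block FLOPs")   # ≈ 54%
```

The estimate lands near the roughly 50% FLOPs figure quoted earlier; the exact saving depends on sequence length and model shape.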

Combining MoD with MoE: Doubled Power

The MoD technique can be implemented alongside MoE (together forming MoDE models) in two simple ways: staged, which implements MoD machinery before MoE machinery, and integrated, which uses a single routing operation to route tokens to experts or no-op operations.

MoDE Models (Mixture-of-Depths-and-Experts)

Two approaches to combination:
  1. Staged MoDE: MoD first, then MoE
    • Tokens first decide whether to enter the block
    • Then routed to different experts
    • Allows skipping self-attention
  2. Integrated MoDE: Unified routing
    • Single router for both decisions
    • "no-op" experts alongside regular experts
    • Simpler structure
Implementing MoDE in the integrated manner significantly outperformed simply reducing expert capacity in regular MoE models, as tokens explicitly learn to choose the residual path.
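A minimal sketch of the integrated variant is shown below, assuming top-1 routing and treating router index 0 as the no-op expert; the expert count, the softmax gating, and the routing rule are illustrative choices rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class IntegratedMoDELayer(nn.Module):
    """Integrated MoDE sketch: one router chooses among MLP experts or a no-op (skip)."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        # Router output 0 is the "no-op expert": choosing it means the token keeps
        # only its residual value and skips the MLP entirely.
        self.router = nn.Linear(d_model, n_experts + 1)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(self.router(x), dim=-1)   # (batch, seq, n_experts + 1)
        choice = probs.argmax(dim=-1)                    # top-1 routing decision per token
        out = x.clone()
        for e, expert in enumerate(self.experts, start=1):
            mask = choice == e                           # tokens routed to expert e
            if mask.any():
                gate = probs[..., e][mask].unsqueeze(-1)
                out[mask] = x[mask] + gate * expert(x[mask])
        # Tokens whose argmax landed on index 0 took the no-op path: they remain
        # exactly the residual input, at zero extra cost.
        return out

# Usage on a toy batch.
layer = IntegratedMoDELayer(d_model=64, n_experts=4)
print(layer(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64])
```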

Practical Applications in Multimodal Language Models (MLLMs)

p-MoD: Advanced Adaptation for Multimodal Models

The p-MoD model matches or even surpasses baseline models with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and 77.7% GPU hours during training.
These results demonstrate:
  • 44.4% reduction in inference cost
  • 46.2% reduction in memory storage
  • 22.3% reduction in training time

γ-MoD: Intelligent Adaptation

For example, with only a minor 1.5% performance drop, γ-MoD can reduce LLaVA-HR training and inference time by 31% and 53.2% respectively.
γ-MoD uses an innovative metric called ARank (Rank of Attention Maps) to detect which layers have redundancy and should be replaced with MoD layers.
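The precise ARank formula comes from the γ-MoD paper; purely to illustrate the idea, a rough estimate of the numerical rank of a layer's attention maps could look like the sketch below (the tolerance and the averaging over heads are assumptions).

```python
import torch

def attention_map_rank(attn: torch.Tensor, tol: float = 1e-3) -> float:
    """Estimate the numerical rank of attention maps, averaged over heads.

    `attn` is assumed to have shape (n_heads, seq_len, seq_len). The exact ARank
    definition lives in the γ-MoD paper; this only illustrates the underlying idea
    that low-rank attention maps signal a redundant layer.
    """
    singular_values = torch.linalg.svdvals(attn)                    # (n_heads, seq_len)
    normalized = singular_values / singular_values.max(dim=-1, keepdim=True).values
    ranks = (normalized > tol).sum(dim=-1).float()                  # per-head rank
    return ranks.mean().item()

# Layers whose average attention-map rank falls below a chosen threshold would be
# candidates for replacement with MoD layers.
attn = torch.softmax(torch.randn(8, 64, 64), dim=-1)
print(attention_map_rank(attn))
```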

Progressive Ratio Decay (PRD) Strategy

Visual tokens show more redundancy in deeper layers, and therefore a progressive ratio decay (PRD) strategy is designed that gradually reduces the token retention ratio layer by layer.
This means (see the sketch after this list):
  • Early layers: Processing more tokens
  • Middle layers: Gradual token reduction
  • Deep layers: Focus on critical tokens
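As a toy illustration of such a schedule (the linear decay shape and the start/end ratios are assumptions; the actual p-MoD formula may differ):

```python
def retention_ratio(layer_idx: int, n_layers: int, start: float = 1.0, end: float = 0.1) -> float:
    """Linearly decay the fraction of visual tokens kept, from `start` to `end`."""
    frac = layer_idx / max(1, n_layers - 1)
    return start + (end - start) * frac

n_layers = 24
for layer_idx in (0, 8, 16, 23):
    print(layer_idx, f"{retention_ratio(layer_idx, n_layers):.2f}")
# 0 -> 1.00, 8 -> 0.69, 16 -> 0.37, 23 -> 0.10
```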

Performance Comparison: Experimental Results

Results on Language Models

Studies show that:
MoD models can match vanilla transformer models on the training objective while using as little as 50% of the FLOPs per forward pass, and are therefore faster.
Additionally, one can train a MoD transformer that improves up to 1.5% in final log probability objective for equivalent training FLOPs (isoFLOP).

Results on Multimodal Models

The p-MoD model shows comparable or even better performance than baseline models across 14 benchmarks in different domains, with a 46.2% reduction in KV cache storage and 44.4% reduction in TFLOPs during inference.

Challenges and Limitations

Implementation Complexity

Integrating MoD into MLLMs is challenging. To address the resulting stability issues, designs such as the following have been introduced:
  • TanhNorm: Weight normalization with tanh gate
  • STRing: Symmetric token reweighting
These designs are essential for improving training and inference stability.

Need for Precise Training

As noted above, learned routing is essential: models using random routing performed significantly worse. This implies a need for:
  • Precise training of routing mechanism
  • Hyperparameter tuning
  • More initial training time

The Future of MoD and Potential Possibilities

Path Toward AGI

MoD's efficiency gains mean we could train large models, with parameter counts exceeding 3T, that retain more knowledge from their training data and are more effective at problem-solving.
Future potentials:
  • Larger models with better efficiency
  • Mobile device deployment: a realistic path to running a capable language model on a smartphone or personal computer, since the computational requirements are much lower than those of existing transformer models
  • Reduced carbon footprint: Less energy, less pollution

Future Research

MoD raises interesting questions for future research: Analyzing how the model learns to prioritize tokens for processing can provide insights into the model's inner workings and its understanding of language.
Research areas:
  • Long-term memory management
  • Different types of computations
  • Combination with other efficiency techniques
  • Better interpretability

Connection to Related Technologies

MoD exists within a broader ecosystem of AI optimization techniques.

Real-World and Industrial Applications

Advanced Natural Language Processing

MoD can be applied across a wide range of Natural Language Processing applications.

Multimodal Models

In the field of Multimodal AI Models:
  • Simultaneous image and text processing
  • Multimodal content generation
  • Deep semantic understanding

Efficient Machine Learning

MoD also connects to broader Machine Learning and optimization techniques aimed at more efficient training and inference.

Conclusion: A More Efficient Future with MoD

MoD represents a significant advance in efficient language modeling and offers a compelling alternative to traditional transformer architectures. Its ability to dynamically allocate computational resources leads to improved performance, faster inference, and better resource efficiency.
Key Points:
  1. 50% reduction in computational cost while maintaining or improving performance
  2. Intelligent routing where the model learns which tokens are important
  3. Composability with other techniques like MoE
  4. Wide applications from language models to multimodal
As MoD continues to develop and gain adoption, we can expect more efficient, faster, and more accessible artificial intelligence models. This technology is not only critical for advancing artificial intelligence but also essential for building a more sustainable future with lower energy consumption.
The path toward AGI and autonomous artificial intelligence requires such innovations that can combine power with efficiency. MoD demonstrates that we can build models that are not only smarter but also wiser in their use of resources - exactly as the human brain operates with its unparalleled efficiency.