Mixture of Depths (MoD): Dynamic Compute Allocation in Transformer Models

Introduction

Imagine reading a complex book. Do you spend equal time and effort on every word? Certainly not! Some sentences are simple and you can quickly move past them, while others are complex and require more focus. This is exactly the problem that transformer-based language models face.
In traditional transformer architectures, all tokens (the basic units of text a model processes) pass through every layer of the model and receive the same computation. What does this mean? It means the model spends the same computational effort on a simple word like "and" as it does on a complex technical phrase. This approach is not only inefficient but also leads to enormous computational costs during model training and inference.
But what if we could teach language models to allocate their computational resources intelligently, like humans do? What if the model could decide for itself which tokens need deep processing and which can pass through layers with minimal computation? This is exactly what Mixture of Depths (MoD) aims to achieve.

What is Mixture of Depths? A Look at the Intelligent Architecture

Mixture of Depths, or MoD for short, is an innovative technique that allows transformer models to dynamically allocate computations to different tokens at different sequence positions. Simply put, MoD gives the model the power to decide for each token whether it should go through the full processing path (self-attention and MLP) or use a residual connection and skip computations.

Base Architecture and MoD Structure

In MoD, a fixed computational budget is enforced by limiting the number of tokens that can participate in a given layer's self-attention and MLP computations. This means:
  1. Routing Mechanism: In each layer, a router network assigns a numerical weight to each token
  2. Top-k Selection: Tokens with the highest weights (top-k) are selected for full processing
  3. Residual Path: Remaining tokens pass through the residual connection and remain unchanged
Since k is predefined, this method uses a static computational graph with determined tensor sizes, which enables efficient implementation.
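To make the budget enforcement concrete, here is a minimal PyTorch-style sketch (the shapes, the 25% capacity, and the single linear router are illustrative assumptions, not the paper's configuration) of how a router scores tokens and a fixed top-k pick keeps every tensor size known in advance:

```python
import torch
import torch.nn as nn

# Toy shapes: batch of 2 sequences, 16 tokens each, model width 64 (all illustrative).
batch, seq_len, d_model = 2, 16, 64
capacity = 0.25                      # fraction of tokens given the full compute path
k = int(capacity * seq_len)          # k is fixed in advance -> static tensor sizes

x = torch.randn(batch, seq_len, d_model)

# The router is just a linear projection producing one scalar weight per token.
router = nn.Linear(d_model, 1)
scores = router(x).squeeze(-1)       # (batch, seq_len)

# Top-k selection: only these k tokens will enter self-attention and the MLP;
# everything else flows through the residual connection unchanged.
topk_scores, topk_idx = torch.topk(scores, k, dim=-1)
print(topk_idx.shape)                # torch.Size([2, 4]) -- known before any data arrives
```

Because k never changes at runtime, the shapes of the selected-token tensors are fixed, which is what makes the static computational graph and efficient implementation possible.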

Differences Between MoD and Mixture of Experts (MoE)

You may be familiar with the concept of Mixture of Experts (MoE). The MoD approach uses the routing logic of MoE transformers, but instead of having multiple experts, MoD employs a single expert that can be dynamically skipped.
While MoE focuses on expanding the "width" of the model (by adding multiple experts), the MoD routing mechanism can be thought of as a "depth sparsity" version of how MoE models scale.
Key differences:
  • MoE: Tokens are routed to different experts, all tokens are processed
  • MoD: Tokens are either processed or skipped, reducing computational cost
  • MoE: Applied only to MLP layers
  • MoD: Applied to both self-attention and MLP

Why MoD? Advantages and Practical Applications

Significant Reduction in Computational Cost

In experiments, MoD models were able to match the performance of standard transformers while using only 50% of computations per forward pass. This means:
  • 50% reduction in FLOPs: Half the computational operations for similar results
  • Higher speed: Faster processing during training and inference
  • Energy savings: Lower energy consumption for data centers

Improved Performance with Equal Budget

Additionally, when given the same training FLOPs budget, MoD models achieved up to 1.5% better final log-probability (the training objective) than standard transformers. This shows that MoD is not only more efficient but can also help the model learn better.

Intelligent Routing and Meaningful Learning

Learned routing is essential - models using random routing performed significantly worse. This shows the model truly learns which tokens are important and which can be skipped.
When examining the model's choices, one finds tokens that still engage in the full computations of late layers even though they skipped most of the earlier blocks. This distinguishes MoD from traditional early-exit methods, where a token that exits early never returns to later layers.

How MoD Works: From Theory to Practice

The Routing Mechanism in Detail

For each token, a "router" network generates a numerical weight. Then, the top-k tokens with the highest weights are selected for computation, while the rest are passed through a residual connection.
Exact steps (a complete code sketch follows this list):
  1. Input: Token sequence enters the MoD layer
  2. Weight Calculation: Router computes a score for each token
  3. Selection: k tokens with highest scores are selected
  4. Dual Processing:
    • Selected tokens: Pass through full self-attention and MLP
    • Other tokens: Pass directly through residual connection
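Putting these steps together, the following is a self-contained sketch of one MoD block in PyTorch. It is an illustrative reading of the steps above rather than the reference implementation: the pre-norm block layout, the sigmoid gating of the router score, and all sizes are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """A single transformer block with Mixture-of-Depths routing (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int, capacity: float):
        super().__init__()
        self.capacity = capacity
        self.router = nn.Linear(d_model, 1)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape
        k = max(1, int(self.capacity * seq_len))

        # 1) Weight calculation: the router produces one scalar score per token.
        scores = self.router(x).squeeze(-1)                      # (batch, seq_len)

        # 2) Selection: indices of the k highest-scoring tokens.
        topk_scores, topk_idx = torch.topk(scores, k, dim=-1)
        idx = topk_idx.unsqueeze(-1).expand(-1, -1, d_model)     # (batch, k, d_model)

        # 3) Full path for the selected tokens only: self-attention is computed
        #    among the k routed tokens.
        h = torch.gather(x, 1, idx)                              # (batch, k, d_model)
        a = self.norm1(h)
        attn_out, _ = self.attn(a, a, a, need_weights=False)
        full = h + attn_out
        full = full + self.mlp(self.norm2(full))

        # Scale the block's contribution by a gated router score so the router
        # receives gradients, then add it onto the selected tokens' residual stream.
        gate = torch.sigmoid(topk_scores).unsqueeze(-1)
        updated = h + gate * (full - h)

        # 4) Scatter the processed tokens back; all other tokens pass through unchanged.
        out = x.clone()
        out.scatter_(1, idx, updated)
        return out

# Usage: 12.5% of the 32 tokens (i.e. 4) take the full path.
block = MoDBlock(d_model=64, n_heads=4, capacity=0.125)
y = block(torch.randn(2, 32, 64))
print(y.shape)  # torch.Size([2, 32, 64])
```

Two details worth noting in the sketch: attention is computed only among the k routed tokens, which is where the quadratic savings come from, and the block's contribution is multiplied by the gated router score so that gradients flow back into the router.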

Optimal Settings

Routing at every other layer with 12.5% capacity (i.e., only 12.5% of tokens take the full path in MoD layers) achieved the best results; a back-of-envelope check of what this implies for compute follows the list below. This finding shows that:
  • Not all tokens need processing in all layers
  • Even processing a small percentage of tokens can be sufficient
  • Balance between efficiency and performance is critical
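As a rough sanity check on these settings, the calculation below estimates the average per-block compute; the split between attention and MLP FLOPs is an assumed value, not a measured one.

```python
# Back-of-envelope estimate under assumed numbers (not taken from the paper):
# a dense block spends roughly `attn_share` of its FLOPs on self-attention and the
# rest on the MLP; with MoD, attention cost shrinks quadratically with the routed
# fraction of tokens, while MLP cost shrinks linearly.
capacity = 0.125          # 12.5% of tokens take the full path in MoD layers
attn_share = 0.35         # assumed share of a block's FLOPs spent on attention

mod_layer_cost = attn_share * capacity**2 + (1 - attn_share) * capacity
dense_layer_cost = 1.0

# Routing at every other layer: half the blocks are MoD, half stay dense.
avg_cost = 0.5 * mod_layer_cost + 0.5 * dense_layer_cost
print(f"~{avg_cost:.0%} of a vanilla transformer's per-block FLOPs")   # ≈ 54%
```

The estimate lands near the roughly 50% FLOPs figure quoted earlier; the exact saving depends on sequence length and model shape.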

Combining MoD with MoE: Doubled Power

The MoD technique can be implemented alongside MoE (together forming MoDE models) in two simple ways: staged, which implements MoD machinery before MoE machinery, and integrated, which uses a single routing operation to route tokens to experts or no-op operations.

MoDE Models (Mixture-of-Depths-and-Experts)

Two approaches to combination:
  1. Staged MoDE: MoD first, then MoE
    • Tokens first decide whether to enter the block
    • Then routed to different experts
    • Allows skipping self-attention
  2. Integrated MoDE: Unified routing
    • Single router for both decisions
    • "no-op" experts alongside regular experts
    • Simpler structure
Implementing MoDE in the integrated manner significantly outperformed simply reducing expert capacity in regular MoE models, as tokens explicitly learn to choose the residual path.
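A minimal sketch of the integrated variant is shown below, assuming top-1 routing and treating router index 0 as the no-op expert; the expert count, the softmax gating, and the routing rule are illustrative choices rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class IntegratedMoDELayer(nn.Module):
    """Integrated MoDE sketch: one router chooses among MLP experts or a no-op (skip)."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        # Router output 0 is the "no-op expert": choosing it means the token keeps
        # only its residual value and skips the MLP entirely.
        self.router = nn.Linear(d_model, n_experts + 1)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(self.router(x), dim=-1)   # (batch, seq, n_experts + 1)
        choice = probs.argmax(dim=-1)                    # top-1 routing decision per token
        out = x.clone()
        for e, expert in enumerate(self.experts, start=1):
            mask = choice == e                           # tokens routed to expert e
            if mask.any():
                gate = probs[..., e][mask].unsqueeze(-1)
                out[mask] = x[mask] + gate * expert(x[mask])
        # Tokens whose argmax landed on index 0 took the no-op path: they remain
        # exactly the residual input, at zero extra cost.
        return out

# Usage on a toy batch.
layer = IntegratedMoDELayer(d_model=64, n_experts=4)
print(layer(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64])
```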

Practical Applications in Multimodal Language Models (MLLMs)

p-MoD: Advanced Adaptation for Multimodal Models

The p-MoD model matches or even surpasses baseline models with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and 77.7% GPU hours during training.
These results demonstrate:
  • 44.4% reduction in inference cost
  • 46.2% reduction in memory storage
  • 22.3% reduction in training time

γ-MoD: Intelligent Adaptation

For example, with only a minor 1.5% performance drop, γ-MoD can reduce LLaVA-HR training and inference time by 31% and 53.2% respectively.
γ-MoD uses an innovative metric called ARank (Rank of Attention Maps) to detect which layers have redundancy and should be replaced with MoD layers.
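The precise ARank formula comes from the γ-MoD paper; purely to illustrate the idea, a rough estimate of the numerical rank of a layer's attention maps could look like the sketch below (the tolerance and the averaging over heads are assumptions).

```python
import torch

def attention_map_rank(attn: torch.Tensor, tol: float = 1e-3) -> float:
    """Estimate the numerical rank of attention maps, averaged over heads.

    `attn` is assumed to have shape (n_heads, seq_len, seq_len). The exact ARank
    definition lives in the γ-MoD paper; this only illustrates the underlying idea
    that low-rank attention maps signal a redundant layer.
    """
    singular_values = torch.linalg.svdvals(attn)                    # (n_heads, seq_len)
    normalized = singular_values / singular_values.max(dim=-1, keepdim=True).values
    ranks = (normalized > tol).sum(dim=-1).float()                  # per-head rank
    return ranks.mean().item()

# Layers whose average attention-map rank falls below a chosen threshold would be
# candidates for replacement with MoD layers.
attn = torch.softmax(torch.randn(8, 64, 64), dim=-1)
print(attention_map_rank(attn))
```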

Progressive Ratio Decay (PRD) Strategy

Visual tokens show more redundancy in deeper layers, and therefore a progressive ratio decay (PRD) strategy is designed that gradually reduces the token retention ratio layer by layer.
This means (see the sketch after this list):
  • Early layers: Processing more tokens
  • Middle layers: Gradual token reduction
  • Deep layers: Focus on critical tokens
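As a toy illustration of such a schedule (the linear decay shape and the start/end ratios are assumptions; the actual p-MoD formula may differ):

```python
def retention_ratio(layer_idx: int, n_layers: int, start: float = 1.0, end: float = 0.1) -> float:
    """Linearly decay the fraction of visual tokens kept, from `start` to `end`."""
    frac = layer_idx / max(1, n_layers - 1)
    return start + (end - start) * frac

n_layers = 24
for layer_idx in (0, 8, 16, 23):
    print(layer_idx, f"{retention_ratio(layer_idx, n_layers):.2f}")
# 0 -> 1.00, 8 -> 0.69, 16 -> 0.37, 23 -> 0.10
```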

Performance Comparison: Experimental Results

Results on Language Models

Studies show that:
MoD models can match vanilla transformer models on the training objective while using as little as 50% of the FLOPs per forward pass, and are therefore faster.
Additionally, one can train a MoD transformer that improves up to 1.5% in final log probability objective for equivalent training FLOPs (isoFLOP).

Results on Multimodal Models

The p-MoD model shows comparable or even better performance than baseline models across 14 benchmarks in different domains, with a 46.2% reduction in KV cache storage and 44.4% reduction in TFLOPs during inference.

Challenges and Limitations

Implementation Complexity

Integrating MoD into MLLMs is challenging. To address the resulting stability issues, designs such as the following have been introduced:
  • TanhNorm: Weight normalization with tanh gate
  • STRing: Symmetric token reweighting
These designs are essential for improving training and inference stability.

Need for Precise Training

As noted above, learned routing is essential: models using random routing performed significantly worse. This implies a need for:
  • Precise training of routing mechanism
  • Hyperparameter tuning
  • More initial training time

The Future of MoD and Potential Possibilities

Path Toward AGI

MoD's efficiency gains mean we could train large models, with parameter counts exceeding 3T, that retain more knowledge from their training data and are more effective at problem-solving.
Future potentials:
  • Larger models with better efficiency
  • Mobile device deployment: a realistic path to running a capable language model on a smartphone or personal computer, since the computational requirements are much lower than those of existing transformer models
  • Reduced carbon footprint: Less energy, less pollution

Future Research

MoD raises interesting questions for future research: Analyzing how the model learns to prioritize tokens for processing can provide insights into the model's inner workings and its understanding of language.
Research areas:
  • Long-term memory management
  • Different types of computations
  • Combination with other efficiency techniques
  • Better interpretability

Connection to Related Technologies

MoD exists within a broader ecosystem of AI optimization techniques.

Real-World and Industrial Applications

Advanced Natural Language Processing

MoD can be applied across a wide range of Natural Language Processing applications.

Multimodal Models

In the field of Multimodal AI Models:
  • Simultaneous image and text processing
  • Multimodal content generation
  • Deep semantic understanding

Efficient Machine Learning

MoD also connects to broader Machine Learning and optimization techniques aimed at more efficient training and inference.

Conclusion: A More Efficient Future with MoD

MoD represents a significant advance in efficient language modeling and offers a compelling alternative to traditional transformer architectures. Its ability to dynamically allocate computational resources leads to improved performance, faster inference, and better resource efficiency.
Key Points:
  1. 50% reduction in computational cost while maintaining or improving performance
  2. Intelligent routing where the model learns which tokens are important
  3. Composability with other techniques like MoE
  4. Wide applications from language models to multimodal
As MoD continues to develop and gain adoption, we can expect more efficient, faster, and more accessible artificial intelligence models. This technology is not only critical for advancing artificial intelligence but also essential for building a more sustainable future with lower energy consumption.
The path toward AGI and autonomous artificial intelligence requires such innovations that can combine power with efficiency. MoD demonstrates that we can build models that are not only smarter but also wiser in their use of resources - exactly as the human brain operates with its unparalleled efficiency.