Mixture-of-Experts
7 mentions across all digests
Mixture-of-Experts (MoE) is a neural network architecture that routes each input to a small subset of specialized subnetworks ("experts"), so only a fraction of the total parameters is active per token. It is used in models like Gemma 4 (26B-A4B MoE) and studied for how expert load balancing evolves through three distinct training phases.
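To make the routing idea concrete, here is a minimal NumPy sketch of a top-k MoE layer; the dimensions, expert count, and top-k value are illustrative, not drawn from any model listed below.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 2

# Router: a learned linear layer that scores each expert per token.
W_router = rng.normal(size=(d_model, n_experts))
# Experts: small feed-forward nets (single matrices here for brevity).
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ W_router                            # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # indices of top-k experts
    sel = np.take_along_axis(logits, top, axis=-1)   # logits of selected experts
    gates = np.exp(sel) / np.exp(sel).sum(-1, keepdims=True)  # softmax over top-k
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                      # per-token dispatch
        for slot in range(top_k):
            e = top[t, slot]
            out[t] += gates[t, slot] * (x[t] @ experts[e])
    return out

tokens = rng.normal(size=(8, d_model))
print(moe_layer(tokens).shape)  # (8, 16): same shape, only 2 of 4 experts ran per token
```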
Which one is more important: more parameters or more computation? (2021)
This ParlAI work decouples model parameters from computation: hash-based MoE routing scales capacity without added compute, while staircase attention increases compute without adding parameters, and the two yield orthogonal gains when combined.
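A minimal sketch of the hash-routing idea, assuming the common formulation where a fixed function of the token id picks the expert (the modulo below is a stand-in for the paper's fixed random hash):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_experts, vocab = 16, 4, 1000
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def hash_route(token_ids):
    """A fixed hash of the token id picks the expert: no router weights,
    no routing FLOPs beyond a modulo."""
    return token_ids % n_experts  # stand-in for a fixed random mapping

def hash_moe(token_ids, x):
    assignments = hash_route(token_ids)
    out = np.empty_like(x)
    for e in range(n_experts):        # batch all tokens assigned to expert e
        mask = assignments == e
        if mask.any():
            out[mask] = x[mask] @ experts[e]
    return out

ids = rng.integers(0, vocab, size=8)
hidden = rng.normal(size=(8, d_model))
print(hash_moe(ids, hidden).shape)  # (8, 16); more experts adds parameters, not per-token compute
```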
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
The Expert Upcycling technique initializes MoE experts from existing pretrained weights, shifting the compute-efficient frontier and enabling cheaper inference through smarter expert reuse and routing.
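Upcycling recipes differ between papers; the sketch below assumes the common variant that initializes every expert as a lightly perturbed copy of a pretrained dense FFN, so the MoE starts out reproducing the dense model:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_ff, n_experts = 16, 64, 4

# Pretrained dense FFN weights (stand-ins for a real checkpoint).
W_in_dense = rng.normal(size=(d_model, d_ff))
W_out_dense = rng.normal(size=(d_ff, d_model))

def upcycle(W_in, W_out, n_experts, noise=0.0):
    """Initialize each expert as a copy of the dense FFN, optionally
    perturbed so experts can specialize during continued training."""
    return [
        (W_in + noise * rng.normal(size=W_in.shape),
         W_out + noise * rng.normal(size=W_out.shape))
        for _ in range(n_experts)
    ]

experts = upcycle(W_in_dense, W_out_dense, n_experts, noise=0.01)

# At initialization each expert approximately matches the dense net, so
# routing a token anywhere roughly reproduces the dense model's output.
x = rng.normal(size=(1, d_model))
dense_out = np.maximum(x @ W_in_dense, 0) @ W_out_dense
expert_out = np.maximum(x @ experts[0][0], 0) @ experts[0][1]
print(np.abs(dense_out - expert_out).max())  # small: experts start near the dense FFN
```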
Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training
Research reveals how Mixture-of-Experts models optimize expert routing and load balancing across three predictable training phases, demystifying scaling dynamics in modern LLMs.
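The quantity typically tracked in such load-balance studies, and the Switch-Transformer-style auxiliary loss often used to regularize it, can be sketched as follows (a minimal example, not the study's exact setup):

```python
import numpy as np

def load_balance_loss(router_logits, top1_assignment, n_experts):
    """Switch-Transformer-style auxiliary loss: n * sum_i f_i * P_i, where
    f_i is the fraction of tokens routed to expert i and P_i is the mean
    router probability mass on expert i. It reaches its minimum of 1.0
    when both distributions are uniform at 1/n."""
    probs = np.exp(router_logits)
    probs /= probs.sum(-1, keepdims=True)   # softmax over experts, (tokens, n)
    f = np.bincount(top1_assignment, minlength=n_experts) / len(top1_assignment)
    P = probs.mean(axis=0)
    return n_experts * float(f @ P)

rng = np.random.default_rng(3)
logits = rng.normal(size=(512, 8))
assign = logits.argmax(-1)                  # top-1 routing
print(load_balance_loss(logits, assign, 8)) # ~1.0 indicates well-balanced experts
```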
Gemma 4: Byte for byte, the most capable open models
Google DeepMind released Gemma 4, a family of four Apache 2.0-licensed multimodal models (up to 31B parameters) that support image, video, and audio input and improve parameter efficiency through Per-Layer Embeddings.
A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026
Raschka surveys 10 open-weight LLM architectures from Jan-Feb 2026 (Arcee, Moonshot, Qwen, Cohere) spanning 3B to 1T parameters, revealing divergent design choices in MoE configs and efficiency strategies.