NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute

Ensemble distillation lets 1.8B-parameter models match traditional scaling performance on 10x less data, challenging the assumption that training data must scale in proportion to model size.

Thursday, March 19, 2026, 12:00 PM UTC · 2 min read · Source: Hacker News

Researchers achieved 10x data efficiency with NanoGPT by combining model ensembling, chain distillation, and aggressive regularization. An ensemble of 1.8B-parameter models trained on 100M tokens matches the performance that would normally require 1B tokens, challenging conventional scaling laws.
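The article does not include code, but the recipe it names (average an ensemble of teachers, distill into one student, regularize heavily) maps onto a standard distillation loop. Below is a minimal PyTorch sketch of that idea; the loss weighting alpha, the probability-space averaging, and the weight-decay value are illustrative assumptions, not details reported in the source.

import torch
import torch.nn.functional as F

def ensemble_probs(teachers, tokens):
    """Average the predictive distributions of K independently trained
    teacher models (probability-space averaging is one common choice)."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(t(tokens), dim=-1) for t in teachers])
    return probs.mean(dim=0)  # (batch, seq, vocab)

def distill_step(student, teachers, tokens, targets, optimizer, alpha=0.5):
    """One distillation step: mix the usual next-token cross-entropy with
    a KL term pulling the student toward the ensemble average."""
    logits = student(tokens)
    hard = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    soft = F.kl_div(F.log_softmax(logits, dim=-1),
                    ensemble_probs(teachers, tokens),
                    reduction="batchmean")
    loss = alpha * hard + (1.0 - alpha) * soft
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Aggressive regularization: an unusually large weight_decay on the student,
# e.g. torch.optim.AdamW(student.parameters(), lr=3e-4, weight_decay=0.5).
# "Chain distillation" (assumed reading): once the student converges, it
# joins or replaces the teacher set and the distillation round is repeated.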
Tags: models