NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute

Ensemble distillation lets 1.8B-parameter models match traditional scaling performance on 10x less data, challenging the assumption that training data must scale in proportion to model size.

Thursday, March 19, 2026, 12:00 PM UTC · 2 min read · Source: Hacker News

Researchers achieved 10x data efficiency with NanoGPT by combining model ensembling, chain distillation, and aggressive regularization. An ensemble of 1.8B-parameter models trained on 100M tokens matches the performance that would normally require 1B tokens, challenging conventional scaling laws.
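The article does not include code, but the recipe it names (average an ensemble of teachers, distill into one student, regularize heavily) maps onto a standard distillation loop. Below is a minimal PyTorch sketch of that idea; the loss weighting alpha, the probability-space averaging, and the weight-decay value are illustrative assumptions, not details reported in the source.

import torch
import torch.nn.functional as F

def ensemble_probs(teachers, tokens):
    """Average the predictive distributions of K independently trained
    teacher models (probability-space averaging is one common choice)."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(t(tokens), dim=-1) for t in teachers])
    return probs.mean(dim=0)  # (batch, seq, vocab)

def distill_step(student, teachers, tokens, targets, optimizer, alpha=0.5):
    """One distillation step: mix the usual next-token cross-entropy with
    a KL term pulling the student toward the ensemble average."""
    logits = student(tokens)
    hard = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    soft = F.kl_div(F.log_softmax(logits, dim=-1),
                    ensemble_probs(teachers, tokens),
                    reduction="batchmean")
    loss = alpha * hard + (1.0 - alpha) * soft
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Aggressive regularization: an unusually large weight_decay on the student,
# e.g. torch.optim.AdamW(student.parameters(), lr=3e-4, weight_decay=0.5).
# "Chain distillation" (assumed reading): once the student converges, it
# joins or replaces the teacher set and the distillation round is repeated.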
Tags: models