ParlAI research shows that model parameters and computation can be decoupled in deep learning. Hash layers use simple hashing-based mixture-of-experts routing to increase model capacity without additional computation, while staircase attention re-applies the same Transformer layers to shifted segments of the input to increase computation without adding parameters. The two approaches yield orthogonal improvements when combined.
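To make the routing idea concrete, here is a minimal PyTorch sketch of a hash-layer style feed-forward block: each token is sent to one expert by a fixed, parameter-free function of its token id (a simple modulo stands in for the fixed hash used in the actual work), so adding experts grows capacity while per-token compute stays that of a single expert. The class and argument names are illustrative, not ParlAI's API.

```python
import torch
import torch.nn as nn

class HashRoutedFFN(nn.Module):
    """Mixture-of-experts feed-forward block with fixed hash routing (sketch)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.num_experts = num_experts
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        # Fixed, parameter-free routing: a token always maps to the same expert.
        expert_idx = token_ids % self.num_experts  # stand-in for a fixed hash table
        out = torch.zeros_like(hidden)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = expert(hidden[mask])  # only one expert runs per token
        return out
```

The staircase idea can be sketched in the same spirit: one shared Transformer layer is applied repeatedly to a window that shifts forward by one chunk, carrying the previous chunk's hidden states along, so tokens receive extra computation without any new parameters. This is a loose approximation of the shifted-alignment scheme, not the exact schedule from the paper.

```python
def staircase_forward(layer: nn.TransformerEncoderLayer,
                      x: torch.Tensor, chunk: int) -> torch.Tensor:
    # x: (batch, seq, d_model); re-apply one shared layer over staggered windows.
    outputs, carry = [], None
    for start in range(0, x.size(1), chunk):
        cur = x[:, start:start + chunk]
        window = cur if carry is None else torch.cat([carry, cur], dim=1)
        out = layer(window)                # same parameters, extra forward passes
        carry = out[:, -cur.size(1):]      # keep the new chunk's states as context
        outputs.append(carry)
    return torch.cat(outputs, dim=1)

# Example usage (shapes and hyperparameters are illustrative):
# ffn = HashRoutedFFN(d_model=512, d_ff=2048, num_experts=8)
# layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
# x, ids = torch.randn(2, 64, 512), torch.randint(0, 30000, (2, 64))
# y = ffn(x, ids); z = staircase_forward(layer, x, chunk=16)
```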
Research
Which one is more important: more parameters or more computation? (2021)
This ParlAI work decouples model parameters from computation: hash-based MoE routing scales capacity without added compute, while staircase attention increases compute without new parameters, with orthogonal gains when combined.
Sunday, April 26, 2026, 12:00 PM UTC · 2 min read · Source: Hacker News
Tags
research