The Depth Delusion (architecture-conditioned scaling)
Evidence that width should grow faster than depth for transformers
Transformers, Scaling Laws, Deep Learning, NLP & LLMs
Problem
Scaling laws are often presented as if model architecture were interchangeable, but the split between depth and width can change outcomes even at a fixed parameter budget.
Approach
This project analyzes a suite of transformer architectures and fits architecture-conditioned scaling laws that explicitly model depth/width dependence.
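To make "architecture-conditioned" concrete, here is a minimal sketch of one way such a fit could look. It assumes a separable power law, loss ≈ A · depth^(-a) · width^(-b), which is an illustrative form chosen for simplicity and not necessarily the one used in the paper; the function name `fit_depth_width_power_law` and all numbers are hypothetical, and the data are synthetic. A form this simple cannot capture effects such as loss rising with depth, so treat it only as a starting point.

```python
import numpy as np


def fit_depth_width_power_law(depth, width, loss):
    """Least-squares fit of log(loss) = log(A) - a*log(depth) - b*log(width).

    Assumed (hypothetical) functional form; returns (A, a, b).
    """
    depth = np.asarray(depth, dtype=float)
    width = np.asarray(width, dtype=float)
    loss = np.asarray(loss, dtype=float)

    # Design matrix: intercept, -log(depth), -log(width).
    X = np.column_stack([np.ones_like(depth), -np.log(depth), -np.log(width)])
    coef, *_ = np.linalg.lstsq(X, np.log(loss), rcond=None)
    log_A, a, b = coef
    return float(np.exp(log_A)), float(a), float(b)


# Synthetic, noise-free data generated from known exponents (not real results).
depths = np.array([2.0, 4.0, 8.0, 16.0, 2.0, 4.0, 8.0, 16.0])
widths = np.array([128.0, 128.0, 128.0, 128.0, 512.0, 512.0, 512.0, 512.0])
true_A, true_a, true_b = 10.0, 0.1, 0.3
losses = true_A * depths ** (-true_a) * widths ** (-true_b)

A_hat, a_hat, b_hat = fit_depth_width_power_law(depths, widths, losses)
print(f"A={A_hat:.3f}, a={a_hat:.3f}, b={b_hat:.3f}")
```

Fitting in log space keeps the problem linear, so an ordinary least-squares solve is enough. With real, noisy runs, a nonlinear fit with an irreducible-loss term would likely be more appropriate.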
Key findings (high-level)
- Optimal width grows substantially faster than optimal depth as compute increases.
- There is a “critical depth” regime where adding layers can increase loss despite adding parameters.
Links
- Paper (arXiv): https://arxiv.org/abs/2601.20994
- PDF: https://arxiv.org/pdf/2601.20994.pdf
Reproducibility checklist (what I aim to provide)
- architecture grid definition (depth/width)
- training recipe + compute accounting
- fitted scaling model + plotting scripts
- full tables for the evaluated architectures
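As a rough illustration of the first two checklist items, the sketch below defines a small depth/width grid and attaches approximate compute accounting. It relies on two common rules of thumb, not on the project's actual recipe: non-embedding parameters ≈ 12 · depth · width² (assuming a standard block with a 4x MLP expansion) and training FLOPs ≈ 6 · parameters · tokens. The names `ArchConfig` and `build_grid` and the grid values are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ArchConfig:
    depth: int  # number of transformer layers
    width: int  # model (hidden) dimension

    def non_embedding_params(self) -> int:
        # Rule of thumb: ~4*width^2 for attention + ~8*width^2 for a 4x MLP, per layer.
        return 12 * self.depth * self.width ** 2

    def training_flops(self, tokens: int) -> int:
        # Rule of thumb: ~6 FLOPs per parameter per training token.
        return 6 * self.non_embedding_params() * tokens


def build_grid(depths, widths):
    """Cartesian product of candidate depths and widths (illustrative values only)."""
    return [ArchConfig(d, w) for d, w in product(depths, widths)]


grid = build_grid(depths=[4, 8, 16], widths=[256, 512])

# Two shapes with the same approximate parameter budget but different depth/width.
deep_narrow = ArchConfig(depth=16, width=256)
shallow_wide = ArchConfig(depth=4, width=512)

for cfg in (deep_narrow, shallow_wide):
    print(cfg, cfg.non_embedding_params(), cfg.training_flops(tokens=1_000_000_000))
```

Under this approximation, quadrupling depth and doubling width cost the same number of parameters, which is why pairs like the two above are useful for comparing shapes at a matched budget.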