The Depth Delusion (architecture-conditioned scaling)
Evidence that width should grow faster than depth for transformers
Transformers · Scaling Laws · Deep Learning · NLP & LLMs
Problem
Scaling laws are often presented as if model architecture were interchangeable, but the choice of depth versus width can change outcomes even at a fixed parameter budget.
Approach
This project analyzes a suite of transformer architectures and fits architecture-conditioned scaling laws that explicitly model depth/width dependence.
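As a rough illustration of what "architecture-conditioned" means here, the sketch below fits a loss surface with separate width and depth exponents. The functional form, constants, and data points are assumptions for illustration, not the paper's fitted model.

```python
# Minimal sketch, assuming a loss surface with separate width/depth exponents:
#   L(w, d) = E + A / (w**alpha * d**beta)
# (an illustrative form, not necessarily the paper's). With non-embedding
# parameters roughly 12 * d * w**2, a fit with alpha/2 > beta means that at a
# fixed parameter count, shifting capacity from depth into width lowers the
# predicted loss.
import numpy as np
from scipy.optimize import curve_fit

def arch_conditioned_loss(x, E, A, alpha, beta):
    width, depth = x
    return E + A / (width**alpha * depth**beta)

# Hypothetical grid results: (width, depth, final validation loss).
runs = np.array([
    (512,  6, 3.42), (512, 12, 3.31), (512, 24, 3.27),
    (1024, 6, 3.18), (1024, 12, 3.05), (1024, 24, 3.01),
    (2048, 6, 2.97), (2048, 12, 2.84), (2048, 24, 2.81),
])
widths, depths, losses = runs.T

params, _ = curve_fit(
    arch_conditioned_loss, (widths, depths), losses,
    p0=(2.0, 50.0, 0.5, 0.5), maxfev=20000,
)
E, A, alpha, beta = params
print(f"E={E:.3f}  A={A:.1f}  alpha={alpha:.3f}  beta={beta:.3f}")
```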
Key findings (high-level)
- Optimal width grows substantially faster than optimal depth as compute increases.
- There is a “critical depth” regime where adding layers can increase loss despite adding parameters.
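A back-of-the-envelope way to see how a critical depth can arise: under an assumed loss surface (the multiplicative form from the sketch above plus a hypothetical depth-penalty term; all constants illustrative, not the paper's fit), sweep depth at a fixed parameter budget and find where the predicted loss stops improving.

```python
# Sketch of a critical-depth check under an assumed loss surface
#   L(w, d) = E + A / (w**alpha * d**beta) + C * d / w   (hypothetical form)
# with the rough transformer parameter count N ≈ 12 * d * w**2 held fixed.
import numpy as np

E, A, alpha, beta, C = 1.8, 40.0, 0.6, 0.4, 0.5   # illustrative constants
N_budget = 1e9                                     # fixed parameter budget

depths = np.arange(2, 129, 2)
widths = np.sqrt(N_budget / (12 * depths))         # keep N roughly constant
losses = E + A / (widths**alpha * depths**beta) + C * depths / widths

critical = depths[np.argmin(losses)]
print(f"predicted loss bottoms out near depth {critical}; "
      "deeper models at this budget have higher predicted loss")
```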
Links
- Paper (arXiv): https://arxiv.org/abs/2601.20994
- PDF: https://arxiv.org/pdf/2601.20994.pdf
Reproducibility checklist (what I aim to provide)
- architecture grid definition (depth/width); a minimal sketch follows this list
- training recipe + compute accounting
- fitted scaling model + plotting scripts
- full tables for the evaluated architectures
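For the architecture grid item, here is a minimal sketch of what a grid definition could look like. The depth/width values, head size, and naming are hypothetical placeholders; the released grid may differ.

```python
# Hypothetical architecture grid: every (depth, width) pair to be trained,
# with head width fixed at 64 so num_heads = width // 64.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ArchConfig:
    depth: int      # number of transformer blocks
    width: int      # model (residual stream) dimension
    head_dim: int = 64

    @property
    def num_heads(self) -> int:
        return self.width // self.head_dim

    @property
    def approx_params(self) -> int:
        # rough non-embedding parameter count: 12 * depth * width^2
        return 12 * self.depth * self.width ** 2

DEPTHS = (2, 4, 8, 16, 32, 64)
WIDTHS = (256, 512, 1024, 2048)
GRID = [ArchConfig(d, w) for d, w in product(DEPTHS, WIDTHS)]

for cfg in GRID[:3]:
    print(cfg, f"~{cfg.approx_params / 1e6:.1f}M non-embedding params")
```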