The Depth Delusion (architecture-conditioned scaling)

Evidence that width should grow faster than depth for transformers

Transformers · Scaling Laws · Deep Learning · NLP & LLMs

Problem

Scaling laws are often presented as if architecture were interchangeable, but the split between depth and width can change the final loss even at a fixed parameter budget.
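To make "same parameter budget, different shape" concrete, here is a minimal sketch. It assumes the common rule of thumb that a decoder-only transformer has roughly 12 * depth * width^2 non-embedding parameters (feed-forward size of 4 * width, no embeddings, biases, or layer norms). The function name `approx_block_params` and the two example shapes are hypothetical and are not taken from the paper.

```python
def approx_block_params(depth: int, width: int) -> int:
    """Approximate non-embedding parameter count of a decoder-only transformer.

    Assumption: per layer, attention contributes about 4 * width**2 weights and
    the feed-forward block (hidden size 4 * width) about 8 * width**2.
    """
    return 12 * depth * width ** 2


# Two hypothetical shapes: one wide and shallow, one narrow and deep.
wide_shallow = approx_block_params(depth=12, width=1024)
narrow_deep = approx_block_params(depth=48, width=512)

# Both land on the same approximate budget, so a parameter-only scaling law
# cannot tell them apart.
print(wide_shallow, narrow_deep)
```

Halving the width while quadrupling the depth leaves this approximate count unchanged, which is exactly the kind of pair a parameter-only law treats as identical.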

Approach

This project analyzes a suite of transformer architectures and fits architecture-conditioned scaling laws that explicitly model depth/width dependence.
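As a rough illustration of what "architecture-conditioned" can mean, the sketch below fits a separable power law, log(loss) = c - a * log(width) - b * log(depth), by ordinary least squares. This functional form is an assumption chosen for simplicity and is not the paper's fitted model; being monotone in depth, it cannot represent a regime where extra layers raise the loss. The data are synthetic, and all names (`fit_width_depth_power_law`, `solve_linear`, `runs`) are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_width_depth_power_law(runs):
    """Fit log(loss) = c - a*log(width) - b*log(depth) via the normal equations.

    runs: iterable of (width, depth, loss) tuples. Returns (c, a, b).
    """
    rows = [[1.0, -math.log(w), -math.log(d)] for w, d, _ in runs]
    ys = [math.log(loss) for _, _, loss in runs]
    k = 3
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    c, a, b = solve_linear(ata, aty)
    return c, a, b


# Synthetic, noise-free losses generated from known exponents (not real results).
TRUE_A, TRUE_B = 0.30, 0.10
runs = [
    (w, d, 10.0 * w ** -TRUE_A * d ** -TRUE_B)
    for w in (256, 512, 1024, 2048)
    for d in (4, 8, 16, 32)
]
c, a, b = fit_width_depth_power_law(runs)
print(f"width exponent a={a:.3f}, depth exponent b={b:.3f}")
```

Keeping width and depth as separate regressors, rather than collapsing them into a single parameter count, is what lets the fit assign them different exponents.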

Key findings (high-level)

  • Optimal width grows substantially faster than optimal depth as compute increases.
  • There is a “critical depth” regime where adding layers can increase loss despite adding parameters.
  • Paper (arXiv): https://arxiv.org/abs/2601.20994
  • PDF: https://arxiv.org/pdf/2601.20994.pdf

Reproducibility checklist (what I aim to provide)

  • architecture grid definition (depth/width)
  • training recipe + compute accounting
  • fitted scaling model + plotting scripts
  • full tables for the evaluated architectures
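The first two checklist items could take a shape like the sketch below: a depth/width grid in which every cell carries its approximate parameter count and training compute. It assumes the common approximations of 12 * depth * width^2 non-embedding parameters and 6 * params * tokens training FLOPs. The grid values, the token count, and the names `GridCell` and `build_grid` are hypothetical placeholders, not the project's actual configuration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class GridCell:
    depth: int
    width: int
    params: int         # approximate non-embedding parameters
    train_flops: float  # approximate training compute in FLOPs


def build_grid(depths, widths, tokens):
    """Enumerate every (depth, width) pair with rough size and compute estimates."""
    cells = []
    for depth, width in product(depths, widths):
        params = 12 * depth * width ** 2      # assumed parameter approximation
        train_flops = 6.0 * params * tokens   # assumed forward + backward cost
        cells.append(GridCell(depth, width, params, train_flops))
    return cells


# Placeholder grid: 3 depths x 2 widths, each trained on 1e9 tokens.
grid = build_grid(depths=(4, 8, 16), widths=(256, 512), tokens=1_000_000_000)
for cell in grid:
    print(f"{cell.depth:>3} {cell.width:>5} {cell.params:>12,} {cell.train_flops:.2e}")
```

Recording the compute estimate next to each architecture keeps the accounting explicit, so a later fit can be conditioned on compute as well as shape.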