The Depth Delusion (architecture-conditioned scaling)
Evidence that width should grow faster than depth for transformers
Transformers · Scaling Laws · Deep Learning · NLP & LLMs
Problem
Scaling laws are often presented as if model architecture were interchangeable, but the choice of depth versus width can change outcomes even at a fixed parameter budget.
Approach
This project analyzes a suite of transformer architectures and fits architecture-conditioned scaling laws that explicitly model depth/width dependence.
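As a rough illustration of what "architecture-conditioned" means here, the sketch below fits a loss surface with separate width and depth exponents. The functional form, constants, and data points are assumptions for illustration, not the paper's fitted model.

```python
# Minimal sketch, assuming a loss surface with separate width/depth exponents:
#   L(w, d) = E + A / (w**alpha * d**beta)
# (an illustrative form, not necessarily the paper's). With non-embedding
# parameters roughly 12 * d * w**2, a fit with alpha/2 > beta means that at a
# fixed parameter count, shifting capacity from depth into width lowers the
# predicted loss.
import numpy as np
from scipy.optimize import curve_fit

def arch_conditioned_loss(x, E, A, alpha, beta):
    width, depth = x
    return E + A / (width**alpha * depth**beta)

# Hypothetical grid results: (width, depth, final validation loss).
runs = np.array([
    (512,  6, 3.42), (512, 12, 3.31), (512, 24, 3.27),
    (1024, 6, 3.18), (1024, 12, 3.05), (1024, 24, 3.01),
    (2048, 6, 2.97), (2048, 12, 2.84), (2048, 24, 2.81),
])
widths, depths, losses = runs.T

params, _ = curve_fit(
    arch_conditioned_loss, (widths, depths), losses,
    p0=(2.0, 50.0, 0.5, 0.5), maxfev=20000,
)
E, A, alpha, beta = params
print(f"E={E:.3f}  A={A:.1f}  alpha={alpha:.3f}  beta={beta:.3f}")
```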
Key findings (high-level)
- Optimal width grows substantially faster than optimal depth as compute increases.
- There is a “critical depth” regime where adding layers can increase loss despite adding parameters.
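A back-of-the-envelope way to see how a critical depth can arise: under an assumed loss surface (the multiplicative form from the sketch above plus a hypothetical depth-penalty term; all constants illustrative, not the paper's fit), sweep depth at a fixed parameter budget and find where the predicted loss stops improving.

```python
# Sketch of a critical-depth check under an assumed loss surface
#   L(w, d) = E + A / (w**alpha * d**beta) + C * d / w   (hypothetical form)
# with the rough transformer parameter count N ≈ 12 * d * w**2 held fixed.
import numpy as np

E, A, alpha, beta, C = 1.8, 40.0, 0.6, 0.4, 0.5   # illustrative constants
N_budget = 1e9                                     # fixed parameter budget

depths = np.arange(2, 129, 2)
widths = np.sqrt(N_budget / (12 * depths))         # keep N roughly constant
losses = E + A / (widths**alpha * depths**beta) + C * depths / widths

critical = depths[np.argmin(losses)]
print(f"predicted loss bottoms out near depth {critical}; "
      "deeper models at this budget have higher predicted loss")
```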
Links
- Paper (arXiv): https://arxiv.org/abs/2601.20994
- PDF: https://arxiv.org/pdf/2601.20994.pdf
Reproducibility checklist (what I aim to provide)
- architecture grid definition (depth/width); a minimal sketch follows this list
- training recipe + compute accounting
- fitted scaling model + plotting scripts
- full tables for the evaluated architectures
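For the architecture grid item, here is a minimal sketch of what a grid definition could look like. The depth/width values, head size, and naming are hypothetical placeholders; the released grid may differ.

```python
# Hypothetical architecture grid: every (depth, width) pair to be trained,
# with head width fixed at 64 so num_heads = width // 64.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ArchConfig:
    depth: int      # number of transformer blocks
    width: int      # model (residual stream) dimension
    head_dim: int = 64

    @property
    def num_heads(self) -> int:
        return self.width // self.head_dim

    @property
    def approx_params(self) -> int:
        # rough non-embedding parameter count: 12 * depth * width^2
        return 12 * self.depth * self.width ** 2

DEPTHS = (2, 4, 8, 16, 32, 64)
WIDTHS = (256, 512, 1024, 2048)
GRID = [ArchConfig(d, w) for d, w in product(DEPTHS, WIDTHS)]

for cfg in GRID[:3]:
    print(cfg, f"~{cfg.approx_params / 1e6:.1f}M non-embedding params")
```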