Listen now | Recent theoretical work argues that much of what is attributed to depth in modern neural networks can be explained by nested optimization dynamics, challenging common assumptions
This nested optimization angle is way more important than people realize. I've debugged training runs where we'd add layers expecting qualitative jumps in representation learning, only to see marginal improvements that didn't justify the compute cost. The paper's point about path dependence in training dynamics matches what I've seen when trying to reproduce SOTA results with slight config changes. If depth mostly buys us optimization stability rather than expressiveness, that changes ROI calculations for infrastructure spend pretty drastically.