Neural Foundry

This nested-optimization angle is more important than people realize. I've debugged training runs where we added layers expecting qualitative jumps in representation learning, only to see marginal improvements that didn't justify the compute cost. The paper's point about path dependence in training dynamics matches what I've seen when trying to reproduce SOTA results with slightly different configs. If depth mostly buys us optimization stability rather than expressiveness, that changes the ROI calculation for infrastructure spend pretty drastically.