Nested learning and the illusion of depth

Recent theoretical work argues that much of what is attributed to depth in modern neural networks can be explained by nested optimization dynamics, challenging core assumptions about what depth actually contributes

Thank you for tuning in to week 218 of the Lindahl Letter publication. A new edition arrives every Friday. This week the topic under consideration for the Lindahl Letter is, “Nested learning and the illusion of depth.”

Just for fun with this nested learning paper we are evaluating today, I downloaded the 52-page PDF and uploaded it to my Google Drive to have Gemini create an audio overview of the paper. That is just a one-button request these days. We have reached a point where we can easily listen to a paper recap with very little friction. It’s actually harder to get a complete reading of the PDF as an audio file. I tried the Adobe Acrobat read aloud feature and I don’t really like the robotic output. Sometimes I would rather listen to a paper than read it when I am trying to really think deeply about something. The 5 minutes of podcast audio Gemini spit out about the paper is embedded below. It’s interesting, to say the least, how quickly Gemini turned that paper into a short podcast. It’s entirely possible that my analysis might be less entertaining than the podcast Gemini created on the fly. You will be the judge of that one.

[Embedded audio: Gemini-generated overview, 4:46]

This is a paper I actually printed out 2 pages per page using the double-sided setting. That is how I used to read papers during graduate school. This paper had a few color elements, which is something my graduate school papers never really had. They were all monochromatic. I had to put on my reading glasses and hold the paper a little closer than I used to with the 2-pages-per-page printing. I’ll have to remember to just print using single-page spacing next time around. I really only print out papers I want to keep in my stack of stuff. This one certainly fits that criterion.

Trying to make content that is accessible is one of the reasons that I have been recording audio for the Lindahl Letter. Sometimes listening to something is a great unlock. Other times, due to the complexity and the diagrams included, you just have to read academic papers. I try to bring things forward without complex charts in a highly consumable way. My take on research notes is that they need to be generally understandable and communicate a clear take on whatever topic is being covered. The content has to be condensed into something that can be considered in 5-10 minutes. To that end, I’m going to do my best to bring this paper on nested learning to life today.

This paper matters, it really does, because the research presented undermines one of the core assumptions driving modern AI investment and the endless LLM building and training that has been occurring, namely that stacking more layers reliably produces qualitatively better intelligence [1]. The mantra to just keep scaling may well fade away. If many so-called deep models collapse into shallow equivalents during training, then reported gains attributed to architectural depth may instead be artifacts of data scale, regularization, or optimization heuristics rather than true representational progress.
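To make that collapse intuition concrete, here is a small toy sketch of my own in numpy, not the paper’s formalism: a stack of purely linear layers, no matter how many are added, is mathematically identical to a single matrix, so all of its apparent depth folds into one shallow map.

```python
import numpy as np

# Toy illustration of "collapsing into a shallow equivalent": a stack of
# purely linear layers (no nonlinearities) is exactly one matrix in disguise.
rng = np.random.default_rng(42)
layers = [rng.normal(size=(16, 16)) for _ in range(8)]  # eight "deep" linear layers

def deep_forward(x):
    for W in layers:
        x = W @ x  # apply each layer in turn, with no nonlinearity between them
    return x

# Fold the entire stack into a single equivalent weight matrix.
W_shallow = np.eye(16)
for W in layers:
    W_shallow = W @ W_shallow

x = rng.normal(size=16)
print(np.allclose(deep_forward(x), W_shallow @ x))  # True: the depth added nothing
```

Real networks have nonlinearities, so they do not collapse this literally; the concern raised here is the subtler case where training dynamics, rather than the architecture diagram, determine how much of the nominal depth is actually doing representational work.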

This has direct implications for benchmarking, since comparisons that reward parameter count or depth risk overstating advances that do not translate into more robust reasoning or generalization. It also affects hardware and infrastructure strategy, because enormous resources are being allocated to support depth that may not deliver proportional returns. At a deeper level, the result forces a reconsideration of what meaningful learning progress actually looks like, shifting attention from surface complexity toward mechanisms that introduce genuinely new inductive structure and adaptive behavior.

The long-term impact of this callout is likely to be gradual rather than abrupt, but it meaningfully shifts the intellectual ground beneath current AI narratives [1]. The paper in question provides a formal vocabulary for a concern many researchers have held intuitively: that architectural depth has become a proxy metric for progress rather than a principled design choice. Over time, this reframing may influence how serious research groups evaluate models, placing more weight on identifiably distinct learning mechanisms, training dynamics, and robustness properties instead of raw scale.

It is unlikely to immediately change the minds of investors or vendors whose incentives favor larger systems, but it can shape academic norms, reviewer expectations, and eventually benchmark construction. Historically, results like this matter most not because they halt a paradigm, but because they constrain it, narrowing the space of credible claims and forcing future advances to justify themselves on grounds other than appearance and size.

This argument intersects directly with my broader concerns about interpretability and generalization. I am still curious about creating a combiner model, but this might change the mechanics of how that would ultimately work. If performance gains arise primarily from optimization dynamics rather than architectural expressivity, then claims about learned representations should be treated with caution. Apparent abstraction may not correspond to stable semantic structure but to transient equilibria shaped by training order, learning rates, and implicit regularization. This aligns with growing skepticism about whether large models truly learn hierarchical concepts or merely approximate them through iterative adjustment [2].

The implications extend beyond theory. Nested learning reframes debates about model scaling, architectural novelty, and transfer learning. It suggests that progress may come less from ever deeper networks and more from better understanding and controlling learning dynamics. This has practical consequences for reproducibility, safety, and deployment, since nested optimization can introduce path dependence and sensitivity to training regimes that are difficult to observe or audit.
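For a concrete mental model of what nested optimization could look like, here is a toy sketch of my own in numpy, not the construction from the paper: two parameter groups share one objective but update at different frequencies, a fast inner level every step and a slow outer level only every K steps, so how the solution ends up split between the two levels depends on the schedule rather than on the architecture.

```python
import numpy as np

# Toy sketch of nested update frequencies: "fast" weights update every step,
# "slow" weights update only every K steps, on a simple streaming regression.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
slow = np.zeros(3)  # outer level, updated rarely
fast = np.zeros(3)  # inner level, updated every step

K, lr_fast, lr_slow = 20, 0.02, 0.05
for step in range(2000):
    x = rng.normal(size=3)
    y = true_w @ x
    pred = (slow + fast) @ x      # the two levels contribute additively
    grad = 2.0 * (pred - y) * x   # the same gradient applies to both groups here
    fast -= lr_fast * grad        # inner loop: every step
    if step % K == 0:
        slow -= lr_slow * grad    # outer loop: every K steps

print("slow:", np.round(slow, 2))
print("fast:", np.round(fast, 2))
print("combined:", np.round(slow + fast, 2), "target:", true_w)
```

The combined model fits the target either way; what the schedule controls is how the solution gets divided between the two levels, which is exactly the kind of path-dependent, training-time structure that is hard to observe or audit from the outside.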

In the broader context of the AI marketplace, this work reinforces a recurring theme. Fluency and performance do not necessarily imply understanding. As with recent neuroscience critiques of language models, nested learning highlights how impressive outputs can emerge from mechanisms that lack stable, interpretable internal structure [3]. That gap matters when systems are deployed in high stakes environments where reliability, robustness, and reasoning are essential.

We will see how this plays out in 2026 and what new research will ultimately shift the landscape.

Footnotes:

[1] Behrouz, A., Razaviyayn, M., Zhong, P., & Mirrokni, V. “Nested learning: The illusion of deep learning architectures.” Advances in Neural Information Processing Systems 39 (2025). https://abehrouz.github.io/files/NL.pdf

[2] Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., & Sohl-Dickstein, J. “On the expressive power of deep neural networks.” Proceedings of the 34th International Conference on Machine Learning (2017). https://arxiv.org/abs/1606.05336

[3] Riley, B. “Large language mistake: Cutting edge research shows language is not the same as intelligence.” The Verge (2025). https://www.theverge.com/ai-artificial-intelligence/827820/large-language-models-ai-intelligence-neuroscience-problems

What’s next for the Lindahl Letter? New editions arrive every Friday. If you are still listening at this point and enjoyed this content, then please take a moment and share it with a friend. If you are new to the Lindahl Letter, then please consider subscribing. Make sure to stay curious, stay informed, and enjoy the week ahead!
