Thank you for tuning in to week 206 of the Lindahl Letter publication. A new edition arrives every Friday. This week the topic under consideration is “The Great Tokenapocalypse.”
As large language models reach deeper into consumer devices, the cost of running them becomes the real bottleneck. So many tokens get burned with no ROI or clear use case for the company burning them; it’s really out of control. Almost as out of control as the sunk cost of data centers that will probably be regretted at some point in the next 5 years. That is the unspoken reality of this arms race: companies are building data centers that simply depreciate and spending compute resources without any plan for recovering the cost. This week explores how token economics is silently shaping the deployment strategies of Google and Apple.
You may have noticed something strange about the rollout of generative AI: despite Google’s global reach and technical infrastructure, Gemini is not yet present on every device. It isn’t quietly running in the background on your Nest Hub, it doesn’t summarize content on your Pixel Watch, and it hasn’t taken over the always-on interactions that dominate the smart home experience. On paper, Gemini could power all of this, but in practice, it doesn’t. The reasons are not technical, but economic.
It’s the tokens. Each time a large language model like Gemini processes a prompt or generates a response, it consumes tokens, which are effectively units of computation that translate directly into cost. This cost is not abstract. It is real-time, metered, and at scale it becomes a continuous, compounding expense. When you ask Gemini to summarize an email or rewrite a paragraph, you’re triggering a live cloud inference cycle that draws directly on Google’s TPU infrastructure. At a small scale, these requests are manageable. But when deployed across millions of devices, in billions of micro-interactions, the financial and infrastructure burden becomes extreme. What looks like product restraint is actually cost containment. Google is avoiding what could become a tokenapocalypse: a runaway escalation of inference demand that outpaces both compute supply and operating budget.
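To make that arithmetic concrete, here is a minimal back-of-envelope sketch in Python. Every number in it is an assumption chosen purely for illustration; none of it reflects Google’s actual pricing, fleet size, or usage patterns.

```python
# Back-of-envelope sketch of how ambient micro-interactions compound into cloud spend.
# All numbers below are illustrative assumptions, not any company's actual figures.

price_per_million_tokens = 0.30       # assumed blended inference cost (USD) per 1M tokens
tokens_per_interaction = 500          # assumed prompt + response tokens for a small summary task
interactions_per_device_per_day = 20  # assumed ambient requests per device per day
devices = 200_000_000                 # assumed fleet size for an always-on rollout

daily_tokens = devices * interactions_per_device_per_day * tokens_per_interaction
daily_cost = daily_tokens / 1_000_000 * price_per_million_tokens
annual_cost = daily_cost * 365

print(f"Tokens per day: {daily_tokens:,.0f}")
print(f"Cost per day:   ${daily_cost:,.0f}")
print(f"Cost per year:  ${annual_cost:,.0f}")
```

With these made-up inputs the fleet burns about two trillion tokens a day, which works out to hundreds of millions of dollars a year for interactions that generate no direct revenue. Swap in your own assumptions and the shape of the problem stays the same.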
Gemini was designed for centralized, high-performance environments. It was not optimized for low-power edge devices or offline operation. Its rollout has been concentrated in strategic, high-leverage use cases: Workspace productivity, Pixel exclusives, and experimental features inside Search Labs. These are high-value zones where the cost per token can be justified. Gemini has not been deployed ambiently in the wild on smart speakers, in Android Auto, or on lightweight wearables, mostly because those endpoints offer little to no margin against token cost. The model cannot run constantly without triggering runaway cloud expenditure. Until inference becomes drastically cheaper or edge-native Gemini variants emerge, Google is likely to continue rationing its deployment to protect against economic overextension.
Apple, by contrast, has chosen an entirely different path forward, one that avoids the token problem from the outset. Its 2024 rollout of “Apple Intelligence” emphasized a local-first architecture built around on-device models. Instead of sending every prompt to the cloud, Apple routes the vast majority of inference through its A-series and M-series silicon. This strategy means that users can rewrite notes, summarize messages, or interact with Siri entirely offline, with zero token cost to Apple. When tasks exceed the capability of local models, they are sent to Apple’s “Private Cloud Compute” system, but this fallback is used selectively, with strict privacy and latency guarantees.
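As a rough illustration of what that local-first routing decision looks like, here is a minimal sketch in Python. The thresholds, capability flags, and the route_request function are hypothetical stand-ins; Apple’s actual on-device and Private Cloud Compute logic is not public in this form.

```python
# Hypothetical sketch of a local-first inference router with a selective cloud fallback.
# The limits, flags, and function names are illustrative assumptions, not Apple's implementation.

from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    needs_long_context: bool = False     # e.g., summarizing a very long document
    needs_world_knowledge: bool = False  # e.g., open-ended questions beyond local scope

LOCAL_PROMPT_LIMIT = 2_000  # assumed prompt length the on-device model handles comfortably

def route_request(req: Request) -> str:
    """Return 'on_device' when the local model can handle the task, else 'private_cloud'."""
    too_long = len(req.prompt) > LOCAL_PROMPT_LIMIT
    if req.needs_long_context or req.needs_world_knowledge or too_long:
        return "private_cloud"  # selective fallback: incurs cloud cost, stricter guarantees
    return "on_device"          # default path: zero marginal token cost to the platform

# A short rewrite stays local; a long-context summary falls back to the cloud.
print(route_request(Request(prompt="Rewrite this note to sound friendlier.")))
print(route_request(Request(prompt="...", needs_long_context=True)))
```

The design choice to make the cloud the exception rather than the default is what keeps the token bill flat as the number of devices grows.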
Apple’s approach isn’t just a branding play. It reflects a fundamental architectural decision to avoid the economics of inference altogether. Apple doesn’t operate a hyperscale public cloud business, so it has no incentive to absorb or monetize cloud-based generative AI usage. Its profits come from hardware margins and platform services. This gives Apple the freedom to constrain usage, limit interaction complexity, and push AI to the edge, a strategy it can pursue without incurring the compounding costs that Google faces. It’s a token-avoidant strategy, and it may prove to be the more sustainable one.
Where Google builds outward from a full-stack cloud foundation, Apple builds inward from a controlled edge. Google’s strategy scales across models and modalities, but each expansion amplifies cost. Apple’s strategy constrains functionality but keeps economics stable. Both are reacting to the same underlying pressure: token costs are rising faster than monetization models can support. The more embedded the model becomes, the more tokens flow, and the more urgent it becomes to rethink deployment patterns. This isn’t just a question of technical feasibility. It’s a matter of financial survivability.
The race to deploy generative AI at scale is quickly becoming a race to control token exposure. Inference cost, not model quality, may be the key determinant of which platforms can sustainably integrate AI across the stack. If cloud economics don’t shift, and if token optimization doesn’t advance, then ambient LLMs may remain a luxury reserved for premium endpoints and enterprise tasks. The real future of ubiquitous AI may depend less on how powerful models become, and more on how efficiently they run in the wild.
Things to consider:
Google’s restraint in deploying Gemini across its device ecosystem likely reflects real-time token cost constraints rather than technical limits.
Every cloud-based Gemini interaction consumes metered compute, making global deployment economically unstable without stronger monetization.
Apple avoids these problems by designing for on-device inference and constraining AI functionality to remain token-light.
Token economics are now shaping the strategic posture of every major platform, defining where and how AI appears in consumer workflows.
Sustained deployment of generative models may depend less on breakthrough architecture and more on advances in inference efficiency and local compute.
As the tokenapocalypse looms, we’ll be watching how companies respond, whether through model compression, edge acceleration, hybrid routing, or new monetization strategies. In the coming weeks, we’ll explore how these constraints are shaping research priorities, ecosystem fragmentation, and what it means to run AI sustainably across global networks. If you see an AI endpoint that should exist but doesn’t, it may be because someone, somewhere, did the token math.
What’s next for the Lindahl Letter? New editions arrive every Friday. If you are still listening at this point and enjoyed this content, then please take a moment and share it with a friend. If you are new to the Lindahl Letter, then please consider subscribing. Make sure to stay curious, stay informed, and enjoy the week ahead!