Signs of the triple-exponential takeoff: DeepSeek's Christmas shocker on hardware utilization
This December was expected to be a hotly contested month in the LLM space: Google released its new Gemini 2.0 model in tandem with Project Mariner, a web-enabled agent powered by Gemini, and OpenAI announced an incredible new set of results for its o3 class of reasoning-focused models, showing an enormous jump in completion rate on the Abstraction and Reasoning Corpus (ARC) benchmark.
These are both tremendous achievements, showcasing the ability of the industry behemoths to continue the relentless advance towards higher intelligence and greater utility. However, neither advance was especially unexpected. The biggest surprise of the last few weeks came instead from DeepSeek, a Chinese firm that dropped an absolute stunner of a technical report and caught nearly everyone off guard.
DeepSeek claimed to have reproduced top-tier performance with its DeepSeek v3 model (DSv3), besting Meta's Llama 3.1 405B while consuming roughly 10x fewer training resources, by paying careful attention to the timing, precision, and ordering of the major calculations in the pretraining process. There were major architectural differences too (multi-head latent attention, a wide MoE), but I won't go into detail on those. Nor will I attempt an in-depth description of their training approach; there are more qualified commentators who can dissect the specifics of how they used the export-crippled H800 chip so successfully.
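To give a flavor of the kind of precision discipline involved, here is a minimal, generic mixed-precision training step in PyTorch. This is only a sketch of the broad family of techniques at play, not DeepSeek's recipe (which reportedly pushes parts of the pipeline down to FP8 on the H800); the layer sizes and hyperparameters below are arbitrary placeholders.

```python
# Illustrative only: a generic mixed-precision training step in PyTorch.
# The pattern: run matmuls in reduced precision, keep master weights and the
# optimizer update in fp32, and scale the loss so tiny gradients don't underflow.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # loss scaling for fp16

x = torch.randn(8, 1024, device=device)
target = torch.randn(8, 1024, device=device)

optimizer.zero_grad(set_to_none=True)
# Matmuls inside this context run in reduced precision; parameters stay in fp32.
with torch.autocast(device_type=device, dtype=amp_dtype):
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # scale the loss before backprop
scaler.step(optimizer)          # unscale gradients, then apply the fp32 update
scaler.update()
```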
All I will say is that if the work is indeed as claimed, it is a tour de force of meticulous analysis and coordination that flies in the face of the US-based firms' approach of leveraging bigger and bigger clusters to produce similar results.
I will also note that some commentators observe that DeepSeek AI has access to much larger clusters via its backer, High-Flyer, and may therefore have reached its final results with substantially more resources than the technical report suggests.
Nonetheless, my position on this work is that it:
- Might be the first glimpse of a major research effort that has really invested in LLM-based tooling for engineering the AI training pipeline
- Hints at how much juice can be squeezed out of existing hardware with enough engineering focus and talent
Are we seeing LLM-assisted takeoff for frontier model development?
The evidence for LLM-assisted development is weaker, and the argument proceeds as follows: LLMs are clearly capable of understanding fine technical detail, as evidenced by their progress on coding and computation-related benchmarks, while deep skillsets in CUDA and other hardware-aware programming areas remain relatively rare, even though technical documentation and examples abound on the public internet (for NVIDIA hardware, at least).
The advances shown by the DeepSeek team look like what I would expect from an LLM-enhanced development program: lots of smaller optimizations, each likely requiring substantial amounts of tedious experimentation and validation to investigate. The appetite to attack these problems is probably mostly cultural; DeepSeek's CEO himself says the company has an unusually young team and grants unrestricted access to big clusters regardless of hierarchy.
Layered on top of each other, three to four performance-doubling improvements compound into the roughly 11x gain needed to close the gap with Western labs.
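To make the compounding arithmetic concrete, here is a trivial sketch. The individual factors below are invented for illustration and are not taken from the DSv3 report; the point is simply that independent multiplicative wins stack.

```python
# Hypothetical example: independent speedups multiply.
# None of these factors come from the DSv3 report; they just show how a few
# roughly-2x wins compound into an order-of-magnitude improvement.
import math

speedups = {
    "lower-precision matmuls": 2.0,
    "communication/compute overlap": 1.8,
    "memory savings enabling bigger batches": 1.7,
    "kernel and scheduling tuning": 1.8,
}

total = math.prod(speedups.values())
print(f"combined speedup: {total:.1f}x")  # ~11.0x
```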
How much juice is left to squeeze?
The term 'double exponential' has come into vogue, with common usage referring to improvements in AI hardware and software along independent axes, each delivering compounding returns. My central claim is that there is a surprising amount of headroom left in how efficiently software utilizes existing hardware.
This headroom might be large enough that even if innovation in new hardware and major architectures (e.g. Transformers vs. SSMs) were to stall out, improving their utilization alone might be sufficient to reach the performance range required for economical, general-purpose machine intelligence.
Implications for NVIDIA
As an NVIDIA employee, I feel obligated to disclose my bias here. I've seen a rather large number of comments from laypeople claiming this is clearly bad news for Team Green. Their cursory reading of the situation is that since training on NVIDIA hardware can now be done far more cheaply, demand from the major AI labs and hyperscalers will eventually evaporate. I have a few thoughts on this:
- Much cheaper NVIDIA-centric training may lessen demand to a certain extent, but it deepens the moat considerably. It's clear from the DSv3 tech report that much of their success came from a deep understanding of how to wrangle technologies like InfiniBand and NVLink to avoid bottlenecks (a toy version of that communication/computation overlap pattern is sketched after this list). They also called out Hopper-specific functionality as vital for continued progress.
- As with electricity, lower marginal costs generally don't reduce revenue - they expand the total addressable market. Cheaper training means more organizations can do it on their own.
- We are still waiting on a viable approach to general-purpose AI for robotics, i.e. a robot that can be told to peel a potato or dig a hole and do it with little oversight. My hunch is that training costs for this type of model will be at least 10-100x higher because of the need for video data with high spatiotemporal resolution. A training framework that is 10x more cost-effective would put general-purpose robotic AI within striking distance, with huge implications for NVIDIA products like the Jetson series of edge computing devices.
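As promised above, here is a toy single-process sketch of hiding a gradient all-reduce behind unrelated computation using torch.distributed's async collectives. This shows only the generic overlap pattern, not DeepSeek's actual NVLink/InfiniBand scheduling; the gloo backend, loopback address, and tensor sizes are arbitrary choices made to keep the example self-contained.

```python
# Toy illustration of overlapping communication with computation via async
# collectives. Runs as a single "gloo" process so it is self-contained;
# real overlap would use NCCL across many GPUs.
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",  # arbitrary local rendezvous address
    rank=0,
    world_size=1,
)

grads = torch.randn(4096, 4096)        # pretend these are gradients to synchronize
activations = torch.randn(2048, 2048)  # independent work we can do in the meantime

# Kick off the gradient all-reduce without blocking...
work = dist.all_reduce(grads, op=dist.ReduceOp.SUM, async_op=True)

# ...and keep the processor busy with computation that doesn't depend on it.
hidden = activations @ activations.T

# Only block once the synchronized gradients are actually needed.
work.wait()
grads /= dist.get_world_size()

print("overlap complete:", hidden.shape, grads.norm().item())
dist.destroy_process_group()
```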