Introduction
The emergence of DeepSeek in 2024 has sent shockwaves through the global AI community, as this relatively young organization achieved results on par with the most advanced models at a fraction of the typical cost. DeepSeek—a Chinese AI startup founded in 2023 with under 200 employees and backed by the High-Flyer quantitative hedge fund—open-sourced its flagship model DeepSeek-R1 in early 2025, just one day before OpenAI announced its massive $500 billion “Stargate” AI project. This timing underscored DeepSeek’s ambitions to compete at the cutting edge of AI research. The attention is well warranted: DeepSeek-R1, a large language model (LLM) optimized for reasoning, matches the quality of OpenAI’s best models at roughly 10% of the training cost, and operates nearly twice as fast.
For example, DeepSeek reported spending only around $6 million to train its latest model on roughly 2,000 Nvidia H800 GPUs; by comparison, training OpenAI’s GPT-4 is estimated to have cost $80–100 million, and Meta reportedly used 16,000 top-tier H100 GPUs to train its Llama 3 models. Such dramatic improvements in efficiency and cost have instantly elevated DeepSeek from an obscure player to a centerpiece of global AI discussions.
This rise comes against the backdrop of an AI landscape that, until recently, was dominated by a few tech giants and well-funded labs. Companies like OpenAI, Google DeepMind, Meta AI, and Anthropic have led the race by leveraging enormous compute resources and vast datasets, often with proprietary models. In this context, DeepSeek’s success represents a new competitive paradigm. It demonstrates that breakthrough AI capabilities are not exclusive to trillion-dollar companies: a lean startup with novel ideas can leapfrog incumbents by innovating in model architectures and training methods rather than merely scaling up compute. The global AI community has taken note: when DeepSeek’s R1 model was released open-source, DeepSeek’s chatbot app quickly became the top free app on U.S. app stores and the model spawned over 700 open-source derivative projects within days, indicating worldwide adoption. Industry observers have described the “DeepSeek shock” as a wake-up call; the feat was even called a “positive development” by U.S. leadership for proving that comparable results can be achieved with far less expenditure. In short, DeepSeek has captured significant attention by occupying a unique niche in the AI competition: it offers cutting-edge performance through open innovation and efficiency, potentially reshaping how AI research and development will progress on the world stage.
Technical Innovations
DeepSeek’s rapid ascent is rooted in technical innovations that set its models apart from conventional large language models. In particular, the development of DeepSeek-V2 and DeepSeek-V3 introduced breakthroughs in model architecture and training efficiency that were later leveraged to create DeepSeek-R1’s advanced reasoning capabilities. Three cornerstone innovations are: Mixture-of-Experts (MoE) architecture, Multi-Head Latent Attention (MLA), and a reinforcement learning approach called Group Relative Policy Optimization (GRPO). We discuss each of these below, and then compare DeepSeek’s design with approaches used by OpenAI, Meta, and Anthropic.
Mixture-of-Experts (MoE) Architecture: DeepSeek employs a massive MoE design in its models, allowing it to scale up parameters while keeping inference computationally feasible. In an MoE model, the network is divided into many “expert” sub-models, and a gating mechanism activates only a subset of these experts for each input. DeepSeek-V3, for example, contains a total of 671 billion parameters distributed among experts, but only 37 billion parameters are activated per token during inference. This means the model can tap into a huge knowledge capacity when needed, but it doesn’t always pay the full computational cost of all parameters. By comparison, traditional dense models (like GPT-3.5 or earlier GPT-4 designs) activate every parameter for every token, which is inefficient if not all parts of the network are needed for a given query. Notably, it has been speculated that even OpenAI’s GPT-4 adopted a form of MoE internally (rumors suggest it used on the order of 16 experts of ~110B parameters each), although details were never publicly confirmed.
DeepSeek’s MoE implementation introduced its own improvements over prior MoE systems: it differentiated between fine-grained specialized experts and shared general experts, and introduced new load-balancing and routing strategies to ensure no single expert is overused. Traditionally, MoE models suffered from training instabilities and high communication overhead (since experts must be coordinated), but DeepSeek’s algorithms mitigated these issues, making even the training phase efficient, not just inference. In fact, DeepSeek-V2 (236B total parameters with 21B per-token active) was shown to save ~42.5% of training cost and boost generation throughput 5.7× relative to a dense baseline, thanks to these optimizations.
The MoE approach gives DeepSeek models essentially a “modular” brain that can allocate capacity as needed, contributing to both their strong performance and economic training.
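For readers who want to see the routing mechanism spelled out, the following is a minimal sketch of top-k expert routing in PyTorch. It is illustrative only: the layer sizes and expert count are invented, and real systems such as DeepSeek’s add shared experts, load-balancing objectives, and distributed expert parallelism on top of this basic pattern.

```python
# Minimal, illustrative top-k mixture-of-experts layer (not DeepSeek's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Gating network: scores each expert for each token.
        self.router = nn.Linear(d_model, n_experts)
        # One small feed-forward "expert" per slot.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (num_tokens, d_model)
        gate_logits = self.router(x)               # (num_tokens, n_experts)
        weights, idx = torch.topk(gate_logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # only the selected experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)
y = TinyMoELayer()(tokens)   # each token is processed by 2 of the 8 experts, not all of them
```

The compute saving comes directly from the gating step: per-token cost scales with the activated experts (here 2 of 8; in DeepSeek-V3, ~37B of 671B parameters), not with the total parameter count.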
Multi-Head Latent Attention (MLA): Another major innovation is DeepSeek’s multi-head latent attention, a technique designed to overcome the memory and speed bottlenecks of very large context windows. Modern LLMs often have context lengths of tens of thousands of tokens (for instance, OpenAI’s GPT-4 offers a 32K token context, and Anthropic’s Claude 2 expanded to 100K tokens), but handling such long inputs normally requires enormous memory to store attention keys and values for each token. DeepSeek’s MLA addresses this by compressing the key-value cache into a smaller latent representation.
In essence, instead of storing every past token’s full key and value vectors for attention, the model projects them into a lower-dimensional latent space, drastically reducing memory usage with minimal loss of information. This allows DeepSeek models to support extremely long contexts (DeepSeek-V2 and V3 support up to 128K tokens) without a prohibitive key-value cache footprint. The impact on efficiency is dramatic: memory usage during inference can be as low as 5–13% of that required by prior methods thanks to MLA.
By cutting down memory and computation for long sequences, MLA lets DeepSeek models reason over long documents or dialogues more practically. This is a distinctive edge; for comparison, most of Meta’s LLaMA models and OpenAI’s standard GPT models rely on dense attention and require specialized techniques or hardware to manage long contexts, whereas DeepSeek’s approach bakes in a scalable solution via compression. The end result is that DeepSeek models can handle long-context tasks (such as reading lengthy reports or multi-step narratives) more efficiently, and indeed DeepSeek-R1 shows substantially improved performance on long-context understanding versus its predecessor.
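The core trick, caching a small latent instead of full per-head keys and values, can be sketched in a few lines. The dimensions below are invented for illustration, and DeepSeek’s actual MLA layer differs in important details (for example, in how it handles rotary position embeddings and per-head decompression), so treat this as a conceptual sketch rather than the production layer.

```python
# Conceptual sketch of latent key-value compression (illustrative dimensions only).
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 1024, 128, 8, 128

down_kv = nn.Linear(d_model, d_latent)         # compress each hidden state into a small latent
up_k = nn.Linear(d_latent, n_heads * d_head)   # reconstruct keys from the cached latent
up_v = nn.Linear(d_latent, n_heads * d_head)   # reconstruct values from the cached latent

hidden = torch.randn(10_000, d_model)          # 10k cached token positions

# Only the compact latent is kept in the KV cache:
latent_cache = down_kv(hidden)                 # (10000, 128) instead of (10000, 2 * 1024)

# At attention time, keys and values are re-expanded from the compact cache:
k = up_k(latent_cache).view(-1, n_heads, d_head)
v = up_v(latent_cache).view(-1, n_heads, d_head)

full_bytes = hidden.numel() * 2 * 4            # naive cache: full keys + values, fp32
latent_bytes = latent_cache.numel() * 4        # compressed cache
print(f"cache size ratio: {latent_bytes / full_bytes:.1%}")   # ~6% with these toy sizes
```

With these toy dimensions the cached latent occupies roughly 6% of a naive key-value cache, the same order of magnitude as the 5–13% figure cited above.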
Group Relative Policy Optimization (GRPO) for Reinforcement Learning: DeepSeek-R1’s exceptional reasoning ability was cultivated not through the usual supervised fine-tuning alone, but heavily via a reinforcement learning (RL) regimen. To make this feasible, DeepSeek employed an RL algorithm called Group Relative Policy Optimization (GRPO). GRPO is a variant of policy optimization that forgoes the need for a separate value function (critic) network, thereby simplifying and stabilizing the RL training process.
Instead of using a learned critic to estimate the reward of each output, GRPO samples a group of candidate outputs from the current policy for each prompt, obtains reward scores for all of them, and then updates the policy by comparing relative rewards within that group.
This way, the model is trained to preferentially shift probability mass toward responses that are better than its own alternatives, without requiring an absolute value baseline. The DeepSeek team adopted GRPO as a “reasoning-oriented” RL technique to efficiently teach the model complex problem-solving behaviors.
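The group-relative mechanism is simple enough to show directly. The snippet below computes the standardized within-group advantage that GRPO uses in place of a learned critic; the binary rule-based reward and the group size are invented for the example, and the surrounding clipped-ratio and KL-penalty objective is only indicated in a comment.

```python
# Schematic of GRPO's group-relative advantage (illustrative, not DeepSeek's code).
# For each prompt, a group of answers is sampled and scored, and the within-group
# standardized reward serves as the advantage; no learned critic is required.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (group_size,) scalar rewards for one prompt's sampled outputs."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Suppose 4 sampled answers to a math problem are scored by a rule-based checker
# (1.0 = correct final answer, 0.0 = incorrect):
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = group_relative_advantages(rewards)

# The policy update then up-weights tokens of above-average answers and down-weights
# the rest, typically inside a PPO-style clipped objective with a KL penalty, e.g.:
#   loss = -E[min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv)] + beta * KL(policy || ref)
print(adv)   # tensor([ 0.8660, -0.8660, -0.8660,  0.8660])
```

Because the baseline is simply the group’s own average reward, the model only needs to learn which of its sampled answers are better than its others, which is what makes this scheme cheap enough to apply at very large scale.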
In practice, DeepSeek-R1’s training involved multiple stages: an initial “R1-Zero” model was trained from the base (pretrained) model purely with RL and no supervised data, which led to emergent reasoning strategies but also issues like poor language fluency.
They then introduced a cold-start phase with curated data and further RL fine-tuning to create the final DeepSeek-R1, balancing raw reasoning power with coherence.
GRPO was crucial in this process, as it allowed scaling RL to very large models cost-effectively. By optimizing on relative improvements, the model could effectively “self-play” different reasoning paths and learn from comparisons, significantly enhancing its chain-of-thought reasoning and self-checking abilities.
This contrasts with the more common RLHF (reinforcement learning from human feedback) used by OpenAI and Anthropic, where a human-trained reward model (or a set of predefined principles, as in Anthropic’s Constitutional AI) guides the learning. DeepSeek’s approach instead leveraged automated reward signals and massive exploration of reasoning chains, which is an intriguing alternative for developing advanced reasoning in LLMs.
Comparison with OpenAI, Meta, and Anthropic: DeepSeek’s architecture and training philosophy differ in notable ways from those of its major Western counterparts:
OpenAI: OpenAI’s flagship models (GPT-3.5, GPT-4, etc.) have historically been dense Transformer architectures, though as noted above, GPT-4 is believed to incorporate some MoE-like structure internally.
Regardless of the exact architecture, OpenAI’s approach has emphasized scale (training on extremely large clusters of the latest GPUs) and extensive supervised fine-tuning with human feedback for alignment. GPT-4’s training likely involved enormous compute on proprietary data, followed by RLHF where humans graded model outputs to train a reward model. By contrast, DeepSeek openly publishes its model weights and training techniques, and leans into algorithmic efficiency – using an MoE to cut inference cost and RL with GRPO to autonomously improve reasoning without requiring tens of thousands of human ratings. This means DeepSeek-V3/R1 achieved GPT-4-level reasoning quality with far fewer resources, e.g. ~2.8 million GPU hours (on H800s) versus an estimated 20–30 million GPU hours for GPT-4.
Another difference is context handling: OpenAI extended GPT-4’s context to 32K tokens by engineering optimizations (and possibly using multi-query attention), but DeepSeek’s MLA pushes context to 128K with a novel compression mechanism.
In benchmarks, DeepSeek models have begun to rival or even surpass OpenAI’s on certain tasks. For instance, DeepSeek-V3 achieved 88.5% accuracy on the English MMLU exam, slightly edging out the 87.2% reported for OpenAI’s GPT-4o, and DeepSeek also led on coding benchmarks like HumanEval with a pass rate above 82%.
OpenAI still holds advantages in some areas (such as coding competitions – OpenAI’s unreleased “o3” model is noted to outperform DeepSeek-R1 in Codeforces challenges), but the gap is quickly closing. In summary, OpenAI’s strategy of brute-force scale and closed development contrasts with DeepSeek’s emphasis on efficient design and open reinforcement learning, yet both have arrived at comparable levels of capability in many reasoning benchmarks.
Meta (Llama series): Meta’s LLaMA family took a different route to “open” models by releasing highly performant LLMs with downloadable weights, albeit under licenses that carry usage restrictions rather than a fully permissive grant. LLaMA models are dense Transformer networks without MoE – for example, Llama 3.1’s largest variant has 405B dense parameters across its layers.
Meta focused on training on high-quality data and optimizing training techniques (like longer context of up to 128K tokens in Llama 3.1, and improved multilingual ability) but did not implement sparse experts or latent attention compression as DeepSeek did. The result is that, compared to DeepSeek-V3, Meta’s models use more compute per token (activating all 405B parameters vs. 37B active in DeepSeek V3) to achieve similar performance. Indeed, DeepSeek-V3 and Llama 3 achieve roughly comparable scores on many tasks (Llama 3.1 is noted to closely compete in advanced language and math tasks despite fewer total parameters), which speaks to Meta’s data and training prowess. However, DeepSeek’s architecture is arguably more scalable: because adding more experts doesn’t linearly increase inference cost, DeepSeek could increase total parameters to, say, 1 trillion while keeping 37B active, whereas a dense model of that scale would be impractical to use. Another differentiator is fine-tuning and openness: Meta’s Llama 2 was fine-tuned with supervised instruction data and some RLHF, and released with usage restrictions; DeepSeek-V3 was post-trained with RL and fully open-source (MIT license). In essence, Meta’s strategy provided the weights to the community but with strings attached, while DeepSeek provided both weights and broad usage rights, plus a novel architecture. Meta’s latest models might integrate some MoE ideas in the future (especially seeing DeepSeek’s success), but as of the Llama 3 series, Meta has stuck to dense models. In terms of training cost, Meta reportedly spent on the order of $20–30 million for Llama 2 (70B) and likely more for Llama 3 (which was reportedly trained on over 15 trillion tokens using clusters of 16,000 H100 GPUs). DeepSeek-V3’s entire 14.8T token pretraining run cost only ~$5.6 million, highlighting how algorithmic innovations beat scale in cost-performance. This has put pressure on Meta and others to incorporate similar efficiency techniques or risk being outpaced by the open models in the long run.
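Two back-of-the-envelope calculations make this gap concrete. The $2-per-GPU-hour rental rate is the assumption DeepSeek itself uses when quoting the ~$5.6 million figure, and the “2 FLOPs per active parameter per token” forward-pass rule is a standard rough approximation; both are estimates rather than measurements.

```python
# Back-of-the-envelope numbers behind the cost and compute comparisons above.
# Both calculations use rough rules of thumb, not measured values.

# 1) DeepSeek-V3 pretraining cost: reported GPU-hours times an assumed rental rate.
h800_gpu_hours = 2.788e6   # reported H800 GPU-hours for the V3 pretraining run
rental_rate = 2.0          # assumed USD per H800 GPU-hour (DeepSeek's own assumption)
print(f"V3 pretraining cost ~ ${h800_gpu_hours * rental_rate / 1e6:.1f}M")   # ~ $5.6M

# 2) Per-token inference compute, using the rough "2 FLOPs per active parameter
#    per token" forward-pass approximation: dense Llama 3.1 vs. sparse DeepSeek-V3.
dense_active = 405e9       # Llama 3.1: all 405B parameters active for every token
moe_active = 37e9          # DeepSeek-V3: ~37B of 671B parameters active per token
print(f"dense: ~{2 * dense_active / 1e9:.0f} GFLOPs/token, "
      f"MoE: ~{2 * moe_active / 1e9:.0f} GFLOPs/token, "
      f"ratio ~ {dense_active / moe_active:.0f}x")
```

Even as a rough estimate, the roughly 11× gap in per-token compute illustrates why sparse activation makes very large total parameter counts plausible to train and serve.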
Anthropic: Anthropic’s approach with its Claude series (e.g. Claude 2, Claude 3.5) has been characterized by an emphasis on safety, interpretability, and extensive alignment tuning. The exact architectural details of Claude are proprietary (Anthropic has not disclosed parameter counts, but estimates range around 100–150B dense parameters for Claude 2). Claude is known for extremely long context (up to 100K tokens) and for employing Constitutional AI, an RL-based alignment method that uses a set of guiding principles to refine the model’s behavior without direct human feedback on every example. In terms of capability, Claude 3.5 is highly proficient, achieving 88.3% on MMLU and 81.7% on coding benchmarks (HumanEval) – nearly identical to DeepSeek V3’s levels.
However, these results come from a model that is closed-source and available only via API, tightly controlled for safety. The key difference with DeepSeek is that Anthropic trades some openness and perhaps raw power for alignment: Claude is praised for its reliability and harmlessness, but its training is compute-intensive and data-intensive in similar ways to OpenAI’s (Anthropic reportedly used many thousands of H100 GPUs). DeepSeek-R1, conversely, was trained for reasoning with minimal human supervision, which yielded impressive reasoning depth at the cost of sometimes less polished outputs. Anthropic might view DeepSeek’s fully open release as risky (given its focus on “responsible scaling”), but the success of DeepSeek’s methods could influence Anthropic and others to explore automated reasoning enhancement techniques in addition to human feedback. It’s worth noting that Anthropic and OpenAI have not prominently used MoE in their known models – their gains have come largely from scaling dense models and improving training data and alignment. If DeepSeek’s hybrid of sparse expert scaling and RL-based training proves superior in the long run, it may lead even these companies to adapt their model designs. In summary, Anthropic’s Claude and DeepSeek’s models currently stand at a similar performance tier on benchmarks, but DeepSeek’s open, efficient architecture represents a divergent philosophy: one prioritizing research accessibility and efficiency over tightly guarded, safety-centric development. Each approach has its merits, and the coming years will reveal whether DeepSeek’s innovations become the new norm adopted by all, or if closed models maintain an edge via superior fine-tuning and safety.
Importantly, DeepSeek’s technical advances have translated into state-of-the-art performance on many AI benchmarks, validating its approaches. DeepSeek-R1 in particular has demonstrated remarkable reasoning and problem-solving capabilities. On the MATH-500 test of advanced mathematics, R1 achieves a Pass@1 accuracy of 97.3%, essentially matching OpenAI’s o1 model (o1-1217) on that benchmark. It also scored 79.8% on the challenging AIME 2024 math competition. In coding, R1 reaches an Elo rating of 2029 on Codeforces, placing it in the top tier of competition participants, and performs strongly on software-engineering benchmarks like SWE-bench and LiveCodeBench. These results rival those of OpenAI’s and DeepMind’s best code-capable models. Moreover, R1 exhibits strengths in general reasoning: for instance, it scored 71.5% on the GPQA-Diamond benchmark (a difficult graduate-level QA task) and excelled in creative and open-ended tasks with an 87.6% win rate on AlpacaEval 2.0 and 92.3% on Arena-Hard for complex dialog. Such performance, especially in unprompted chain-of-thought and self-correction, stems from the RL-driven training regimen. Observers noted that R1 can generate very lengthy reasoning traces for a query, essentially “thinking out loud,” and even perform self-verification by cross-checking its intermediate answers. These are abilities that were not explicitly present in its supervised training data but emerged through reinforcement learning – a testament to DeepSeek’s training innovations. In side-by-side comparisons, R1 often provides more detailed explanations for its answers than competing models like OpenAI’s, even if the final answer isn’t always more accurate. The net impact is that DeepSeek’s techniques have pushed the frontier of reasoning in AI, achieving benchmark supremacy in several categories and doing so with a model that the entire research world can inspect and build upon.