Why Exactly Is DeepSeek R1 Such a Revolution in AI?
DeepSeek R1 still seems to be the topic of countless posts and discussions. Since people seem to be curious, here’s a short breakdown of what makes it stand out. We'll be skimming the surface and cutting a few corners, but hopefully this should still offer a few practical insights into the mystery. (Also, this is my best effort; I am not an AI researcher, so your mileage may vary.)
If you’re unfamiliar with DeepSeek R1, it’s a reasoning AI model from China that seemingly came out of nowhere. It appears to rival leading Western models while being trained on a fraction of the GPUs typically used by Western AI shops.
Fine-Tuning
Why though? For starters, it skips the de facto standard supervised fine-tuning step and instead learns to reason purely through reinforcement learning. In practice, this means it figures out by itself how to solve the problems it is given rather than mimicking the often suboptimal solutions devised by the meat-bags that are humans. While this approach has been used in narrow AI models before (think AlphaGo etc.), until now no one had successfully applied it to a general-purpose model.
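To make that concrete, here is a minimal Python sketch of the kind of rule-based reward the R1 paper describes (an accuracy check plus a format check). The function name, score values, and tag convention here are my own illustrative assumptions, not DeepSeek's actual code:

```python
import re

def reward(model_output: str, reference_answer: str) -> float:
    """Toy rule-based reward: no human-written example solutions needed,
    just a check of the final answer plus a check of the output format."""
    score = 0.0

    # Format reward: did the model wrap its reasoning in <think> tags?
    if re.search(r"<think>.*?</think>", model_output, flags=re.DOTALL):
        score += 0.2

    # Accuracy reward: does the part after the reasoning contain the right answer?
    final_part = model_output.split("</think>")[-1]
    if reference_answer.strip() in final_part:
        score += 1.0

    return score

# An RL algorithm (DeepSeek uses GRPO, a PPO relative) samples several
# outputs per prompt, scores them with a reward like this, and nudges the
# model toward the higher-scoring samples; no "correct" solution is ever shown.
print(reward("<think>2 + 2 is 4</think> The answer is 4.", "4"))  # 1.2
```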
Optimization
Since the company has fewer GPUs than its Western counterparts, it had to focus on optimization. DeepSeek skips parts of NVIDIA's standard libraries and instead writes what is essentially assembly (PTX, NVIDIA's low-level instruction set) for fine-grained control over communication between the GPUs in the cluster. This significantly improves both training and inference efficiency, but it also means the solution is not easily transferable to other hardware setups.
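DeepSeek's actual kernels are not public in this form, so purely to illustrate what "dropping below the standard libraries" means, here is a toy kernel that replaces a line of ordinary CUDA C with a single inline PTX instruction. It is launched from Python via CuPy (so it assumes a CUDA-capable GPU and the cupy package) and has nothing to do with DeepSeek's real communication code:

```python
import cupy as cp

# A toy element-wise kernel. Instead of letting the compiler pick the
# instruction, the multiply-add is written directly as inline PTX
# ("fma.rn.f32"), which is the level DeepSeek reportedly drops down to.
kernel = cp.RawKernel(r'''
extern "C" __global__
void fma_kernel(const float* a, const float* b, const float* c,
                float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        asm("fma.rn.f32 %0, %1, %2, %3;"
            : "=f"(r) : "f"(a[i]), "f"(b[i]), "f"(c[i]));
        out[i] = r;  // out = a * b + c, computed by the hand-written instruction
    }
}
''', 'fma_kernel')

n = 1 << 20
a, b, c = (cp.random.rand(n, dtype=cp.float32) for _ in range(3))
out = cp.empty_like(a)
kernel(((n + 255) // 256,), (256,), (a, b, c, out, cp.int32(n)))
print(cp.allclose(out, a * b + c))  # True
```

The point is not this particular instruction but the level of control: once you work at this layer, you can schedule compute and GPU-to-GPU communication exactly how you want, at the cost of portability.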
Mixture-of-Experts
DeepSeek R1 also employs a mixture-of-experts approach: a collection of specialized sub-networks ("experts") that are selectively activated as needed. While this technique is widely used, DeepSeek takes it to a new level. Typical models use 8 to 16 experts, with 4 activated at a time, meaning a substantial portion of the model is engaged for every token. In contrast, DeepSeek operates with 256 experts, activating only 8 at a time, so a far smaller fraction of the model runs per token. In practice, this means a more efficient model and more specialized experts, although such a large number of experts makes load balancing across them a non-trivial problem.
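As a rough sketch of the routing math (ignoring the shared expert, load-balancing losses, and the exact gating function DeepSeek uses), top-8-of-256 gating for a single token might look like this:

```python
import numpy as np

NUM_EXPERTS = 256   # DeepSeek-style routed experts (simplified)
TOP_K = 8           # experts activated per token

def route(token_hidden: np.ndarray, gate_weights: np.ndarray):
    """Pick the top-k experts for one token and return their mixing weights.

    token_hidden: (hidden_dim,), gate_weights: (hidden_dim, NUM_EXPERTS)
    """
    logits = token_hidden @ gate_weights          # one score per expert
    top_k = np.argsort(logits)[-TOP_K:]           # indices of the 8 best experts
    weights = np.exp(logits[top_k] - logits[top_k].max())
    weights /= weights.sum()                      # normalize over the chosen experts
    return top_k, weights

# Toy usage: only 8 of the 256 expert networks run for this token,
# roughly 3% of the routed parameters, versus the 25-50% that a
# 4-out-of-8-or-16 setup touches.
rng = np.random.default_rng(0)
hidden_dim = 1024
experts, weights = route(rng.standard_normal(hidden_dim),
                         rng.standard_normal((hidden_dim, NUM_EXPERTS)))
print(experts, weights.round(3))
```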
All of this results in a model that activates far fewer parameters per token, is cheaper to train, and is significantly more efficient to run. As a result, Western AI companies are now racing to replicate the same approach. And this is not even everything that DeepSeek (R1) brings to the table.