DeepSeek's "Thinking with Visual Primitives" — Giving AI a Finger for Spatial Reasoning
DeepSeek quietly published one of the most important multimodal AI papers of 2026 — then just as quietly deleted it. The paper, titled "Thinking with Visual Primitives," introduces a fundamentally new approach to how AI models reason about images, and it solves a problem most people didn't even know existed.
The Reference Gap: Why Language Fails at Visual Reasoning
Currently, multimodal AI models use language as the medium for visual reasoning. When GPT-5.4 or Claude Sonnet 4.6 tries to count people in a crowded photo, it narrates its thought process in words: "The person on the left... the person slightly to the right of center..." This sounds natural, but it's fundamentally unreliable. Language description is ambiguous — pronouns drift, spatial references lose precision, and by the third sentence the model has no clean way to refer back to that specific person it mentioned earlier.
DeepSeek's team identified this as the "Reference Gap" — distinct from the "Perception Gap" that most 2024-era multimodal work tried to solve. High-resolution cropping, dynamic patching, thinking with images — all of that is about seeing better. Visual primitives are about pointing better.
"Coarse-grained counting has been resolved by the Community. Fine-grained counting remains unsolved."
— Xiaokang Chen, DeepSeek researcher
The distinction matters. Seeing clearly and pointing precisely are two different cognitive skills. DeepSeek's key insight: give the model a dedicated mechanism for reference.
Visual Primitives as Inline Tokens in Chain-of-Thought
The central innovation is surprisingly elegant. Instead of having the model describe spatial relationships in words, DeepSeek introduced spatial tokens — bounding box coordinates (<|box|>) and point coordinates (<|point|>) — as first-class tokens in the model's vocabulary.
These aren't function calls or tool lookups. They appear inline in the chain of thought, the same way a human might circle something on a whiteboard mid-sentence.
When the model works through "count the bears in this photo," it doesn't just narrate:
"Found a bear<|ref|>bear<|/ref|><|box|>[[452,23,804,411]]<|/box|>. It's climbing a tree, so excluded. Looking further down and left, found another<|ref|>bear<|/ref|><|box|>[[50,447,647,771]]<|/box|>on a rock."
Each bear gets a coordinate anchor. The model can refer back to it. Reference doesn't drift. This is like giving AI a finger to point with.
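To make the anchoring idea concrete, here is a minimal sketch of how a reasoning trace in the format quoted above could be parsed back into labeled boxes. The token layout is taken from the bear example. The names `BoxRef` and `extract_boxes` are hypothetical, and the (x1, y1, x2, y2) coordinate order is an assumption, since the example does not state it.

```python
import re
from typing import NamedTuple


class BoxRef(NamedTuple):
    label: str
    box: tuple  # four ints; assumed order is (x1, y1, x2, y2)


# Matches <|ref|>label<|/ref|><|box|>[[a,b,c,d]]<|/box|> as in the bear example.
BOX_PATTERN = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>"
    r"<\|box\|>\[\[(?P<coords>\s*-?\d+\s*(?:,\s*-?\d+\s*){3})\]\]<\|/box\|>"
)


def extract_boxes(trace: str) -> list:
    """Return every labeled box found in a chain-of-thought string, in order."""
    refs = []
    for match in BOX_PATTERN.finditer(trace):
        coords = tuple(int(n) for n in match.group("coords").split(","))
        refs.append(BoxRef(match.group("label"), coords))
    return refs


trace = (
    "Found a bear<|ref|>bear<|/ref|><|box|>[[452,23,804,411]]<|/box|>. "
    "It's climbing a tree, so excluded. Looking further down and left, "
    "found another<|ref|>bear<|/ref|><|box|>[[50,447,647,771]]<|/box|>on a rock."
)
for ref in extract_boxes(trace):
    print(ref.label, ref.box)
```

Because each mention carries its own coordinates, a later reasoning step can refer to "the bear at (452, 23, 804, 411)" without relying on phrases like "the first one" that drift over a long trace.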
There are two types of primitives:
- Bounding boxes (<|box|>): for objects needing location and size
- Point coordinates (<|point|>): for abstract spatial references like maze trajectories or curve-tracing paths
The Architecture: DeepSeek V4 Flash Backbone
The model is built on DeepSeek V4 Flash, a 284-billion-parameter mixture-of-experts architecture with only 13 billion parameters active at inference. Under 5% of the weights are in use at any one time, which is how it offers frontier-grade reasoning at a fraction of the compute cost.
What makes the training pipeline particularly innovative is its five-stage, specialist-first approach:
Stage 1 — Multimodal Pre-training: Standard pre-training on trillions of tokens.
Stage 2 — Split SFT: Instead of training one model on both grounding tasks, DeepSeek trains two separate specialist models — one for bounding boxes and one for coordinate points.
Stage 3 — GRPO with Three Reward Heads: Each specialist undergoes reinforcement learning with Group Relative Policy Optimization using three separate reward signals: format (valid token output), quality (useful intermediate reasoning), and accuracy (correct final answer).
Stage 4 — Unified RFT: The two specialists are merged through reinforced fine-tuning.
Stage 5 — On-policy Distillation: The unified model is distilled into a single student model.
This separation of concerns is a genuinely new approach. Most labs do a single SFT pass on combined data followed by RLHF. DeepSeek's pipeline trains specialists first, then consolidates — applying the mixture-of-experts philosophy to the training process itself.
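The three-reward GRPO stage can be illustrated with a small sketch. It is not the paper's implementation: the reward weights, the function names, and the use of the population standard deviation are all assumptions made for illustration. The only part taken from standard GRPO is the group-relative step, in which each sampled response is scored against the mean and spread of its own group.

```python
from statistics import mean, pstdev


def combined_reward(fmt, quality, accuracy, weights=(0.2, 0.3, 0.5)):
    """Blend the format, quality, and accuracy signals into one scalar.

    The weights are hypothetical; the source does not say how the
    three rewards are combined.
    """
    w_fmt, w_quality, w_acc = weights
    return w_fmt * fmt + w_quality * quality + w_acc * accuracy


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: (format, quality, accuracy) scores.
group = [(1, 1, 1), (1, 1, 0), (1, 0, 0), (0, 0, 0)]
rewards = [combined_reward(*scores) for scores in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to judge whether a given reward is good.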
Wild Efficiency: 7,056x Visual Compression
One of the most striking numbers in the paper is the compression ratio. For a 756x756 pixel image:
- The ViT encoder generates 2,916 image patch tokens
- A 3x3 spatial compression merges these into 324 tokens
- DeepSeek V4 Flash's built-in Compressed Sparse Attention (CSA) further compresses the KV cache by 4x, leaving only 81 visual KV entries
That's a 7,056x compression ratio from raw pixels to cache entries.
For context, an 800x800 image requires approximately 870 KV entries for Claude Sonnet 4.6 and about 1,100 for Gemini 3 Flash. DeepSeek needs roughly 90.
The paper's argument: precise spatial referencing ability can, to some extent, compensate for fewer visual tokens. The model doesn't need to "see more" — it needs to "point more accurately."
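The arithmetic behind the 7,056x figure can be checked in a few lines. This is a sketch under one stated assumption: a 14-pixel ViT patch size, which is inferred from 2,916 = 54 x 54 patches on a 756-pixel side and is not stated in the text above. The function name `visual_kv_entries` is hypothetical.

```python
def visual_kv_entries(side_px, patch_px=14, merge=3, kv_compress=4):
    """Return (patch tokens, merged tokens, KV entries) for a square image.

    patch_px=14 is an assumption inferred from the reported token counts.
    """
    patches_per_side = side_px // patch_px           # 756 // 14 = 54
    patch_tokens = patches_per_side ** 2             # ViT patch tokens
    merged_tokens = (patches_per_side // merge) ** 2 # after 3x3 spatial merge
    kv_entries = merged_tokens // kv_compress        # after 4x KV compression
    return patch_tokens, merged_tokens, kv_entries


patch_tokens, merged_tokens, kv_entries = visual_kv_entries(756)
ratio = (756 * 756) / kv_entries  # raw pixels per cached visual entry
print(patch_tokens, merged_tokens, kv_entries, ratio)
```

For the 756x756 case this reproduces the 2,916, 324, and 81 figures and a ratio of 7,056. Dividing 800 x 800 pixels by the same ratio gives about 90.7, in line with the figure of roughly 90 entries quoted for an 800x800 image.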
Benchmark Performance
The headline result is maze navigation, where DeepSeek scores 67% against:
- Gemini 3 Flash: 49%
- GPT-5.4: 50%
- Claude Sonnet 4.6: 49%
That's a 17-point gap over GPT-5.4, and 18 points over the other two, on a task requiring path-following through spatial structure. Path tracing shows a similar advantage. Counting and general spatial reasoning are more mixed: Gemini 3 Flash remains competitive on raw count QA.
The paper includes an unusually honest footnote: "reported scores cover only a subset of evaluation dimensions directly relevant to the research focus of this paper and are therefore not indicative of the model's overall capabilities." They aren't claiming to beat GPT-5.4 across the board — they're claiming to beat it on visually grounded spatial reasoning. That's a narrower, more defensible claim, and the transparency is refreshing.
The Paper That Disappeared
The paper appeared on arXiv and GitHub on April 30, 2026, alongside a teaser post from DeepSeek researcher Xiaokang Chen showing the company's whale logo lifting a blindfold to reveal eyes — captioned "Now, we see you."
Within days, the paper was pulled from the DeepSeek GitHub repository with no explanation. Community members quickly mirrored it, and the document remains accessible through third-party repos and Hugging Face archives. DeepSeek has not commented on why it was removed.
The timing is curious. DeepSeek is currently rolling out multimodal capabilities in limited beta, and early testers report impressions that line up with the paper's claims — notably precise object counting and the ability to recreate web pages from screenshots or Figma exports. One tester gave DeepSeek a photo near their office and watched the reasoning log work through every building in the surrounding area from memory, with no internet connection required.
What This Means for the Multimodal Landscape
This paper represents a genuine architectural shift in how multimodal models handle vision. Here's why it matters:
- First-principles thinking: Most multimodal advances have focused on better encoders or more data. DeepSeek rethought the fundamental question of how a model should represent visual space during reasoning.
- Efficiency as a feature: The 7,056x compression isn't just impressive — it's strategic. Running fewer KV entries per image means cheaper inference, lower latency, and easier scaling. This is how you build vision models that don't need a cluster of H100s per request.
- The specialist approach: Training separate models for different grounding modalities and merging them later is likely to become more common. It sidesteps the interference problems that arise when one model tries to learn multiple spatial representations simultaneously.
- Chinese AI closing the gap: This paper follows a pattern of genuine innovation from Chinese labs — DeepSeek's earlier Janus architecture for decoupled visual encoding, and now this. The Stanford AI Index 2026 report notes that the U.S. model lead has shrunk to just 2.7% as of March 2026. DeepSeek is a major reason why.
Frequently Asked Questions
What are visual primitives?
Visual primitives are spatial tokens (bounding box coordinates and point coordinates) that the model uses inline during its chain-of-thought reasoning. They serve as precise, non-ambiguous references to objects in an image — like an AI's finger for pointing.
Why was the paper deleted?
DeepSeek has not provided a reason for pulling the paper and code from their GitHub repository. Community mirrors remain available, and the technical report can still be accessed through archived copies.
Is this model available to use?
The visual reasoning capabilities are rolling out in limited beta on the DeepSeek website and app. The model weights will be integrated into DeepSeek's foundation model and released in the future. The paper states the team plans to release in-house benchmarks and a subset of training data publicly.
How does this compare to GPT-5.4 or Claude Sonnet 4.6?
On maze navigation, DeepSeek significantly outperforms both (67% vs 50% and 49% respectively), and path tracing shows a similar advantage. On general vision-language tasks outside the paper's focus, comparisons aren't available; the authors explicitly scope their claims to spatial and visual reasoning benchmarks.
Does the model need more GPUs to run?
No. If anything, the opposite: the 7,056x visual compression means each image occupies far fewer KV cache entries than in competing models, making inference cheaper and faster despite the 284B total parameter count.