DeepSeek Releases DeepSeek-OCR-2: A Document Understanding Model with Visual Causal Flow
On January 27, 2026, DeepSeek officially released its latest open-source model, DeepSeek-OCR-2, introducing the new DeepEncoder V2 vision encoder. The architecture breaks free from the fixed scanning order of traditional models (top-left to bottom-right) and instead mimics the “Causal Flow” of human vision, letting the model dynamically rearrange image segments according to their meaning.
Core Innovation: Visual Causal Flow
Breaking Fixed Scanning Order
Traditional Vision-Language Models (VLMs) typically process images in a fixed raster-scan order (top-left to bottom-right). This rigid approach does not match human visual perception: humans scan flexibly based on content, so for complex layouts such as tables, formulas, and multi-column text, a fixed scan order can introduce errors.
DeepSeek-OCR-2 uses the new DeepEncoder V2 encoder, which gives the model a “Visual Causal Flow” capability: it can dynamically reorder visual tokens based on image content.
DeepEncoder V2 Architecture
DeepEncoder V2 employs a customized Attention Mask strategy:
Visual Token Section
- Retains bidirectional attention mechanism
- Ensures the model has a global receptive field like CLIP
- Captures overall image features
Causal Flow Token Section
- Adopts causal attention mechanism (similar to Decoder-only LLM)
- Each query token can only attend to previous tokens
- Achieves intelligent reordering of visual information
Through this design, visual tokens maintain global information interaction, while causal flow tokens gain the ability to reorder visual information.
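The mask split can be made concrete. Below is a minimal PyTorch sketch of how such a hybrid attention mask might be built; the token counts and exact visibility rules are illustrative assumptions, not the released implementation.

```python
import torch

def build_hybrid_attention_mask(num_visual: int, num_flow: int) -> torch.Tensor:
    """Boolean mask (True = attention allowed), laid out as
    [visual tokens | causal-flow tokens]. Illustrative only.
    """
    n = num_visual + num_flow
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Visual block: full bidirectional attention (CLIP-like global view).
    mask[:num_visual, :num_visual] = True

    # Flow tokens see all visual tokens...
    mask[num_visual:, :num_visual] = True
    # ...and attend causally among themselves: each flow token sees only
    # itself and earlier flow tokens, enabling learned reordering.
    mask[num_visual:, num_visual:] = torch.tril(
        torch.ones(num_flow, num_flow, dtype=torch.bool))
    return mask

# Example: 16 visual tokens followed by 8 causal-flow tokens.
print(build_hybrid_attention_mask(16, 8).shape)  # torch.Size([24, 24])
```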
Built on Qwen2-0.5B
In the implementation, the DeepSeek team instantiates this architecture with Qwen2-0.5B, bringing a lightweight language model's causal-reasoning capability into the visual encoding stage.
Technical Architecture
Two-Stage Reasoning Loop
DeepSeek-OCR-2 demonstrates a “two cascaded 1D causal reasoners” pattern:
- First Stage (Encoder): reading-logic reasoning
  - Completes semantic ordering within DeepEncoder V2
  - Dynamically adjusts token order based on document structure
- Second Stage (Decoder): visual-task reasoning
  - Focuses on autoregressive generation in the decoder
  - Generates text from the reordered visual information
This approach decomposes 2D understanding into two complementary sub-tasks, representing a breakthrough architectural method for achieving true 2D reasoning.
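As a schematic of that data flow, here is a toy sketch with hypothetical stand-in modules (ToyEncoder, ToyDecoder); it shows only the two-pass structure, not the real API.

```python
import torch

# Toy illustration of the two cascaded 1D causal reasoners. All modules
# here are hypothetical stand-ins, not DeepSeek-OCR-2's actual API.

class ToyEncoder:
    def encode(self, image: torch.Tensor) -> torch.Tensor:
        # Stage 1a: flatten the 2D patch grid into a 1D token sequence.
        return image.flatten(0, 1)

    def reorder(self, tokens: torch.Tensor) -> torch.Tensor:
        # Stage 1b: stand-in for causal-flow reordering. The real model
        # learns a semantic reading order; a permutation marks the idea.
        perm = torch.randperm(tokens.shape[0])
        return tokens[perm]

class ToyDecoder:
    def generate(self, inputs_embeds: torch.Tensor) -> list:
        # Stage 2: stand-in for autoregressive decoding conditioned on
        # the reordered visual tokens.
        return inputs_embeds.sum(dim=-1).argsort().tolist()

image = torch.randn(4, 4, 8)                      # 4x4 grid of 8-dim patches
encoder, decoder = ToyEncoder(), ToyDecoder()
ordered = encoder.reorder(encoder.encode(image))  # stage 1: reading order
ids = decoder.generate(inputs_embeds=ordered)     # stage 2: generation
print(len(ids))                                   # 16
```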
Multi-Crop Strategy
DeepSeek-OCR-2 employs a multi-crop strategy:
- The cropping scheme varies with image resolution
- 256 coarse-grained queries are generated at the 1024×1024 global resolution
- 144 high-precision queries are generated per 768×768 crop in detailed regions
- The final reordered visual tokens fed to the LLM number between 256 and 1120
This keeps details such as formulas, stamps, and small text annotations from being lost (a token-budget sketch follows).
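Taken together, these numbers pin down the token budget. A small sketch, assuming the total is the 256 global queries plus 144 per crop with at most six crops, which exactly reproduces the reported 256 to 1120 range:

```python
def visual_token_budget(num_crops: int) -> int:
    """Token count fed to the LLM under the multi-crop scheme.
    The combination rule (global + per-crop queries, at most six crops)
    is inferred from the reported 256..1120 range; treat it as an
    assumption, not the released logic.
    """
    GLOBAL_QUERIES = 256  # coarse queries on the 1024x1024 global view
    CROP_QUERIES = 144    # high-precision queries per 768x768 crop
    MAX_CROPS = 6         # 256 + 6 * 144 = 1120, the reported maximum

    if not 0 <= num_crops <= MAX_CROPS:
        raise ValueError(f"num_crops must be in [0, {MAX_CROPS}]")
    return GLOBAL_QUERIES + CROP_QUERIES * num_crops

print(visual_token_budget(0))  # 256  (simple page, global view only)
print(visual_token_budget(6))  # 1120 (dense page, maximum detail)
```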
Visual Tokenizer Optimization
- Uses an 80M-parameter SAM-base architecture
- Output dimension compressed from 1024 to 896
- Combined with a 16x token-compression ratio
- Significantly reduces the computational overhead of global attention (see the sketch after this list)
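One common way to get a 16x token reduction is a 4×4 space-to-depth merge followed by a linear projection; the sketch below uses that as an assumption to show how 1024-dim SAM features could be compressed to 896-dim tokens. It is illustrative, not the released code.

```python
import torch
import torch.nn as nn

class ToyTokenCompressor(nn.Module):
    """Illustrative 16x token compression plus channel projection
    (1024 -> 896). The actual mechanism in DeepEncoder V2 may differ;
    the 4x4 merge is an assumption.
    """

    def __init__(self, in_dim: int = 1024, out_dim: int = 896):
        super().__init__()
        # 4x4 neighborhood merge: 16x fewer tokens, 16x wider channels.
        self.merge = nn.PixelUnshuffle(downscale_factor=4)
        self.proj = nn.Linear(in_dim * 16, out_dim)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C=1024, H, W) feature map from the SAM-base backbone.
        x = self.merge(feat)              # (B, 16C, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)  # (B, HW/16, 16C) token sequence
        return self.proj(x)               # (B, HW/16, 896)

feat = torch.randn(1, 1024, 64, 64)       # 4096 patch features
tokens = ToyTokenCompressor()(feat)
print(tokens.shape)                        # torch.Size([1, 256, 896])
```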
Decoder Architecture
- Retains the 3B sparse MoE architecture
- Only about 500M parameters are activated per token
- Balances performance against deployment cost (see the arithmetic sketch below)
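The deployment benefit follows from simple arithmetic. In the sketch below, the expert count, top-k, and shared-parameter size are invented assumptions chosen only so the totals roughly match the reported figures:

```python
# Back-of-the-envelope sketch of sparse MoE activation. Every number
# below except the 3B total and ~500M active target is an assumption.

shared_params = 0.2e9      # embeddings, attention, shared layers (assumed)
num_experts = 64           # assumed
params_per_expert = (3.0e9 - shared_params) / num_experts
top_k = 6                  # experts activated per token (assumed)

active = shared_params + top_k * params_per_expert
print(f"total: {3.0e9:.2e}  active: {active:.2e}")
# active is about 4.6e8: only a fraction of the 3B parameters run per token
```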
Performance
OmniDocBench v1.5
On this authoritative benchmark, which spans nine major categories and 1,355 document pages, including magazines, academic papers, and whitepapers:
- Overall Accuracy: 91.09% (record-breaking)
- Improvement over Previous Generation: 3.73%
- Reading Order Edit Distance: Reduced from 0.085 to 0.057
Production Environment Performance
- Online Service Repetition Rate Reduction: 33% (6.25% → 4.17%)
- PDF Production Data Repetition Rate Reduction: 22% (3.69% → 2.88%)
Comparison with Gemini-3 Pro
In document-parsing edit distance (lower is better):
- DeepSeek-OCR-2: 0.100
- Gemini-3 Pro: 0.115
Reading-order accuracy improved by over 34%.
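The scores above are normalized edit distances. A minimal sketch of the metric, assuming the standard Levenshtein normalization (OmniDocBench's exact recipe may differ):

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string's length
    (0 = perfect, 1 = completely wrong)."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, n)

print(normalized_edit_distance("DeepSeek-OCR-2", "DeepSeek-OCR2"))  # ~0.07
```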
Training Strategy
Data Distribution Optimization
- OCR Data Proportion: 80%
- Text/Formula/Table Sampling Ratio: 3:1:1 (see the toy sampler after this list)
- Merges semantically similar labels like “captions/titles”
- Significantly improves generalization for real-world scenarios like academic PDFs, financial reports, and tender documents
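A toy weighted sampler showing what a 3:1:1 text/formula/table ratio means in practice; the pool names and sampling mechanism are assumptions for illustration:

```python
import random

# Reproduces the reported 3:1:1 sampling ratio within the OCR portion
# of the training mix. Illustrative only.
pools = {"text": 3, "formula": 1, "table": 1}
categories, weights = zip(*pools.items())

def sample_category() -> str:
    return random.choices(categories, weights=weights, k=1)[0]

counts = {c: 0 for c in categories}
for _ in range(10_000):
    counts[sample_category()] += 1
print(counts)  # roughly {'text': 6000, 'formula': 2000, 'table': 2000}
```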
Chinese Document Optimization
The training strategy is tailored to the characteristics of Chinese documents, and the model performs especially well on complex Chinese layouts.
Application Scenarios
DeepSeek-OCR-2 is particularly suitable for:
Academic Document Processing
- Paper PDF to Markdown conversion
- Complex formula recognition
- Multi-column layout understanding
- Reference extraction
Business Document Analysis
- Financial statement parsing
- Contract text extraction
- Tender document processing
- Invoice recognition
Technical Documentation Conversion
- Technical manual digitization
- API documentation extraction
- Code comment recognition
Multilingual Documents
- Supports 100+ languages
- Mixed language document processing
- Maintains original format structure
Technical Significance
Toward Unified Multimodal Encoder
The DeepSeek team believes this provides a promising path toward a unified multimodal encoder. In the future, a single encoder might achieve feature extraction and compression for images, audio, and text within the same parameter space by configuring modality-specific learnable queries.
New Paradigm for Visual Encoding
If the original DeepSeek-OCR first made the industry realize that “visual compression” might be a seriously underestimated technical route, then DeepSeek-OCR-2 clearly doubles down on that path.
DeepEncoder V2 no longer views visual encoding as a static, fixed-strategy scanning process, but introduces a semantically-driven dynamic encoding mechanism. The model begins judging which regions are more likely to carry key information during the encoding stage and adjusts visual token allocation and expression accordingly.
In other words, visual encoding is no longer just “preprocessing” but has already entered the “understanding stage” in advance.
Open Source and Availability
DeepSeek-OCR-2 is fully open-source, providing:
- Model weights
- Complete code
- Technical report
Access
- GitHub Project: https://github.com/deepseek-ai/DeepSeek-OCR-2
- HuggingFace Model: https://huggingface.co/deepseek-ai/DeepSeek-OCR-2
- Technical Paper: https://github.com/deepseek-ai/DeepSeek-OCR-2/blob/main/DeepSeek_OCR2_paper.pdf
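With the weights on HuggingFace, loading should follow the usual trust_remote_code path. The sketch below is hedged: the inference entry point (model.infer) and its arguments mirror the original DeepSeek-OCR repository's style and are assumptions; check the DeepSeek-OCR-2 README for the actual API.

```python
from transformers import AutoModel, AutoTokenizer

# Loads the released checkpoint via transformers' trust_remote_code path.
model_id = "deepseek-ai/DeepSeek-OCR-2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().cuda()

# Hypothetical call (styled after the original DeepSeek-OCR repo):
# convert a scanned page to Markdown.
result = model.infer(tokenizer, prompt="<image>\nConvert to Markdown.",
                     image_file="page.png")
print(result)
```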
Community Support
Community developers have already provided ComfyUI integration for DeepSeek-OCR-2:
- ComfyUI-DeepSeek-OCR: https://github.com/1038lab/ComfyUI-DeepSeek-OCR
Although still at version 0.0.1 (beta), it gives ComfyUI users a convenient way to try the model.