
DeepSeek Releases DeepSeek-OCR-2 - Document Understanding Model with Visual Causal Flow

On January 27, 2026, DeepSeek officially released its latest open-source model, DeepSeek-OCR-2, introducing the new DeepEncoder V2 vision encoder. The architecture abandons the fixed scanning order of traditional models (top-left to bottom-right) and instead mimics the causal flow of human vision, allowing the model to dynamically reorder image segments according to their meaning.

Core Innovation: Visual Causal Flow

Breaking Fixed Scanning Order

Traditional Vision-Language Models (VLMs) typically process images in a fixed raster-scan order (top-left to bottom-right). This rigid approach does not match human visual perception: humans scan flexibly based on content, and for complex layouts such as tables, formulas, and multi-column text, a fixed scan can corrupt the reading order and introduce errors.

DeepSeek-OCR-2 utilizes the new DeepEncoder V2 encoder, giving the model “Visual Causal Flow” capability, allowing it to dynamically reorder visual tokens based on image content.

DeepEncoder V2 Architecture

DeepEncoder V2 employs a customized Attention Mask strategy:

Visual Token Section

  • Retains bidirectional attention mechanism
  • Ensures the model has a global receptive field like CLIP
  • Captures overall image features

Causal Flow Token Section

  • Adopts causal attention mechanism (similar to Decoder-only LLM)
  • Each query token can only attend to previous tokens
  • Achieves intelligent reordering of visual information

Through this design, visual tokens maintain global information interaction, while causal flow tokens gain the ability to reorder visual information.
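The mask design described above can be sketched as a boolean attention matrix. This is a minimal illustration of the described scheme, not DeepSeek's released code; the token layout (visual tokens first, causal-flow tokens after them) is an assumption:

```python
import numpy as np

def hybrid_attention_mask(n_visual: int, n_flow: int) -> np.ndarray:
    """Sketch of the DeepEncoder V2 masking scheme described above.

    Rows are query positions, columns are key positions; True = may attend.
    Visual tokens occupy positions [0, n_visual), flow tokens come after.
    """
    n = n_visual + n_flow
    mask = np.zeros((n, n), dtype=bool)

    # Visual tokens: full bidirectional attention within the visual block,
    # giving the encoder a CLIP-like global receptive field.
    mask[:n_visual, :n_visual] = True

    # Causal-flow tokens: each query sees all visual tokens plus only the
    # earlier flow tokens (decoder-style causal attention).
    for q in range(n_visual, n):
        mask[q, :n_visual] = True        # global view of visual tokens
        mask[q, n_visual:q + 1] = True   # causal within the flow section

    return mask
```

Feeding such a mask into a standard attention layer yields exactly the split behavior the article describes: global interaction for visual tokens, ordered reasoning for flow tokens.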

Built on Qwen2-0.5B

In implementation, the DeepSeek team uses Qwen2-0.5B to instantiate this architecture, introducing lightweight language model causal reasoning capabilities into the visual encoding stage.

Technical Architecture

Two-Stage Reasoning Loop

DeepSeek-OCR-2 follows a pattern of two cascaded 1D causal reasoners:

  1. First Stage (Encoder): Reading logic reasoning

    • Completes semantic ordering within DeepEncoder V2
    • Dynamically adjusts token order based on document structure
  2. Second Stage (Decoder): Visual task reasoning

    • Focuses on autoregressive generation in the decoder
    • Generates text based on reordered visual information

This approach decomposes 2D document understanding into two complementary 1D sub-tasks, an architectural route toward true 2D reasoning.
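The two-stage loop can be illustrated with a toy pipeline. Everything here is a hypothetical stand-in: the real model derives the reading order from causal-flow attention inside DeepEncoder V2, while this sketch simply sorts tokens by a per-token score and then consumes them in a growing causal context:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_with_causal_flow(patch_feats: np.ndarray, flow_scores: np.ndarray):
    """Stage 1 (hypothetical): reorder visual tokens into a reading order.
    Here the order is a plain argsort of per-token scores."""
    order = np.argsort(flow_scores)
    return patch_feats[order], order

def decode_autoregressively(ordered_feats: np.ndarray, steps: int):
    """Stage 2 (toy stand-in): the decoder consumes the reordered tokens
    autoregressively; here we just emit dummy integer 'token ids'."""
    out = []
    for t in range(steps):
        ctx = ordered_feats[: t + 1].mean(axis=0)  # context grows causally
        out.append(int(np.argmax(ctx) % 100))      # dummy token id
    return out

feats = rng.normal(size=(6, 8))    # 6 visual tokens, 8-dim features
scores = rng.normal(size=6)        # stand-in for learned flow scores
ordered, order = encode_with_causal_flow(feats, scores)
tokens = decode_autoregressively(ordered, steps=4)
```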

Multi-Crop Strategy

DeepSeek-OCR-2 employs a multi-crop strategy:

  • The number of crops varies with image resolution
  • The 1024×1024 global view produces 256 coarse-grained queries
  • Each 768×768 detail crop adds 144 high-precision queries
  • The final reordered visual tokens fed to the LLM number between 256 and 1120

This preserves fine details such as formulas, stamps, and small text annotations.
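The token budget implied by these figures can be computed directly. Note the 1120 upper bound from the article is consistent with at most six detail crops, since (1120 − 256) / 144 = 6; the cap of six is an inference from those numbers, not a stated spec:

```python
def visual_token_budget(num_crops: int) -> int:
    """Token count fed to the LLM under the multi-crop scheme above:
    256 coarse queries from the 1024x1024 global view plus 144
    high-precision queries per 768x768 detail crop."""
    GLOBAL_QUERIES = 256  # coarse queries, 1024x1024 view
    CROP_QUERIES = 144    # per 768x768 detail crop
    assert 0 <= num_crops <= 6, "keeps the total within the 256..1120 range"
    return GLOBAL_QUERIES + CROP_QUERIES * num_crops
```

For example, a zero-crop page uses the 256-token floor, while a six-crop page hits the 1120-token ceiling.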

Visual Tokenizer Optimization

  • Uses 80M parameter SAM-base architecture
  • Output dimension compressed from 1024 to 896
  • Combined with 16x token compression ratio
  • Significantly reduces global attention computation overhead
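A back-of-the-envelope calculation shows why the compression matters, using token-pair count times feature dimension as a crude proxy for global attention cost. The 4096-token pre-compression baseline is an assumed figure for illustration only:

```python
def attention_cost(num_tokens: int, dim: int) -> int:
    """Rough global self-attention cost proxy: token-pair count times
    feature dimension (ignoring heads, constants, and FFN cost)."""
    return num_tokens * num_tokens * dim

# Figures from the article: 16x token compression, dim 1024 -> 896.
# 4096 is an assumed baseline token count for illustration.
baseline = attention_cost(4096, 1024)
compressed = attention_cost(4096 // 16, 896)
reduction = baseline / compressed  # ~16^2 * (1024/896) ≈ 293x
```

The quadratic term dominates: 16× fewer tokens alone gives a 256× reduction in attention pairs, and the 1024 → 896 dimension cut adds a further modest saving.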

Decoder Architecture

  • Retains the 3B-parameter MoE sparse architecture of the previous generation
  • Only about 500M parameters are activated per token
  • Balances performance with deployment cost

Performance

OmniDocBench v1.5

On this authoritative benchmark, which covers 1,355 document pages across 9 major categories, including magazines, papers, and whitepapers:

  • Overall Accuracy: 91.09% (record-breaking)
  • Improvement over Previous Generation: 3.73%
  • Reading Order Edit Distance: Reduced from 0.085 to 0.057

Production Environment Performance

  • Online Service Repetition Rate Reduction: 33% (6.25% → 4.17%)
  • PDF Production Data Repetition Rate Reduction: 22% (3.69% → 2.88%)

Comparison with Gemini-3 Pro

In document parsing edit distance:

  • DeepSeek-OCR-2: 0.100
  • Gemini-3 Pro: 0.115

Reading order accuracy improved by over 34%.

Training Strategy

Data Distribution Optimization

  • OCR Data Proportion: 80%
  • Text/Formula/Table Sampling Ratio: 3:1:1
  • Merges semantically similar labels like “captions/titles”
  • Significantly improves generalization for real-world scenarios like academic PDFs, financial reports, and tender documents
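The stated data mix can be turned into concrete per-category sampling weights. The 3:1:1 split is applied within the 80% OCR share here; treating the remaining 20% as a single "other" bucket is an assumption, since the article does not break it down:

```python
def sampling_weights() -> dict:
    """Data mix described above: 80% OCR data overall, split 3:1:1
    across text / formula / table within the OCR share."""
    ocr_share = 0.80
    text, formula, table = 3, 1, 1
    total = text + formula + table
    return {
        "ocr/text": ocr_share * text / total,      # 0.48
        "ocr/formula": ocr_share * formula / total,  # 0.16
        "ocr/table": ocr_share * table / total,      # 0.16
        "other": 1.0 - ocr_share,                    # 0.20 (assumed bucket)
    }
```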

Chinese Document Optimization

The training strategy was tuned to the characteristics of Chinese documents, and the model performs particularly well on complex Chinese layouts.

Application Scenarios

DeepSeek-OCR-2 is particularly suitable for:

Academic Document Processing

  • Paper PDF to Markdown conversion
  • Complex formula recognition
  • Multi-column layout understanding
  • Reference extraction

Business Document Analysis

  • Financial statement parsing
  • Contract text extraction
  • Tender document processing
  • Invoice recognition

Technical Documentation Conversion

  • Technical manual digitization
  • API documentation extraction
  • Code comment recognition

Multilingual Documents

  • Supports 100+ languages
  • Mixed language document processing
  • Maintains original format structure

Technical Significance

Toward Unified Multimodal Encoder

The DeepSeek team believes this provides a promising path toward a unified multimodal encoder. In the future, a single encoder might achieve feature extraction and compression for images, audio, and text within the same parameter space by configuring modality-specific learnable queries.

New Paradigm for Visual Encoding

If DeepSeek-OCR 1 made the industry first realize that “visual compression” might be a seriously underestimated technical route, then DeepSeek-OCR-2 clearly decided to take this path more aggressively.

DeepEncoder V2 no longer views visual encoding as a static, fixed-strategy scanning process, but introduces a semantically-driven dynamic encoding mechanism. The model begins judging which regions are more likely to carry key information during the encoding stage and adjusts visual token allocation and expression accordingly.

In other words, visual encoding is no longer just “preprocessing” but has already entered the “understanding stage” in advance.

Open Source and Availability

DeepSeek-OCR-2 is fully open-source, providing:

  • Model weights
  • Complete code
  • Technical report


Community Support

Community developers have already provided ComfyUI integration for DeepSeek-OCR-2. Although currently at version V0.0.1 (beta), it offers a convenient way for ComfyUI users to try the model.