
DeepSeek Releases DeepSeek-OCR-2 - Document Understanding Model with Visual Causal Flow

On January 27, 2026, DeepSeek officially released its latest open-source model, DeepSeek-OCR-2, introducing the new DeepEncoder V2 vision encoder. The architecture abandons the fixed scanning order of traditional models (top-left to bottom-right) and instead mimics the causal flow of human vision, allowing the model to dynamically reorder image segments according to their meaning.

Core Innovation: Visual Causal Flow

Breaking Fixed Scanning Order

Traditional Vision-Language Models (VLMs) typically process images in a fixed raster-scan order (top-left to bottom-right). This rigid approach does not match human visual perception: humans scan flexibly based on content, and for complex layouts such as tables, formulas, and multi-column text, a fixed scan can corrupt the reading order and introduce errors.

DeepSeek-OCR-2 utilizes the new DeepEncoder V2 encoder, giving the model “Visual Causal Flow” capability, allowing it to dynamically reorder visual tokens based on image content.

DeepEncoder V2 Architecture

DeepEncoder V2 employs a customized Attention Mask strategy:

Visual Token Section

  • Retains bidirectional attention mechanism
  • Ensures the model has a global receptive field like CLIP
  • Captures overall image features

Causal Flow Token Section

  • Adopts causal attention mechanism (similar to Decoder-only LLM)
  • Each query token can only attend to previous tokens
  • Achieves intelligent reordering of visual information

Through this design, visual tokens maintain global information interaction, while causal flow tokens gain the ability to reorder visual information.
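The mask design described above can be sketched as a boolean attention matrix. This is a minimal illustration of the described scheme, not DeepSeek's released code; the token layout (visual tokens first, causal-flow tokens after them) is an assumption:

```python
import numpy as np

def hybrid_attention_mask(n_visual: int, n_flow: int) -> np.ndarray:
    """Sketch of the DeepEncoder V2 masking scheme described above.

    Rows are query positions, columns are key positions; True = may attend.
    Visual tokens occupy positions [0, n_visual), flow tokens come after.
    """
    n = n_visual + n_flow
    mask = np.zeros((n, n), dtype=bool)

    # Visual tokens: full bidirectional attention within the visual block,
    # giving the encoder a CLIP-like global receptive field.
    mask[:n_visual, :n_visual] = True

    # Causal-flow tokens: each query sees all visual tokens plus only the
    # earlier flow tokens (decoder-style causal attention).
    for q in range(n_visual, n):
        mask[q, :n_visual] = True        # global view of visual tokens
        mask[q, n_visual:q + 1] = True   # causal within the flow section

    return mask
```

Feeding such a mask into a standard attention layer yields exactly the split behavior the article describes: global interaction for visual tokens, ordered reasoning for flow tokens.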

Built on Qwen2-0.5B

In implementation, the DeepSeek team uses Qwen2-0.5B to instantiate this architecture, introducing lightweight language model causal reasoning capabilities into the visual encoding stage.

Technical Architecture

Two-Stage Reasoning Loop

DeepSeek-OCR-2 follows a pattern of two cascaded 1D causal reasoners:

  1. First Stage (Encoder): Reading logic reasoning

    • Completes semantic ordering within DeepEncoder V2
    • Dynamically adjusts token order based on document structure
  2. Second Stage (Decoder): Visual task reasoning

    • Focuses on autoregressive generation in the decoder
    • Generates text based on reordered visual information

This approach decomposes 2D document understanding into two complementary 1D sub-tasks, an architectural route toward true 2D reasoning.
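The two-stage loop can be illustrated with a toy pipeline. Everything here is a hypothetical stand-in: the real model derives the reading order from causal-flow attention inside DeepEncoder V2, while this sketch simply sorts tokens by a per-token score and then consumes them in a growing causal context:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_with_causal_flow(patch_feats: np.ndarray, flow_scores: np.ndarray):
    """Stage 1 (hypothetical): reorder visual tokens into a reading order.
    Here the order is a plain argsort of per-token scores."""
    order = np.argsort(flow_scores)
    return patch_feats[order], order

def decode_autoregressively(ordered_feats: np.ndarray, steps: int):
    """Stage 2 (toy stand-in): the decoder consumes the reordered tokens
    autoregressively; here we just emit dummy integer 'token ids'."""
    out = []
    for t in range(steps):
        ctx = ordered_feats[: t + 1].mean(axis=0)  # context grows causally
        out.append(int(np.argmax(ctx) % 100))      # dummy token id
    return out

feats = rng.normal(size=(6, 8))    # 6 visual tokens, 8-dim features
scores = rng.normal(size=6)        # stand-in for learned flow scores
ordered, order = encode_with_causal_flow(feats, scores)
tokens = decode_autoregressively(ordered, steps=4)
```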

Multi-Crop Strategy

DeepSeek-OCR-2 employs a multi-crop strategy:

  • The number of crops varies with image resolution
  • The 1024×1024 global view produces 256 coarse-grained queries
  • Each 768×768 detail crop adds 144 high-precision queries
  • The final reordered visual tokens fed to the LLM number between 256 and 1120

This preserves fine details such as formulas, stamps, and small text annotations.
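The token budget implied by these figures can be computed directly. Note the 1120 upper bound from the article is consistent with at most six detail crops, since (1120 − 256) / 144 = 6; the cap of six is an inference from those numbers, not a stated spec:

```python
def visual_token_budget(num_crops: int) -> int:
    """Token count fed to the LLM under the multi-crop scheme above:
    256 coarse queries from the 1024x1024 global view plus 144
    high-precision queries per 768x768 detail crop."""
    GLOBAL_QUERIES = 256  # coarse queries, 1024x1024 view
    CROP_QUERIES = 144    # per 768x768 detail crop
    assert 0 <= num_crops <= 6, "keeps the total within the 256..1120 range"
    return GLOBAL_QUERIES + CROP_QUERIES * num_crops
```

For example, a zero-crop page uses the 256-token floor, while a six-crop page hits the 1120-token ceiling.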

Visual Tokenizer Optimization

  • Uses 80M parameter SAM-base architecture
  • Output dimension compressed from 1024 to 896
  • Combined with 16x token compression ratio
  • Significantly reduces global attention computation overhead
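A back-of-the-envelope calculation shows why the compression matters, using token-pair count times feature dimension as a crude proxy for global attention cost. The 4096-token pre-compression baseline is an assumed figure for illustration only:

```python
def attention_cost(num_tokens: int, dim: int) -> int:
    """Rough global self-attention cost proxy: token-pair count times
    feature dimension (ignoring heads, constants, and FFN cost)."""
    return num_tokens * num_tokens * dim

# Figures from the article: 16x token compression, dim 1024 -> 896.
# 4096 is an assumed baseline token count for illustration.
baseline = attention_cost(4096, 1024)
compressed = attention_cost(4096 // 16, 896)
reduction = baseline / compressed  # ~16^2 * (1024/896) ≈ 293x
```

The quadratic term dominates: 16× fewer tokens alone gives a 256× reduction in attention pairs, and the 1024 → 896 dimension cut adds a further modest saving.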

Decoder Architecture

  • Retains the 3B-parameter MoE sparse architecture of the previous generation
  • Only about 500M parameters are activated per token
  • Balances performance with deployment cost

Performance

OmniDocBench v1.5

On this authoritative benchmark, which covers 1,355 document pages across 9 major categories, including magazines, papers, and whitepapers:

  • Overall Accuracy: 91.09% (record-breaking)
  • Improvement over Previous Generation: 3.73%
  • Reading Order Edit Distance: Reduced from 0.085 to 0.057

Production Environment Performance

  • Online Service Repetition Rate Reduction: 33% (6.25% → 4.17%)
  • PDF Production Data Repetition Rate Reduction: 22% (3.69% → 2.88%)

Comparison with Gemini-3 Pro

In document parsing edit distance:

  • DeepSeek-OCR-2: 0.100
  • Gemini-3 Pro: 0.115

Reading order accuracy improved by over 34%.

Training Strategy

Data Distribution Optimization

  • OCR Data Proportion: 80%
  • Text/Formula/Table Sampling Ratio: 3:1:1
  • Merges semantically similar labels like “captions/titles”
  • Significantly improves generalization for real-world scenarios like academic PDFs, financial reports, and tender documents
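The stated data mix can be turned into concrete per-category sampling weights. The 3:1:1 split is applied within the 80% OCR share here; treating the remaining 20% as a single "other" bucket is an assumption, since the article does not break it down:

```python
def sampling_weights() -> dict:
    """Data mix described above: 80% OCR data overall, split 3:1:1
    across text / formula / table within the OCR share."""
    ocr_share = 0.80
    text, formula, table = 3, 1, 1
    total = text + formula + table
    return {
        "ocr/text": ocr_share * text / total,      # 0.48
        "ocr/formula": ocr_share * formula / total,  # 0.16
        "ocr/table": ocr_share * table / total,      # 0.16
        "other": 1.0 - ocr_share,                    # 0.20 (assumed bucket)
    }
```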

Chinese Document Optimization

The training strategy was tuned to the characteristics of Chinese documents, and the model performs particularly well on complex Chinese layouts.

Application Scenarios

DeepSeek-OCR-2 is particularly suitable for:

Academic Document Processing

  • Paper PDF to Markdown conversion
  • Complex formula recognition
  • Multi-column layout understanding
  • Reference extraction

Business Document Analysis

  • Financial statement parsing
  • Contract text extraction
  • Tender document processing
  • Invoice recognition

Technical Documentation Conversion

  • Technical manual digitization
  • API documentation extraction
  • Code comment recognition

Multilingual Documents

  • Supports 100+ languages
  • Mixed language document processing
  • Maintains original format structure

Technical Significance

Toward Unified Multimodal Encoder

The DeepSeek team believes this provides a promising path toward a unified multimodal encoder. In the future, a single encoder might achieve feature extraction and compression for images, audio, and text within the same parameter space by configuring modality-specific learnable queries.

New Paradigm for Visual Encoding

If DeepSeek-OCR 1 made the industry first realize that “visual compression” might be a seriously underestimated technical route, then DeepSeek-OCR-2 clearly decided to take this path more aggressively.

DeepEncoder V2 no longer views visual encoding as a static, fixed-strategy scanning process, but introduces a semantically-driven dynamic encoding mechanism. The model begins judging which regions are more likely to carry key information during the encoding stage and adjusts visual token allocation and expression accordingly.

In other words, visual encoding is no longer just “preprocessing” but has already entered the “understanding stage” in advance.

Open Source and Availability

DeepSeek-OCR-2 is fully open-source, providing:

  • Model weights
  • Complete code
  • Technical report


Community Support

Community developers have already provided ComfyUI integration for DeepSeek-OCR-2. Although currently at version V0.0.1 (beta), it offers a convenient way for ComfyUI users to try the model.