Follow me on XComfyUI WikiCommunity-maintained, non-official ComfyUI Wiki. Official ComfyUI documentation is separate from this site.
Back to News

NVIDIA Releases LocateAnything-3B - Open-Source Vision-Language Grounding Model with Parallel Box Decoding

news

NVIDIA open-sources LocateAnything-3B, a vision-language grounding model featuring Parallel Box Decoding (PBD) for fast and precise object localization, supporting object detection, GUI element grounding, OCR localization, and point-based grounding across diverse domains

On June 29, 2026, NVIDIA officially released LocateAnything-3B, an open-source vision-language grounding model that enables fast and high-quality visual localization from natural language instructions. The model introduces Parallel Box Decoding (PBD), a novel decoding paradigm that predicts complete bounding box coordinates in a single parallel step rather than autoregressive token-by-token decoding, achieving up to 2.5× higher throughput compared to prior approaches.

LocateAnything teaser LocateAnything enables precise object localization across diverse domains including natural scenes, robotics, GUI interaction, and document understanding.

Model Overview

LocateAnything is a generalist vision-language grounding model developed as part of NVIDIA's Eagle VLM model family. It supports a wide range of localization tasks:

  • Referring Expression Grounding: Locate objects described by natural language
  • Open-Set Object Detection: Detect common and long-tail object categories
  • GUI Element Grounding: Localize UI elements for agentic systems
  • Document Layout Grounding: OCR and text localization
  • Point-Based Localization: Fine-grained spatial reasoning via pointing

The model has been integrated into NVIDIA's Nemotron and Cosmos product lines, powering computer use and visual grounding features.

Core Innovation: Parallel Box Decoding (PBD)

Traditional visual grounding models generate bounding box coordinates autoregressively, token by token. LocateAnything introduces Parallel Box Decoding:

  • Predicts complete bounding boxes (x1, y1, x2, y2) and points in parallel structured units
  • Uses a block-wise multi-token prediction framework
  • Achieves 2.5× higher throughput without sacrificing geometric consistency
  • Supports three inference modes:
    • Fast Mode: Parallel decoding for maximum speed
    • Slow Mode: Autoregressive decoding for maximum accuracy
    • Hybrid Mode: Default; parallel decoding with fallback to autoregressive for format irregularities

Technical Architecture

ComponentDetails
ArchitectureTransformer-based VLM
Vision EncoderMoonViT (native resolution, up to 2.5K)
Language ModelQwen2.5-3B-Instruct
Multimodal ProjectorMLP projector
Total Parameters3B
Max Image Resolution2.5K (production), up to 4K with batch inference
Max Sequence Length25,600 tokens (training), 8,192 generation tokens (inference)
Output FormatBlock-based: Semantic, Box, Negative, and End blocks

Training Data

  • 12M unique images, 138M+ queries, 785M bounding boxes
  • Multi-domain: natural scenes, robotics, driving, GUI, documents
  • Hybrid data sources: human-curated, open-source, model-assisted synthetic annotations

Performance

LocateAnything demonstrates strong performance across multiple grounding benchmarks including COCO/LVIS for open-set detection, ScreenSpot-Pro for GUI grounding, and various document layout understanding benchmarks.

Inference Efficiency

Using the la_flash attention backend with batch hybrid inference:

BackendTime (4K probe)Peak Memory
SDPA (dense masks)8.26s35.12 GB
la_flash (FlashAttention)8.03s11.71 GB

Open Source and Availability

LocateAnything-3B is released under the NVIDIA License for non-commercial research and development use:

Hardware Requirements

Optimized for NVIDIA GPUs (Ampere, Blackwell, Hopper, Lovelace) with BF16 precision and KV cache. Batch inference via la_flash backend reduces peak memory from 35GB to ~12GB on A100.

NVIDIA Releases LocateAnything-3B - Open-Source Vision-Language Grounding Model with Parallel Box Decoding | ComfyUI Wiki