NVIDIA Releases LocateAnything-3B - Open-Source Vision-Language Grounding Model with Parallel Box Decoding

On June 29, 2026, NVIDIA released LocateAnything-3B, an open-source vision-language grounding model for fast, high-quality visual localization from natural language instructions. The model introduces Parallel Box Decoding (PBD), a decoding paradigm that predicts all coordinates of a bounding box in a single parallel step instead of generating them autoregressively, token by token. NVIDIA reports up to 2.5× higher throughput than prior approaches.

LocateAnything enables precise object localization across diverse domains including natural scenes, robotics, GUI interaction, and document understanding.

Model Overview

LocateAnything is a generalist vision-language grounding model developed as part of NVIDIA's Eagle VLM family. It supports a wide range of localization tasks:

Referring Expression Grounding: Locate objects described by natural language
Open-Set Object Detection: Detect common and long-tail object categories
GUI Element Grounding: Localize UI elements for agentic systems
Document Layout Grounding: OCR and text localization
Point-Based Localization: Fine-grained spatial reasoning via pointing

The model has been integrated into NVIDIA's Nemotron and Cosmos product lines, powering computer use and visual grounding features.

Core Innovation: Parallel Box Decoding (PBD)

Traditional visual grounding models generate bounding box coordinates autoregressively, token by token. LocateAnything introduces Parallel Box Decoding:

Predicts complete bounding boxes (x1, y1, x2, y2) and points in parallel structured units
Uses a block-wise multi-token prediction framework
Achieves up to 2.5× higher throughput without sacrificing geometric consistency
Supports three inference modes:
- Fast Mode: Parallel decoding for maximum speed
- Slow Mode: Autoregressive decoding for maximum accuracy
- Hybrid Mode (default): Parallel decoding, with fallback to autoregressive decoding when the output format is irregular
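
The hybrid mode can be pictured as a try-fast-then-fall-back wrapper. What follows is a minimal sketch under stated assumptions, not the released implementation: the `<box>x1,y1,x2,y2</box>` text format and the names `parse_boxes`, `decode_hybrid`, `fast_decode`, and `slow_decode` are hypothetical stand-ins chosen for illustration, and the real output format may differ.

```python
import re
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, ...]  # (x1, y1, x2, y2)

# Hypothetical text format for one box: "<box>x1,y1,x2,y2</box>".
# This is an assumption for illustration, not the model's documented format.
_BOX_RE = re.compile(r"<box>(\d+),(\d+),(\d+),(\d+)</box>")


def parse_boxes(text: str) -> Optional[List[Box]]:
    """Return the boxes found in `text`, or None if the output is irregular."""
    boxes = [tuple(int(v) for v in m.groups()) for m in _BOX_RE.finditer(text)]
    # Anything left over after removing well-formed boxes counts as irregular.
    if _BOX_RE.sub("", text).strip():
        return None
    # Basic geometric check: top-left corner must not exceed bottom-right.
    if any(x1 > x2 or y1 > y2 for x1, y1, x2, y2 in boxes):
        return None
    return boxes


def decode_hybrid(
    prompt: str,
    fast_decode: Callable[[str], str],  # stand-in for parallel decoding
    slow_decode: Callable[[str], str],  # stand-in for autoregressive decoding
) -> Tuple[List[Box], str]:
    """Try the fast path first; fall back to the slow path on irregular output."""
    boxes = parse_boxes(fast_decode(prompt))
    if boxes is not None:
        return boxes, "fast"
    boxes = parse_boxes(slow_decode(prompt))
    return (boxes or []), "slow"
```

In this sketch the slow path runs only when the fast output fails validation, so well-formed outputs pay no extra cost.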

Technical Architecture

Component	Details
Architecture	Transformer-based VLM
Vision Encoder	MoonViT (native resolution, up to 2.5K)
Language Model	Qwen2.5-3B-Instruct
Multimodal Projector	MLP projector
Total Parameters	3B
Max Image Resolution	2.5K (production), up to 4K with batch inference
Max Sequence Length	25,600 tokens (training), 8,192 generation tokens (inference)
Output Format	Block-based: Semantic, Box, Negative, and End blocks

Training Data

12M unique images, 138M+ queries, 785M bounding boxes
Multi-domain: natural scenes, robotics, driving, GUI, documents
Hybrid data sources: human-curated, open-source, model-assisted synthetic annotations
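
For a rough sense of annotation density, the headline counts above imply the per-image and per-query averages computed below. This is back-of-the-envelope arithmetic only: "138M+" is treated as exactly 138M, so the query-based ratios are approximate, and the variable names are illustrative.

```python
# Headline dataset counts from the text ("138M+" is taken as a lower bound).
images = 12_000_000
queries = 138_000_000
boxes = 785_000_000

queries_per_image = queries / images
boxes_per_image = boxes / images
boxes_per_query = boxes / queries

print(f"queries per image: {queries_per_image:.1f}")  # 11.5
print(f"boxes per image:   {boxes_per_image:.1f}")    # 65.4
print(f"boxes per query:   {boxes_per_query:.1f}")    # 5.7
```

Because the query count is a lower bound, the true queries-per-image figure is at least 11.5 and the boxes-per-query figure is at most about 5.7.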

Performance

LocateAnything demonstrates strong performance across multiple grounding benchmarks including COCO/LVIS for open-set detection, ScreenSpot-Pro for GUI grounding, and various document layout understanding benchmarks.

Inference Efficiency

The figures below compare the la_flash attention backend with SDPA under batch hybrid inference on a 4K probe:

Backend	Time (4K probe)	Peak Memory
SDPA (dense masks)	8.26s	35.12 GB
la_flash (FlashAttention)	8.03s	11.71 GB
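
The relative savings implied by the table can be worked out directly. The short calculation below uses only the four figures reported above; the helper name `relative_reduction` is illustrative.

```python
# Figures copied from the table above.
sdpa_time_s, sdpa_mem_gb = 8.26, 35.12
flash_time_s, flash_mem_gb = 8.03, 11.71


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


mem_saving = relative_reduction(sdpa_mem_gb, flash_mem_gb)
time_saving = relative_reduction(sdpa_time_s, flash_time_s)
mem_ratio = sdpa_mem_gb / flash_mem_gb

print(f"peak memory reduction: {mem_saving:.1%}")   # 66.7%
print(f"time reduction:        {time_saving:.1%}")  # 2.8%
print(f"memory ratio:          {mem_ratio:.2f}x")   # 3.00x
```

The benefit of la_flash in this measurement is therefore almost entirely in memory: peak usage drops to about a third, while run time improves by under 3%.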

Open Source and Availability

LocateAnything-3B is released under the NVIDIA License for non-commercial research and development use:

HuggingFace Model: nvidia/LocateAnything-3B
GitHub Code: NVlabs/Eagle/Embodied
Online Demo: HuggingFace Spaces
Technical Report: arXiv:2605.27365
Project Page: NVIDIA Research

Hardware Requirements

Optimized for NVIDIA GPUs (Ampere, Hopper, Lovelace, and Blackwell architectures) with BF16 precision and KV caching. Batch inference with the la_flash backend reduces peak memory from about 35 GB to about 12 GB on an A100.

GitHub Repository: https://github.com/NVlabs/Eagle/tree/main/Embodied
HuggingFace Model: https://huggingface.co/nvidia/LocateAnything-3B
Online Demo: https://huggingface.co/spaces/nvidia/LocateAnything
Technical Report: https://arxiv.org/abs/2605.27365