Despite rapid progress, pretrained vision–language models still struggle when an answer depends on tiny visual details or on combining clues spread across multiple regions, as in document understanding and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses the model's own uncertainty as supervision. Specifically, we compute the entropy of the model's next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains in detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.
Our method operates entirely at test time and requires no additional training or external modules. We first backpropagate the entropy of the model’s next-token distribution to the visual token embeddings, producing an entropy-gradient map that highlights which image regions drive the model’s uncertainty. Because a single saliency map often collapses onto one dominant region and misses spatially disjoint evidence, we extract and rank multiple coherent regions of interest from the gradient map. Finally, an iterative zoom-and-reground loop refines the selected crops, regulated by a spatial-entropy stopping criterion that prevents over-cropping.
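The entropy-gradient step can be sketched in a few lines. The following is a minimal illustration, assuming a `model` callable that maps visual token embeddings `vis_emb` of shape `(N, D)` to next-token logits; the names and interface are our own simplifications, not a specific VLM API:

```python
import torch
import torch.nn.functional as F

def entropy_gradient_map(model, vis_emb, grid_hw):
    """Backpropagate next-token entropy to the visual token embeddings
    and return a per-token relevance map on the patch grid."""
    vis_emb = vis_emb.detach().requires_grad_(True)
    logits = model(vis_emb)                 # next-token logits, shape (vocab,)
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum()  # H(p) = -sum_v p_v log p_v
    entropy.backward()
    # Relevance of each visual token = magnitude of the entropy gradient
    # with respect to that token's embedding.
    rel = vis_emb.grad.norm(dim=-1)         # shape (N,)
    return rel.reshape(grid_hw)             # spatial map on the patch grid
```

In a real VLM, `model` would run the full forward pass with the text prompt held fixed, and `grid_hw` would be the patch grid of the vision encoder; coherent regions are then extracted from the resulting map.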
| VLM | Method | TextVQA | V* | DocVQA | POPE | InfoQA | GQA | RWQA |
|---|---|---|---|---|---|---|---|---|
| | | Fine-grained image understanding | | | | | General QA | |
| LLaVA 1.5 | Base model | 46.22 | 46.07 | 22.32 | 86.55 | 22.24 | 61.98 | 48.76 |
| | + ViCrop | 55.17 | 47.64 | 19.63 | 87.25 | 23.26 | 60.97 | 47.97 |
| | + Ours | 52.78 | 56.02 | 33.70 | 87.56 | 22.33 | 61.15 | 48.24 |
| | vs. base | +6.56 | +9.95 | +11.38 | +1.01 | +0.09 | −0.76 | −0.52 |
| LLaVA 1.6 | Base model | 65.80 | 57.59 | 64.94 | 87.80 | 24.66 | 64.14 | 58.30 |
| | + ViCrop | 68.65 | 61.78 | 51.42 | 88.18 | 28.18 | 64.54 | 56.99 |
| | + Ours | 67.96 | 73.30 | 65.07 | 89.31 | 33.93 | 63.97 | 60.39 |
| | vs. base | +2.16 | +15.71 | +0.13 | +1.51 | +9.27 | −0.17 | +2.09 |
| InternVL 3.5 | Base model | 59.47 | 47.64 | 58.73 | 84.02 | 41.22 | 58.04 | 61.83 |
| | + Ours | 74.29 | 67.53 | 79.54 | 86.70 | 53.73 | 59.01 | 64.71 |
| | vs. base | +14.82 | +19.89 | +20.81 | +2.69 | +12.51 | +0.97 | +2.88 |
| Qwen 2.5 VL | Base model | 80.75 | 73.30 | 90.81 | 87.00 | 69.02 | 61.01 | 67.84 |
| | + Ours | 81.45 | 86.91 | 91.16 | 88.47 | 73.43 | 59.49 | 66.93 |
| | vs. base | +0.70 | +13.61 | +0.35 | +1.47 | +4.41 | −1.52 | −0.91 |
Table 1. Quantitative results on standard reasoning benchmarks. Our training-free method consistently improves fine-grained image understanding across four VLM architectures.
We introduced a training-free, model-intrinsic visual grounding framework for pretrained VLMs by backpropagating the entropy of the next-token distribution to visual embeddings. Using uncertainty gradients as a decision-relevant signal and converting them into ranked regions of interest, our method retrieves evidence from spatially disjoint cues without auxiliary detectors or heuristic attention processing. To address fixed-resolution limitations, we further propose an iterative refinement loop guided by a spatial-entropy stopping criterion, enabling the model to acquire finer-grained evidence and recover overlooked regions at inference time. Extensive experiments across standard reasoning benchmarks and four VLM architectures show consistent improvements on evidence-critical tasks—particularly in high-resolution and document-centric settings—while producing more focused, query-conditioned localizations.
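As a concrete illustration of the refinement loop restated above, the following sketch pairs a greedy zoom step with a spatial-entropy stopping rule. Here `relevance_fn` stands in for the entropy-gradient map, and all names, thresholds, and the half-peak cropping heuristic are our own illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def spatial_entropy(rel):
    """Entropy of the normalized relevance map; low values mean the
    map is concentrated on a small region."""
    p = rel.ravel() / (rel.sum() + 1e-8)
    return float(-(p * np.log(p + 1e-8)).sum())

def zoom_and_reground(image, relevance_fn, max_steps=3, tau=0.1):
    """Repeatedly crop to the most relevant region, stopping once the
    relevance map stops becoming more concentrated (over-refinement guard)."""
    crop = image
    prev_h = None
    for _ in range(max_steps):
        rel = relevance_fn(crop)   # 2D relevance map on the patch grid
        h = spatial_entropy(rel)
        # Stop if uncertainty mass is no longer concentrating.
        if prev_h is not None and prev_h - h < tau:
            break
        prev_h = h
        # Crop to the bounding box of high-relevance cells (here, at least
        # half the peak value), scaled up to pixel coordinates.
        ys, xs = np.where(rel >= rel.max() * 0.5)
        sy = crop.shape[0] // rel.shape[0]
        sx = crop.shape[1] // rel.shape[1]
        crop = crop[ys.min() * sy:(ys.max() + 1) * sy,
                    xs.min() * sx:(xs.max() + 1) * sx]
    return crop
```

The stopping rule trades off detail against context: each zoom sharpens the evidence, but once the spatial entropy plateaus, further cropping would only discard surrounding context without reducing uncertainty.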