Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

Marcel Gröpl*,1, Jaewoo Jung*,3, Seungryong Kim3, Marc Pollefeys1, Sunghwan Hong1,2
* Equal Contribution
1ETH Zurich  •  2ETH AI Center  •  3KAIST AI
arXiv 2026
Teaser
Figure 1. As shown in (a) and (b), existing VLMs struggle to answer questions whose visual evidence is fine-grained or spread across spatially disjoint regions. We propose a training-free approach that applies query-based visual grounding to discover the relevant regions and supplies them to the model as additional image crops, improving performance in both challenging scenarios. (c) Model performance comparison across six benchmarks.

Abstract

Despite rapid progress, pretrained vision–language models still struggle when answers depend on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model’s next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.

Method

Our method operates entirely at test time and requires no additional training or external modules. We first backpropagate the entropy of the model’s next-token distribution to the visual token embeddings, producing an entropy-gradient map that highlights which image regions drive the model’s uncertainty. Because a single saliency map often collapses onto one dominant region and misses spatially disjoint evidence, we extract and rank multiple coherent regions of interest from the gradient map. Finally, an iterative zoom-and-reground loop refines the selected crops, regulated by a spatial-entropy stopping criterion that prevents over-cropping.
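To make the first step concrete, the following is a minimal NumPy sketch of an entropy-gradient relevance map on a toy model. The frozen attention pooling, the random projection `W`, and all shapes are illustrative assumptions standing in for a real VLM, where the gradient would be taken through the full network by autograd; only the entropy definition and its backpropagated gradient reflect the idea described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a VLM: 64 visual tokens on an 8x8 patch grid (assumed sizes).
num_tokens, dim, vocab = 64, 32, 100
tokens = rng.standard_normal((num_tokens, dim))
W = rng.standard_normal((vocab, dim)) / np.sqrt(dim)  # toy "LM head"

# Frozen "attention" pooling: token i contributes to the context with weight a_i.
scores = tokens @ rng.standard_normal(dim) / np.sqrt(dim)
a = np.exp(scores - scores.max()); a /= a.sum()
context = a @ tokens

# Next-token distribution and its entropy H = -sum_j p_j log p_j.
logits = W @ context
p = np.exp(logits - logits.max()); p /= p.sum()
H = -(p * np.log(p)).sum()

# Manual backprop of H (pooling weights a treated as constants for simplicity):
# dH/dlogits_j = -p_j (log p_j + H);  dH/dtoken_i = a_i * W^T (dH/dlogits).
d_logits = -p * (np.log(p) + H)
d_context = W.T @ d_logits
grad = a[:, None] * d_context[None, :]      # per-token entropy gradient

# Relevance map: per-token gradient magnitude, reshaped to the patch grid.
relevance = np.linalg.norm(grad, axis=-1).reshape(8, 8)
relevance /= relevance.sum()                # normalise into a spatial distribution
```

In a real VLM one would call `entropy.backward()` and read the gradients off the visual token embeddings; the closed-form gradient here only exists because the toy model is a single softmax layer.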

Method overview
Figure 2. Overview of our approach. (a) An entropy-gradient map identifies the initial region of interest. (b) Iterative refinement updates the crop until spatial entropy stops decreasing. (c) The final forward pass produces the answer from the refined crop.
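The region extraction and the spatial-entropy criterion can be sketched as follows. This is an illustrative implementation, not the paper's code: the `thresh_ratio` parameter, the choice of 4-connectivity, and the toy relevance map are all assumptions.

```python
import numpy as np
from collections import deque

def spatial_entropy(m):
    """Entropy of a non-negative map treated as a distribution over grid cells."""
    p = m.flatten() / m.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def ranked_regions(m, thresh_ratio=0.5):
    """Threshold the map, group cells into 4-connected regions, rank by mass."""
    mask = m >= thresh_ratio * m.max()
    seen = np.zeros_like(mask, dtype=bool)
    regions = []
    H, W = mask.shape
    for sy in range(H):
        for sx in range(W):
            if mask[sy, sx] and not seen[sy, sx]:
                # BFS flood fill over one connected component.
                cells, q = [], deque([(sy, sx)])
                seen[sy, sx] = True
                while q:
                    y, x = q.popleft()
                    cells.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                ys, xs = zip(*cells)
                regions.append({"bbox": (min(ys), min(xs), max(ys), max(xs)),
                                "mass": float(sum(m[c] for c in cells))})
    # Highest-mass region first, so disjoint evidence is kept but still ranked.
    return sorted(regions, key=lambda r: r["mass"], reverse=True)

# Toy relevance map with two spatially disjoint evidence regions of unequal mass.
m = np.zeros((8, 8))
m[1:3, 1:3] = 1.0   # stronger region
m[5:7, 5:7] = 0.6   # weaker, disjoint region
regs = ranked_regions(m)
```

In the full zoom-and-reground loop, the top-ranked bounding boxes would be mapped back to pixel coordinates, cropped, and re-encoded, repeating while `spatial_entropy` of successive relevance maps keeps decreasing; once it stops decreasing, the map is judged sufficiently concentrated and refinement halts.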

Results

Columns TextVQA through InfoQA measure fine-grained image understanding; GQA and RWQA measure general QA.

| VLM | Method | TextVQA | V* | DocVQA | POPE | InfoQA | GQA | RWQA |
|---|---|---|---|---|---|---|---|---|
| LLaVA 1.5 | Base model | 46.22 | 46.07 | 22.32 | 86.55 | 22.24 | 61.98 | 48.76 |
| | + ViCrop | 55.17 | 47.64 | 19.63 | 87.25 | 23.26 | 60.97 | 47.97 |
| | + Ours | 52.78 | 56.02 | 33.70 | 87.56 | 22.33 | 61.15 | 48.24 |
| | vs. base | +6.56 | +9.95 | +11.38 | +1.01 | +0.09 | −0.76 | −0.52 |
| LLaVA 1.6 | Base model | 65.80 | 57.59 | 64.94 | 87.80 | 24.66 | 64.14 | 58.30 |
| | + ViCrop | 68.65 | 61.78 | 51.42 | 88.18 | 28.18 | 64.54 | 56.99 |
| | + Ours | 67.96 | 73.30 | 65.07 | 89.31 | 33.93 | 63.97 | 60.39 |
| | vs. base | +2.16 | +15.71 | +0.13 | +1.51 | +9.27 | −0.17 | +2.09 |
| InternVL 3.5 | Base model | 59.47 | 47.64 | 58.73 | 84.02 | 41.22 | 58.04 | 61.83 |
| | + Ours | 74.29 | 67.53 | 79.54 | 86.70 | 53.73 | 59.01 | 64.71 |
| | vs. base | +14.82 | +19.89 | +20.81 | +2.69 | +12.51 | +0.97 | +2.88 |
| Qwen 2.5 VL | Base model | 80.75 | 73.30 | 90.81 | 87.00 | 69.02 | 61.01 | 67.84 |
| | + Ours | 81.45 | 86.91 | 91.16 | 88.47 | 73.43 | 59.49 | 66.93 |
| | vs. base | +0.70 | +13.61 | +0.35 | +1.47 | +4.41 | −1.52 | −0.91 |

Table 1. Quantitative results on standard reasoning benchmarks. Our training-free method consistently improves fine-grained image understanding across four VLM architectures.

Conclusion

We introduced a training-free, model-intrinsic visual grounding framework for pretrained VLMs by backpropagating the entropy of the next-token distribution to visual embeddings. Using uncertainty gradients as a decision-relevant signal and converting them into ranked regions of interest, our method retrieves evidence from spatially disjoint cues without auxiliary detectors or heuristic attention processing. To address fixed-resolution limitations, we further propose an iterative refinement loop guided by a spatial-entropy stopping criterion, enabling the model to acquire finer-grained evidence and recover overlooked regions at inference time. Extensive experiments across standard reasoning benchmarks and four VLM architectures show consistent improvements on evidence-critical tasks—particularly in high-resolution and document-centric settings—while producing more focused, query-conditioned localizations.

BibTeX

@article{entropygrounding2026,
  title   = {Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models},
  author  = {Gr{\"o}pl, Marcel and Jung, Jaewoo and Kim, Seungryong and Pollefeys, Marc and Hong, Sunghwan},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}