Despite rapid progress, pretrained vision–language models still struggle when an answer depends on tiny visual details or on combining clues spread across multiple regions, as in document understanding and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses the model's own uncertainty as supervision. Specifically, we compute the entropy of the model's next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains in detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.
Our method operates entirely at test time and requires no additional training or external modules. We first backpropagate the entropy of the model’s next-token distribution to the visual token embeddings, producing an entropy-gradient map that highlights which image regions drive the model’s uncertainty. Because a single saliency map often collapses onto one dominant region and misses spatially disjoint evidence, we extract and rank multiple coherent regions of interest from the gradient map. Finally, an iterative zoom-and-reground loop refines the selected crops, regulated by a spatial-entropy stopping criterion that prevents over-cropping.
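The entropy-gradient step can be sketched in a few lines. The following is a minimal illustration, assuming a `model` callable that maps visual token embeddings `vis_emb` of shape `(N, D)` to next-token logits; the names and interface are our own simplifications, not a specific VLM API:

```python
import torch
import torch.nn.functional as F

def entropy_gradient_map(model, vis_emb, grid_hw):
    """Backpropagate next-token entropy to the visual token embeddings
    and return a per-token relevance map on the patch grid."""
    vis_emb = vis_emb.detach().requires_grad_(True)
    logits = model(vis_emb)                 # next-token logits, shape (vocab,)
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum()  # H(p) = -sum_v p_v log p_v
    entropy.backward()
    # Relevance of each visual token = magnitude of the entropy gradient
    # with respect to that token's embedding.
    rel = vis_emb.grad.norm(dim=-1)         # shape (N,)
    return rel.reshape(grid_hw)             # spatial map on the patch grid
```

In a real VLM, `model` would run the full forward pass with the text prompt held fixed, and `grid_hw` would be the patch grid of the vision encoder; coherent regions are then extracted from the resulting map.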
| VLM | Method | TextVQA | V* | DocVQA | POPE | InfoQA | GQA | RWQA |
|---|---|---|---|---|---|---|---|---|
| | | Fine-grained image understanding | | | | | General QA | |
| LLaVA 1.5 | Base model | 46.22 | 46.07 | 22.32 | 86.55 | 22.24 | 61.98 | 48.76 |
| | + ViCrop | 55.17 | 47.64 | 19.63 | 87.25 | 23.26 | 60.97 | 47.97 |
| | + Ours | 52.78 | 56.02 | 33.70 | 87.56 | 22.33 | 61.15 | 48.24 |
| | vs. base | +6.56 | +9.95 | +11.38 | +1.01 | +0.09 | −0.76 | −0.52 |
| LLaVA 1.6 | Base model | 65.80 | 57.59 | 64.94 | 87.80 | 24.66 | 64.14 | 58.30 |
| | + ViCrop | 68.65 | 61.78 | 51.42 | 88.18 | 28.18 | 64.54 | 56.99 |
| | + Ours | 67.96 | 73.30 | 65.07 | 89.31 | 33.93 | 63.97 | 60.39 |
| | vs. base | +2.16 | +15.71 | +0.13 | +1.51 | +9.27 | −0.17 | +2.09 |
| InternVL 3.5 | Base model | 59.47 | 47.64 | 58.73 | 84.02 | 41.22 | 58.04 | 61.83 |
| | + Ours | 74.29 | 67.53 | 79.54 | 86.70 | 53.73 | 59.01 | 64.71 |
| | vs. base | +14.82 | +19.89 | +20.81 | +2.69 | +12.51 | +0.97 | +2.88 |
| Qwen 2.5 VL | Base model | 80.75 | 73.30 | 90.81 | 87.00 | 69.02 | 61.01 | 67.84 |
| | + Ours | 81.45 | 86.91 | 91.16 | 88.47 | 73.43 | 59.49 | 66.93 |
| | vs. base | +0.70 | +13.61 | +0.35 | +1.47 | +4.41 | −1.52 | −0.91 |
Table 1. Quantitative results on standard reasoning benchmarks. Our training-free method consistently improves fine-grained image understanding across four VLM architectures.
We introduced a training-free, model-intrinsic visual grounding framework for pretrained VLMs by backpropagating the entropy of the next-token distribution to visual embeddings. Using uncertainty gradients as a decision-relevant signal and converting them into ranked regions of interest, our method retrieves evidence from spatially disjoint cues without auxiliary detectors or heuristic attention processing. To address fixed-resolution limitations, we further propose an iterative refinement loop guided by a spatial-entropy stopping criterion, enabling the model to acquire finer-grained evidence and recover overlooked regions at inference time. Extensive experiments across standard reasoning benchmarks and four VLM architectures show consistent improvements on evidence-critical tasks—particularly in high-resolution and document-centric settings—while producing more focused, query-conditioned localizations.
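As a concrete illustration of the refinement loop restated above, the following sketch pairs a greedy zoom step with a spatial-entropy stopping rule. Here `relevance_fn` stands in for the entropy-gradient map, and all names, thresholds, and the half-peak cropping heuristic are our own illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def spatial_entropy(rel):
    """Entropy of the normalized relevance map; low values mean the
    map is concentrated on a small region."""
    p = rel.ravel() / (rel.sum() + 1e-8)
    return float(-(p * np.log(p + 1e-8)).sum())

def zoom_and_reground(image, relevance_fn, max_steps=3, tau=0.1):
    """Repeatedly crop to the most relevant region, stopping once the
    relevance map stops becoming more concentrated (over-refinement guard)."""
    crop = image
    prev_h = None
    for _ in range(max_steps):
        rel = relevance_fn(crop)   # 2D relevance map on the patch grid
        h = spatial_entropy(rel)
        # Stop if uncertainty mass is no longer concentrating.
        if prev_h is not None and prev_h - h < tau:
            break
        prev_h = h
        # Crop to the bounding box of high-relevance cells (here, at least
        # half the peak value), scaled up to pixel coordinates.
        ys, xs = np.where(rel >= rel.max() * 0.5)
        sy = crop.shape[0] // rel.shape[0]
        sx = crop.shape[1] // rel.shape[1]
        crop = crop[ys.min() * sy:(ys.max() + 1) * sy,
                    xs.min() * sx:(xs.max() + 1) * sx]
    return crop
```

The stopping rule trades off detail against context: each zoom sharpens the evidence, but once the spatial entropy plateaus, further cropping would only discard surrounding context without reducing uncertainty.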