Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models

Qing Zhang, Xuesong Li, Jing Zhang

Australian National University, CSIRO

CVPR 2026

Core Idea

Affordance = Geometry x Interaction.

Geometry identifies the object parts that can support an action. Interaction indicates which part a given action engages. We probe both in vision foundation models (VFMs) and compose them without affordance training.

Mechanism pipeline: DINOv3 geometry + Flux Kontext verb attention + NSS selection.

Problem

What makes an affordance region meaningful?

Fully supervised methods learn from labeled regions. Weakly supervised methods learn action-object correlations. Both can predict heatmaps, but neither alone explains why a region is actionable.

Contrast between fully supervised and weakly supervised affordance learning paradigms

Decomposition

The paper splits affordance into two observable primitives.

Geometry

Geometry: where action can be supported.

Object parts, shape, and spatial structure define plausible support regions.

Interaction

Interaction: how action engages an object.

Verb-conditioned priors identify which part matters for a specific action.

Composition

Composition: whether the two primitives can produce affordance.

Training-free fusion tests whether geometry and interaction are actually composable.

Geometry Evidence

Geometry-aware VFMs expose action-relevant object parts.

DINO-style features reveal coherent part-level structure, and a model's geometric awareness correlates with its affordance performance. Geometry supplies the spatial support but does not, by itself, specify the action.
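One way to see what "part-level structure" could mean in practice is to cluster per-patch features into a few groups and inspect the resulting regions. The sketch below is only an illustration: it assumes patch features are already available as an (H, W, D) array, and it uses plain k-means, which is a stand-in choice and not necessarily how the paper extracts components. The function name `kmeans_parts` and all parameters are hypothetical.

```python
import numpy as np

def kmeans_parts(features, k, iters=20, seed=0):
    """Group patch features of shape (H, W, D) into k clusters.

    Returns an (H, W) integer label map. Plain k-means is an assumption
    made for illustration, not the paper's stated procedure.
    """
    h, w, d = features.shape
    x = features.reshape(-1, d).astype(np.float64)
    # L2-normalise so Euclidean distance behaves like cosine distance.
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    rng = np.random.default_rng(seed)
    # Initialise centres from k distinct patches.
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every patch to every centre.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster empties
                centers[j] = members.mean(0)
    return labels.reshape(h, w)
```

Each label in the returned map can then be treated as one candidate geometric component.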

Comparison of geometric representations across DINOv3, CLIP, SAM, and SD

Correlation between depth and surface-normal geometric awareness and affordance mIoU. Geometric awareness tracks affordance segmentation performance.

Interaction Evidence

Generative VFMs encode verb-conditioned interaction priors.

Flux Kontext attention localizes plausible contact regions. The verb changes spatial attention, not just object semantics.
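To make "verb-conditioned attention" concrete, the sketch below turns a generic attention tensor into a spatial heatmap for the verb tokens. It is a hedged illustration only: the tensor layout (heads, image tokens, text tokens), the function name `verb_attention_map`, and the head-averaging step are all assumptions, and the page does not describe how attention is actually read out of Flux Kontext.

```python
import numpy as np

def verb_attention_map(attn, verb_token_ids, grid_hw):
    """Aggregate attention paid by image tokens to the verb tokens.

    attn: array of shape (heads, n_image_tokens, n_text_tokens); this
          layout is a hypothetical convention for the sketch.
    verb_token_ids: indices of the text tokens that spell the verb.
    grid_hw: (H, W) of the image-token grid, with H * W == n_image_tokens.
    """
    h, w = grid_hw
    attn = np.asarray(attn, dtype=np.float64)
    if attn.shape[1] != h * w:
        raise ValueError("grid size does not match number of image tokens")
    per_token = attn.mean(axis=0)                      # average over heads
    verb = per_token[:, verb_token_ids].mean(axis=1)   # average over verb tokens
    heat = verb.reshape(h, w)
    # Min-max normalise to [0, 1]; a constant map is returned as all zeros.
    heat = heat - heat.min()
    peak = heat.max()
    return heat / peak if peak > 0 else heat
```

Running the same image with two different verbs and comparing the two maps is one simple way to check whether the verb, and not only the object noun, moves the attention.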

Flux Kontext verb attention visualizations for multiple objects and actions

Comparison of verb-conditioned localization between Flux Kontext and other models. Flux Kontext shows the strongest verb responsiveness in this comparison.

Mechanism

DINOv3 geometry and Flux Kontext interaction can be composed without any training.

DINOv3 extracts part-level geometric components.
Flux Kontext provides object attention and verb attention.
NSS measures which geometric component aligns with the verb prior.
The selected component becomes a verb-specific affordance mask.
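The selection step above can be sketched as follows. This is a minimal illustration under stated assumptions: the geometric components arrive as an integer label map, the verb prior arrives as a heatmap of the same size, and the NSS score of a component is taken to be the mean z-scored verb attention inside it. The function name `select_component_by_nss` is hypothetical, and the object-attention step that restricts the region of interest is omitted for brevity.

```python
import numpy as np

def select_component_by_nss(labels, verb_attn, eps=1e-8):
    """Pick the geometric component that best aligns with the verb prior.

    labels:    (H, W) integer map of geometric components.
    verb_attn: (H, W) verb-conditioned attention heatmap.
    Returns (mask, best_label, scores), where mask is a boolean (H, W) array.
    """
    labels = np.asarray(labels)
    verb_attn = np.asarray(verb_attn, dtype=np.float64)
    # Z-score the heatmap so scores are comparable across images.
    z = (verb_attn - verb_attn.mean()) / (verb_attn.std() + eps)
    # NSS-style score: mean normalised attention inside each component.
    scores = {int(c): float(z[labels == c].mean()) for c in np.unique(labels)}
    best = max(scores, key=scores.get)
    return labels == best, best, scores
```

Because the score is a mean and not a sum, a small component that sits squarely on the attention peak can beat a large component that only overlaps it partially.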

Training-free mechanism pipeline for composing geometry and interaction cues

Results

Geometry sharpens interaction into affordance.

Interaction-only maps already carry affordance signal. Adding geometry makes the prediction sharper and more part-specific, improving KLD, SIM, and NSS without affordance training.

Qualitative affordance masks from geometry interaction fusion for different actions

KLD: 1.825 -> 1.493 (lower is better)

SIM: 0.271 -> 0.326 (higher is better)

NSS: 1.050 -> 1.090 (higher is better)
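For readers unfamiliar with the three metrics, the sketch below gives their commonly used saliency-benchmark definitions. It is offered as an assumption about the evaluation, since this page does not spell out the exact protocol (for example, how the ground-truth heatmap is thresholded for NSS). The helper names are hypothetical.

```python
import numpy as np

EPS = 1e-12

def _to_dist(m):
    """Normalise a non-negative map so that it sums to 1."""
    m = np.asarray(m, dtype=np.float64)
    return m / (m.sum() + EPS)

def kld(pred, gt):
    """KL divergence of the prediction from the ground truth (lower is better)."""
    p, g = _to_dist(pred), _to_dist(gt)
    return float((g * np.log(EPS + g / (p + EPS))).sum())

def sim(pred, gt):
    """Histogram intersection of the two distributions (higher is better)."""
    p, g = _to_dist(pred), _to_dist(gt)
    return float(np.minimum(p, g).sum())

def nss(pred, gt_mask):
    """Mean z-scored prediction at ground-truth locations (higher is better)."""
    pred = np.asarray(pred, dtype=np.float64)
    z = (pred - pred.mean()) / (pred.std() + EPS)
    return float(z[np.asarray(gt_mask, dtype=bool)].mean())
```

Under these definitions a perfect prediction gives a KLD of about 0 and a SIM of about 1, which is why the two metrics move in opposite directions in the numbers above.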

Boundary

Where the current mechanism fails.

This is not proof of embodied causal understanding. It shows composable visual primitives, with failure modes when generative edits or object attention become unreliable.

Failure case showing a generated duplicate object disrupting affordance localization

Generative object duplication

The editing model can generate a new object instance, causing the interaction prior to drift away from the original object.

ROI contamination

In cluttered scenes, broad object attention can contaminate the DINO feature region and corrupt the final selection.

BibTeX

Citation

@inproceedings{zhang2026probing,
  title={Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models},
  author={Zhang, Qing and Li, Xuesong and Zhang, Jing},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2526--2536},
  year={2026}
}