Lost in Translation? Vocabulary Alignment for Source-Free Adaptation in Open-Vocabulary Semantic Segmentation
VocAlign: A framework for source-free domain adaptation in open-vocabulary segmentation

BMVC 2025

Authors
Silvio Mazzucco1,2*, Carl Persson1*, Mattia Segu1,2,3, Pier Luigi Dovesi1, Federico Tombari3,4, Luc Van Gool5, Matteo Poggi1,6
(*joint first authors)
1The Good AI Lab
2ETH Zurich
3Google
4Technical University of Munich
5INSAIT, Sofia University
6University of Bologna
Vision-Language Models (VLMs) have recently opened the door to open-vocabulary semantic segmentation: the ability to label every pixel in an image, even with classes the model has never explicitly seen during training. This is powerful, but it comes with a challenge: when these models are deployed in new environments, their performance often drops sharply. Why? Because the "language" of categories used during training does not always align with the vocabulary of the new domain; think of a "wall" in Cityscapes vs. a "building wall" in COCO.
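In a nutshell, open-vocabulary segmentation labels each pixel by comparing its embedding with the text embeddings of arbitrary class names. The sketch below illustrates only that matching step; the `text_encoder` callable, the tensor shapes, and the function name are assumptions for illustration, not any particular model's API.

```python
import torch
import torch.nn.functional as F

def open_vocab_segment(pixel_feats, class_names, text_encoder):
    """Minimal sketch: assign each pixel the class whose text embedding
    is most similar (cosine similarity) to the pixel embedding.

    pixel_feats: (H, W, D) per-pixel embeddings from a vision encoder.
    text_encoder: callable mapping a list of class names to a (C, D) tensor.
    """
    H, W, D = pixel_feats.shape
    text_feats = F.normalize(text_encoder(class_names), dim=-1)   # (C, D)
    pix = F.normalize(pixel_feats.reshape(-1, D), dim=-1)         # (H*W, D)
    logits = pix @ text_feats.T                                   # cosine similarities
    return logits.argmax(dim=-1).reshape(H, W)                    # per-pixel class index
```

Because the class names are just text, the same model can segment categories it never saw as labels during training; but it also means that a mismatch between the class names used at training time and at deployment time directly hurts predictions.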
Traditionally, domain adaptation methods fix this by accessing the original training data. But for VLMs trained on huge, proprietary datasets, that's not possible. This is where our work comes in.
Our contribution: VocAlign
We introduce VocAlign, the first framework for source-free domain adaptation in open-vocabulary segmentation. Instead of relying on inaccessible source data, we adapt the model directly to its new environment through three key ideas:
- Vocabulary Alignment: enrich the model's vocabulary with additional concepts and synonyms, aligning the teacher's "language" with the target domain. This improves pseudo-labels and helps recover classes that would otherwise be misinterpreted (see the sketch after this list).
- Efficiency via LoRA: adapt large VLMs using lightweight Low-Rank Adaptation (LoRA) modules, keeping compute overhead low while preserving the broad knowledge from pretraining (a minimal LoRA sketch also follows below).
- Top-K Class Selection: focus on only the most relevant classes per image rather than all classes at once. This reduces memory needs and surprisingly improves adaptation by reinforcing the strongest signals.
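To make the first and third ideas concrete, here is a minimal pseudo-labeling sketch, assuming a CLIP-style text encoder: each target class is expanded with synonyms, the best synonym score is kept per class, and only the top-K classes active in the image are retained. The names (`SYNONYMS`, `pseudo_label`, `text_encoder`) and the synonym entries are illustrative assumptions, not the paper's actual vocabulary or implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "wall": ["wall", "building wall", "brick wall"],
    "terrain": ["terrain", "grass", "vegetation-covered ground"],
}

def pseudo_label(pixel_feats, class_names, text_encoder, top_k=5):
    """pixel_feats: (N, D) per-pixel embeddings from the frozen teacher.
    Expand each target class with synonyms, score every prompt against the
    pixels, collapse synonym scores back to their class, then keep only the
    top-K classes most active in this image."""
    prompts, owner = [], []
    for ci, name in enumerate(class_names):
        for p in SYNONYMS.get(name, [name]):
            prompts.append(p)
            owner.append(ci)
    owner = torch.tensor(owner)

    text_feats = F.normalize(text_encoder(prompts), dim=-1)           # (P, D)
    sims = F.normalize(pixel_feats, dim=-1) @ text_feats.T            # (N, P)

    # best synonym score per original class
    class_scores = torch.stack(
        [sims[:, owner == ci].max(dim=1).values for ci in range(len(class_names))],
        dim=1)                                                         # (N, C)

    # Top-K class selection: restrict pseudo-labels to the K most active classes
    k = min(top_k, len(class_names))
    top_classes = class_scores.mean(dim=0).topk(k).indices            # (K,)
    labels = top_classes[class_scores[:, top_classes].argmax(dim=1)]  # (N,)
    return labels
```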
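For the second idea, here is a minimal LoRA sketch around a single linear layer; the class name, rank, and scaling below are illustrative choices, and where adapters are actually inserted in the VLM is described in the paper, not in this snippet.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)             # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Only the two low-rank matrices receive gradients, so the trainable parameter count stays a small fraction of the frozen VLM.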
Why it matters
VocAlign bridges a critical gap: it allows VLMs to adapt to new domains without source data, without prohibitive compute, and without sacrificing their open-vocabulary flexibility. On benchmarks like Cityscapes, VocAlign improves performance by more than 6 mIoU, fully recovering previously unrecognizable classes such as "terrain" or "wall". It also generalizes to diverse datasets like ADE20K and Pascal Context, showing consistent gains.
Key Takeaway
Models often get lost in translation when moving between domains. VocAlign provides them with a richer dictionary and a more efficient adaptation path, setting a new standard for source-free adaptation in open-vocabulary segmentation.
Acknowledgments
We acknowledge the European High Performance Computing Joint Undertaking (EuroHPC JU), EuroCC National Competence Center Sweden (ENCCS) and the CINECA award under the ISCRA initiative for the availability of high-performance computing resources and support.