Abstract: Vision-language models are emerging as a new generalized interface for image processing, moving the field beyond separate pipelines for captioning, retrieval, recognition, and reasoning toward unified visual systems. These models consist of two core components: a language model for reasoning and a vision backbone that converts multimedia inputs into representations the language model can interpret. Image foundation models used as vision backbones, such as CLIP, DINOv2, SAM, and RADIO, are shifting research and development toward generalized backbones that can support a wide range of image processing tasks with few-shot adaptation or minimal fine-tuning, while reducing the need for task-specific preprocessing. In this talk, I will discuss recent progress and remaining challenges in building such open models, with an emphasis on training and deployment efficiency. I will also highlight what remains unsolved for real-world deployment in robotics, autonomous vehicles, and general computer vision, including robustness, controllability, grounding, efficiency, and evaluation beyond closed benchmarks. The broader goal is to position VLMs not merely as a multimodal trend, but as a serious foundation for the next generation of generalized image processing systems.
Bio: Pavlo Molchanov received his PhD in radar signal processing from Tampere University of Technology, Finland, in 2014. During his studies, he was awarded the Nokia Foundation Scholarship, the GETA Graduate School grant, a Best Paper Award, and the EuRAD Young Researcher Award.
Since 2015, he has been with NVIDIA Research, where he is now a Research Director leading a deep learning efficiency team. His work focuses on LLMs and multimodal models, including research on model compression, NAS-like acceleration, novel architectures, and adaptive/conditional inference.
His earlier research has been widely deployed across NVIDIA platforms and technologies through advances in keypoint estimation, efficient vision backbones, and model optimization techniques. More recently, he has contributed to the design and compression of NVIDIA’s large-scale foundation models.