Tutorials

Background

ICIP 2026 features a comprehensive program of tutorials delivered by distinguished experts from academia and industry. These sessions are designed to provide participants with a thorough understanding of fundamental principles as well as emerging trends in image and signal processing. Please consult the list of accepted tutorials below to plan your participation.


Presenters:
– Muhammad Haroon Yousaf, University of Engineering and Technology Taxila, Pakistan
– Junaid Mir, University of Engineering and Technology Taxila, Pakistan
– Shah Nawaz, Assistant Professor, Johannes Kepler University Linz, Austria

The rise of advanced generative audio and video technologies has heightened the need for secure identity verification systems that function effectively across various languages, devices, and modalities. This tutorial provides a structured overview of multilingual face-voice biometrics, covering everything from foundational datasets to real-world applications. Key topics include cross-modal representation learning, challenges related to multilingual variability, and the evolving threat landscape from spoofing and deepfakes in audiovisual identity systems.

The tutorial begins with an in-depth look at face-voice association mechanisms, illustrating how cross-modal embeddings capture identity-relevant information. It then examines the effects of multilingual and code-switched speech on biometric verification, supported by empirical data from datasets like MAV-Celeb and MSS. Building on this foundation, the tutorial presents benchmarking methodologies and evaluation paradigms, drawing insights from FAME 2024 and FAME 2026 (https://mavceleb.github.io/dataset/competition.html). It evaluates how model performance varies across multilingual contexts and cross-domain shifts, offering a comprehensive picture of current multimodal biometrics.

The final section focuses on anti-spoofing and security, reviewing attack vectors such as replay attacks and deepfake threats. It also explores algorithmic countermeasures, from spectral features to advanced cross-modal fusion strategies, and discusses practical considerations for deployment in applications such as remote identity verification and mobile authentication. By integrating advances in multimodal learning, multilingual biometrics, benchmarking, and anti-spoofing strategies, this tutorial aims to equip participants with the knowledge and frameworks needed to develop robust audiovisual identity verification systems in the generative AI era.
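The cross-modal embedding idea described above is often trained with a symmetric contrastive (InfoNCE-style) objective, where a face embedding and a voice embedding of the same identity form a positive pair. The sketch below is a minimal numpy illustration under assumptions, not any specific system covered by the tutorial: the encoders are stubbed out as random arrays, and the temperature value is illustrative.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_logits(face_emb, voice_emb, temperature=0.07):
    # Cosine-similarity matrix between every face/voice pair in the batch.
    f = l2_normalize(face_emb)
    v = l2_normalize(voice_emb)
    return f @ v.T / temperature

def symmetric_infonce(face_emb, voice_emb, temperature=0.07):
    # Row i of each matrix belongs to the same identity, so the
    # diagonal holds the positive pairs; both directions are averaged.
    logits = cross_modal_logits(face_emb, voice_emb, temperature)
    log_sm_rows = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    log_sm_cols = logits - np.log(np.exp(logits).sum(0, keepdims=True))
    loss_f2v = -np.mean(np.diag(log_sm_rows))
    loss_v2f = -np.mean(np.diag(log_sm_cols))
    return 0.5 * (loss_f2v + loss_v2f)

# Stand-ins for encoder outputs: voice embeddings roughly aligned with faces.
rng = np.random.default_rng(0)
face = rng.normal(size=(8, 128))
voice = face + 0.1 * rng.normal(size=(8, 128))
loss = symmetric_infonce(face, voice)  # small when pairs are well aligned
```

At verification time, the same cosine similarity (a single row of `cross_modal_logits`) is thresholded to accept or reject a claimed identity.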

Presenters:
– Jan Flusser, Institute of Information Theory and Automation, Czech Academy of Sciences
– Filip Šroubek, Institute of Information Theory and Automation, Czech Academy of Sciences

Image recognition has long been one of the central challenges in computer vision and visual AI. The main difficulty arises from intra-class variability – the fact that objects of the same category can differ significantly. These differences may result from physical variability (e.g., two cars of different models), appearance changes (the same car seen from different angles), or variations in imaging conditions such as illumination, blur, or color. A successful recognition system should therefore produce responses that are invariant, or at least robust, to these variations.

Variability in imaging conditions can be mitigated by image restoration algorithms, such as denoising, deblurring, dehazing, or super-resolution, but these methods are prone to artifacts and do not address all sources of variability. Before deep learning, handcrafted invariant features were widely used, but they are difficult to design for complex object categories. Deep neural networks such as CNNs and transformers achieve high accuracy but are not intrinsically invariant to common transformations and rely heavily on data augmentation.

This tutorial presents modern approaches to achieving invariance and equivariance by combining classical invariant representations with learned deep models. We cover hybrid invariant networks, handcrafted features embedded into neural architectures, and equivariant networks whose outputs follow predictable transformation rules. The tutorial emphasizes theoretical insight, practical design principles, and applications involving blur, illumination changes, and geometric transformations.
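The handcrafted invariants mentioned above can be made concrete with classical moment invariants, an area closely associated with the presenters. Below is a minimal numpy sketch of the first Hu invariant, phi1 = eta20 + eta02, which is unchanged by translation, scaling, and rotation of a grayscale image; the rectangle test image is purely illustrative.

```python
import numpy as np

def central_moment(img, p, q):
    # Central moment mu_pq about the image centroid.
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w]
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00
    return ((x - xc) ** p * (y - yc) ** q * img).sum()

def hu_phi1(img):
    # First Hu invariant: eta20 + eta02, where eta_pq is the
    # scale-normalized central moment mu_pq / mu00^(1 + (p+q)/2).
    mu00 = central_moment(img, 0, 0)
    eta = lambda p, q: central_moment(img, p, q) / mu00 ** (1 + (p + q) / 2)
    return eta(2, 0) + eta(0, 2)

# An off-centre rectangle: the invariant is identical for the
# rotated (np.rot90) and translated (zero-padded) versions.
img = np.zeros((40, 40))
img[10:20, 5:30] = 1.0
```

Because central moments are taken about the centroid and normalized by mu00, translation and uniform scaling cancel out; the symmetric combination eta20 + eta02 additionally cancels rotation.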

Presenters:
– Jianhui Chang, China Telecom
– Giuseppe Valenzise, Université Paris-Saclay
– Shiqi Wang, City University of Hong Kong

Generative visual coding is fundamentally reshaping the landscape of low- and ultra-low-bitrate compression. It shifts decoding from signal recovery to conditional synthesis, letting encoders send ultra-compact representations while generative models reconstruct perceptually consistent content. This tutorial provides a structured overview of this emerging domain and introduces a taxonomy of the most relevant approaches proposed so far.

Specifically, we begin by tracing the evolution from model-based coding and the unified rate-distortion formulation to the integration of modern generative backbones, including VAEs, GANs, and diffusion models, into practical codec architectures. We focus on recent diffusion-driven compression methods, considering diverse methodologies where diffusion models function as (i) denoising-based decoders linking bitrate to pseudo-timesteps; (ii) conditional enhancement modules guided by coarse reconstructions; (iii) ultra-low-bitrate generators driven by compressed latent variables; and (iv) cross-modal synthesizers controlled by textual or spatial prompts. Furthermore, we address the critical bottleneck of inference speed by reviewing state-of-the-art acceleration strategies for sampling and decoding.
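The intuition behind point (i), linking bitrate to pseudo-timesteps, can be shown with a toy calculation. This is a hedged sketch rather than any published method: it assumes a hypothetical linear noise schedule and matches the standard uniform-quantization noise variance, delta^2 / 12, to that schedule, so that coarser latents (fewer bits) start the reverse diffusion process from an earlier, noisier pseudo-timestep.

```python
import numpy as np

def quantization_noise_std(signal_range, bits):
    # A uniform quantizer with step delta = range / 2^bits has
    # noise variance delta^2 / 12, hence std delta / sqrt(12).
    delta = signal_range / (2 ** bits)
    return delta / np.sqrt(12)

def pseudo_timestep(bits, signal_range=2.0, T=1000, sigma_max=1.0):
    # Hypothetical linear schedule sigma(t) = sigma_max * t / T:
    # pick the timestep whose noise level matches the quantization
    # noise, treating the dequantized latent as a partially noised sample.
    sigma_q = quantization_noise_std(signal_range, bits)
    return int(round(T * min(sigma_q / sigma_max, 1.0)))

# A 2-bit latent starts the reverse process much earlier (more
# denoising work) than an 8-bit latent.
t_coarse = pseudo_timestep(bits=2)
t_fine = pseudo_timestep(bits=8)
```

The decoder then runs only the remaining reverse steps, which is exactly how such methods trade bitrate against synthesis effort.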

We illustrate how these ideas can be brought to standardization through Generative Face Video Coding (GFVC). Here, model-based coding has evolved into mature pipelines using compact facial representations and motion modeling, with layered/residual coding naturally supporting different operating points along the rate-distortion-perception tradeoff. We highlight recent JVET milestones, particularly SEI messages for face video, showing how generative features ensure interoperability without normative changes to existing codecs.

Finally, we discuss open challenges, including computational complexity, model interpretability, and the need for novel quality assessment metrics, while envisioning the future of intelligent, bandwidth-efficient semantic visual communication.

Presenters:
– Fahad Sohrab, University of Eastern Finland & Tampere University, Finland
– Moncef Gabbouj, Tampere University, Finland

One-class classification (OCC) addresses learning scenarios in which training data are available from only a single target class, while samples from other classes are scarce, unknown, or prohibitively expensive to collect. Such settings arise frequently in image processing and imaging-based analysis, including medical image interpretation, hyperspectral and remote sensing imagery, industrial visual inspection, and vision-based anomaly detection. Traditional OCC methods typically operate in the original feature space, where they often struggle with high-dimensional representations, limited discriminative power, and an inability to effectively exploit spatial, structural, or relational information inherent in the data.

This tutorial provides a comprehensive and in-depth overview of subspace learning-based OCC for imaging applications, a rapidly advancing paradigm that jointly learns compact, discriminative representations and data descriptions tailored to high-dimensional visual data. The tutorial will cover theoretical foundations, optimization frameworks, and recent advances, including joint optimization of subspaces and data descriptions, graph-embedded OCC formulations for image data, and extensions to multi-view scenarios.
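The flavour of subspace-based one-class classification can be conveyed with a deliberately simplified numpy sketch. Note the hedge: the tutorial's subject is the *joint* optimization of the subspace and the data description, whereas this toy fixes the subspace by PCA on the target class and uses reconstruction error as the description; the synthetic data and the 95% acceptance quantile are illustrative assumptions.

```python
import numpy as np

class SubspaceOCC:
    # Toy sketch: a PCA subspace serves as the data description;
    # samples far from the target-class subspace score as anomalous.
    def __init__(self, n_components=2):
        self.k = n_components

    def fit(self, X):
        self.mean = X.mean(0)
        Xc = X - self.mean
        _, _, vt = np.linalg.svd(Xc, full_matrices=False)
        self.W = vt[: self.k].T                 # d x k subspace basis
        self.threshold = np.quantile(self.score(X), 0.95)  # accept ~95%
        return self

    def score(self, X):
        Xc = X - self.mean
        proj = Xc @ self.W @ self.W.T           # projection onto subspace
        return np.linalg.norm(Xc - proj, axis=1)  # reconstruction error

    def predict(self, X):
        return self.score(X) <= self.threshold  # True = target class

# Synthetic targets near a 2-D plane in 10-D; outliers fill the space.
rng = np.random.default_rng(1)
basis = rng.normal(size=(2, 10))
targets = rng.normal(size=(200, 2)) @ basis + 0.05 * rng.normal(size=(200, 10))
outliers = 3.0 * rng.normal(size=(50, 10))
model = SubspaceOCC(n_components=2).fit(targets)
```

The joint formulations covered in the tutorial replace the fixed PCA step with a subspace optimized together with the one-class description, which is what makes them effective on high-dimensional imaging data.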

Through imaging-centred case studies and hands-on demonstrations, participants will gain practical insights into applying these methods to challenging real-world problems such as early myocardial infarction detection from multi-view echocardiography, hyperspectral image analysis, and rare object or species identification from visual data. The tutorial bridges theory and practice, equipping attendees with both conceptual understanding and practical tools to address data scarcity and high-dimensional learning challenges in modern image processing and computer vision systems using state-of-the-art OCC techniques.

Presenter:
– Aline Roumy, Inria Center at Rennes University

DNA-based data storage is an emerging data storage solution that offers numerous advantages. It is extremely dense (approximately 10^5 times denser than current storage systems such as LTO tape), highly durable, and consumes no energy during storage. As such, it has the potential to address both the rapid growth of data volumes and the energy costs associated with data storage.

This presentation will address key questions such as: why is data compression necessary in DNA-based data storage, and how can data be compressed in this context? In particular, the talk will focus on the biochemical constraints that must be respected to enable DNA-based data storage, both from a theoretical perspective and in practical implementations. As an illustration, the JPEG-DNA standard will be presented as an example of how these principles can be applied in practice.
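One of the biochemical constraints alluded to above is avoiding homopolymer runs (the same nucleotide repeated), which cause sequencing errors. A commonly cited trick, in the spirit of rotating ternary codes, is sketched below; it is an illustrative toy, not the JPEG-DNA scheme, and real pipelines also balance GC content and add error correction, which this sketch omits.

```python
NUCS = "ACGT"

def trits_from_bytes(data):
    # Expand each byte into six base-3 digits (255 < 3**6 = 729).
    for byte in data:
        for _ in range(6):
            yield byte % 3
            byte //= 3

def encode(data, prev="A"):
    # Each trit selects one of the three nucleotides that differ from
    # the previous one, so no nucleotide ever repeats consecutively.
    out = []
    for t in trits_from_bytes(data):
        choices = [n for n in NUCS if n != prev]
        prev = choices[t]
        out.append(prev)
    return "".join(out)

seq = encode(b"ICIP")  # 4 bytes -> 24 homopolymer-free nucleotides
```

Decoding simply inverts the rotation: knowing the previous nucleotide, the position of the current one among the three allowed choices recovers each trit.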

Presenters:
– Wen-Hsiao Peng, National Yang Ming Chiao Tung University, Taiwan
– Heming Sun, Yokohama National University, Japan

End-to-end learned image and video coding has emerged as a powerful alternative to traditional transform-based codecs, achieving rate-distortion performance that surpasses state-of-the-art standards such as H.266/VVC in various scenarios. Beyond compression efficiency, learned codecs offer increased flexibility, enabling new applications including perceptual coding and machine-oriented visual compression. As a result, this topic has attracted significant attention across both the signal processing and computer vision communities, with rapid progress reported at major venues such as ICIP, ICASSP, CVPR, and ICCV.

Despite these advances, the extremely high computational complexity of learned image and video codecs remains largely unaddressed. Compared to traditional codecs, learned video decoders may require two to three orders of magnitude more MAC operations per pixel and often rely on floating-point arithmetic, posing serious obstacles to practical deployment. Balancing rate, distortion, and computational complexity has therefore become a key open problem.

This tutorial provides a comprehensive overview of recent advances in end-to-end learned image and video coding, with a particular emphasis on rate-distortion-complexity trade-offs. From the algorithmic perspective, we review modern codec frameworks such as overfitted coding and conditional residual coding. From the system perspective, we discuss network quantization and buffering strategies, highlighting their impact on complexity, memory footprint, and cross-platform interoperability. The tutorial also covers standardization activities in JPEG and MPEG.
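The rate-distortion tradeoff at the heart of these codecs comes from a training objective of the form L = R + lambda * D. The numpy sketch below illustrates that objective under stated assumptions: rate is approximated by the empirical entropy of the quantized latent, the decoder is a stand-in (identity here), and the step sizes are illustrative; real codecs use learned entropy models and neural decoders.

```python
import numpy as np

def empirical_entropy_bits(symbols):
    # Entropy of the empirical symbol distribution, in bits per symbol.
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def rd_loss(latent, decoder, target, lam=0.01, step=0.5):
    # L = R + lambda * D: rate from the quantized symbols to transmit,
    # distortion from the decoded reconstruction.
    q = np.round(latent / step)              # quantization
    rate = empirical_entropy_bits(q) * q.size
    recon = decoder(q * step)                # dequantize, then decode
    distortion = np.mean((recon - target) ** 2)
    return rate + lam * distortion, rate, distortion

# Coarser quantization trades rate for distortion.
rng = np.random.default_rng(0)
z = rng.normal(size=1000)
identity = lambda x: x                       # stand-in decoder
_, r_fine, d_fine = rd_loss(z, identity, z, step=0.25)
_, r_coarse, d_coarse = rd_loss(z, identity, z, step=2.0)
```

Complexity-aware formulations discussed in the tutorial extend this objective with a third term penalizing decoding cost, which is what turns the problem into a rate-distortion-complexity tradeoff.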

By bridging algorithm design and implementation considerations, this tutorial aims to equip researchers and practitioners with a holistic understanding of learned image and video coding and to inspire future research toward efficient, deployable neural codecs.