Georgios Kordopatis-Zilos, Ph.D.
All publications
AMES: Asymmetric and Memory-Efficient Similarity Estimation for Instance-Level Retrieval
- Authors: Ing. Pavel Šuma, Georgios Kordopatis-Zilos, Ph.D., Iscen, A., doc. Georgios Tolias, Ph.D.
- Publication: Computer Vision – ECCV 2024, Part LIX. Springer, Cham, 2025. p. 307-325. LNCS. vol. 15117. ISSN 0302-9743. ISBN 978-3-031-73201-0.
- Year: 2025
- DOI: 10.1007/978-3-031-73202-7_18
- Link: https://doi.org/10.1007/978-3-031-73202-7_18
- Department: Visual Recognition Group
Abstract:
This work investigates the problem of instance-level image retrieval re-ranking with the constraint of memory efficiency, ultimately aiming to limit memory usage to 1KB per image. Departing from the prevalent focus on performance enhancements, this work prioritizes the crucial trade-off between performance and memory requirements. The proposed model uses a transformer-based architecture designed to estimate image-to-image similarity by capturing interactions within and across images based on their local descriptors. A distinctive property of the model is the capability for asymmetric similarity estimation. Database images are represented with a smaller number of descriptors compared to query images, enabling performance improvements without increasing memory consumption. To ensure adaptability across different applications, a universal model is introduced that adjusts to a varying number of local descriptors during the testing phase. Results on standard benchmarks demonstrate the superiority of our approach over both hand-crafted and learned models. In particular, compared with current state-of-the-art methods that overlook their memory footprint, our approach not only attains superior performance but does so with a significantly reduced memory footprint. The code and pretrained models are publicly available at: https://github.com/pavelsuma/ames
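To make the asymmetric setting above concrete, the following is a minimal PyTorch sketch, not the released AMES code: a small transformer scores a query-database pair from their local descriptors, and the database image is stored with far fewer descriptors than the query (module names, dimensions, and descriptor counts are illustrative assumptions).

    import torch
    import torch.nn as nn

    class SimilarityTransformer(nn.Module):
        # Hypothetical sketch of transformer-based similarity estimation over local descriptors.
        def __init__(self, dim=128, heads=4, layers=2):
            super().__init__()
            self.cls = nn.Parameter(torch.zeros(1, 1, dim))
            enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
            self.head = nn.Linear(dim, 1)  # similarity score read from the CLS token

        def forward(self, query_desc, db_desc):
            # query_desc: (B, Nq, dim) local descriptors of the query image
            # db_desc:    (B, Nd, dim) local descriptors of the database image, with Nd much smaller than Nq
            cls = self.cls.expand(query_desc.size(0), -1, -1)
            x = torch.cat([cls, query_desc, db_desc], dim=1)  # attention mixes intra- and cross-image interactions
            x = self.encoder(x)
            return self.head(x[:, 0]).squeeze(-1)             # one scalar similarity per pair

    model = SimilarityTransformer()
    query = torch.randn(2, 600, 128)    # query kept at a large descriptor budget
    database = torch.randn(2, 50, 128)  # database entry stored with few descriptors to save memory
    scores = model(query, database)     # shape (2,)

The asymmetry lives purely in the inputs: shrinking the database-side descriptor count reduces per-image storage, while the query side keeps a richer representation at search time.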
Fusion Transformer with Object Mask Guidance for Image Forgery Analysis
- Authors: Karageorgiou, D., Georgios Kordopatis-Zilos, Ph.D., Papadopoulos, S.
- Publication: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Los Alamitos: IEEE Computer Society, 2024. p. 4345-4355. ISSN 2160-7516. ISBN 979-8-3503-6547-4.
- Year: 2024
- DOI: 10.1109/CVPRW63382.2024.00438
- Link: https://doi.org/10.1109/CVPRW63382.2024.00438
- Department: Visual Recognition Group
Abstract:
In this work, we introduce OMG-Fuser, a fusion transformer-based network designed to extract information from various forensic signals to enable robust image forgery detection and localization. Our approach can operate with an arbitrary number of forensic signals and leverages object information for their analysis, unlike previous methods that rely on fusion schemes with few signals and often disregard image semantics. To this end, we design a forensic signal stream composed of a transformer guided by an object attention mechanism associating patches that depict the same objects. In that way, we incorporate object-level information from the image. Each forensic signal is processed by a different stream that adapts to its peculiarities. A token fusion transformer efficiently aggregates the outputs of an arbitrary number of network streams and generates a fused representation for each image patch. These representations are finally processed by a long-range dependencies transformer that captures the intrinsic relations between the image patches. We assess two fusion variants on top of the proposed approach: (i) score-level fusion that fuses the outputs of multiple image forensics algorithms and (ii) feature-level fusion that fuses low-level forensic traces directly. Both variants exceed state-of-the-art performance on seven datasets for image forgery detection and localization with a relative average improvement of 12.1% and 20.4% in terms of F1. Our model is robust against traditional and novel forgery attacks and can be expanded with new signals without training from scratch. Our code is publicly available at: https://github.com/mever-team/omgfuser
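As an illustration of the object attention idea described above, here is a hedged PyTorch sketch (not the released OMG-Fuser code; the patch-to-object labelling is assumed to come from an external object mask): patches that belong to the same object are allowed to attend to each other, while attention across different objects is blocked.

    import torch
    import torch.nn as nn

    def object_attention_mask(object_ids):
        # object_ids: (B, N) integer object label per image patch.
        same_object = object_ids.unsqueeze(1) == object_ids.unsqueeze(2)  # (B, N, N)
        return ~same_object  # True entries are blocked by nn.MultiheadAttention

    class ObjectGuidedBlock(nn.Module):
        # Hypothetical single block of an object-guided forensic signal stream.
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, tokens, object_ids):
            # tokens: (B, N, dim) patch tokens of one forensic signal
            mask = object_attention_mask(object_ids)
            mask = mask.repeat_interleave(self.attn.num_heads, dim=0)  # (B*heads, N, N)
            attended, _ = self.attn(tokens, tokens, tokens, attn_mask=mask)
            return self.norm(tokens + attended)  # residual update restricted to same-object patches

In the full method, each forensic signal would have its own stream of such blocks, with a separate token fusion transformer aggregating the streams; that part is omitted here.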
MAD '24 Workshop: Multimedia AI against Disinformation
- Authors: Stanciu, C., Ionescu, B., Cuccovillo, L., Papadopoulos, S., Georgios Kordopatis-Zilos, Ph.D.
- Publication: 3rd ACM International Workshop on Multimedia AI against Disinformation (MAD '24). New York: ACM, 2024. p. 1339-1341. ISBN 979-8-4007-0602-8.
- Year: 2024
- DOI: 10.1145/3652583.3660000
- Link: https://doi.org/10.1145/3652583.3660000
- Department: Visual Recognition Group
Abstract:
Synthetic media generation and manipulation have seen rapid advancements in recent years, making it increasingly easy to create multimedia content that is indistinguishable to the human observer. Moreover, generated content can be used maliciously by individuals and organizations in order to spread disinformation, posing a significant threat to society and democracy. Hence, there is an urgent need for AI tools geared towards facilitating a timely and effective media verification process. The MAD'24 workshop seeks to bring together people with diverse backgrounds who are dedicated to combating disinformation in multimedia through the means of AI, by fostering an environment for exploring innovative ideas and sharing experiences. The research areas of interest encompass the identification of manipulated or generated content, along with the investigation of the dissemination of disinformation and its societal repercussions. Recognizing the significance of multimedia, the workshop emphasizes the joint analysis of various modalities within content, as verification can be improved by aggregating multiple forms of content.
MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection
- Authors: Coccomini, D. A., Georgios Kordopatis-Zilos, Ph.D., Amato, G., Caldelli, R.
- Publication: IEEE Transactions on Information Forensics and Security. 2024, vol. 19, p. 6084-6096. ISSN 1556-6013.
- Year: 2024
- DOI: 10.1109/TIFS.2024.3409054
- Link: https://doi.org/10.1109/TIFS.2024.3409054
- Department: Visual Recognition Group
Abstract:
In this paper, we present MINTIME, a video deepfake detection method that effectively captures spatial and temporal inconsistencies in videos that depict multiple individuals and varying face sizes. Unlike previous approaches that either employ simplistic a-posteriori aggregation schemes, i.e., averaging or max operations, or only focus on the largest face in the video, our proposed method learns to accurately detect spatio-temporal inconsistencies across multiple identities in a video through a Spatio-Temporal Transformer combined with a Convolutional Neural Network backbone. This is achieved through an Identity-aware Attention mechanism that applies a masking operation on the face sequence to process each identity independently, which enables effective video-level aggregation. Furthermore, our system incorporates two novel embedding schemes: (i) the Temporal Coherent Positional Embedding, which encodes the temporal information of the face sequences of each identity, and (ii) the Size Embedding, which captures the relative sizes of the faces to the video frames. MINTIME achieves state-of-the-art performance on the ForgeryNet dataset, with a remarkable improvement of up to 14% AUC in videos containing multiple people. Moreover, it demonstrates very robust generalization capabilities in cross-forgery and cross-dataset settings. The code is publicly available at: https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection.
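The following is a minimal sketch of the two embedding schemes and the identity masking described above; it is not the MINTIME code, and the binning of face sizes and all dimensions are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class FaceTokenEmbedding(nn.Module):
        # Hypothetical combination of a temporal positional embedding and a face-size embedding.
        def __init__(self, dim=512, max_frames=64, size_bins=8):
            super().__init__()
            self.temporal = nn.Embedding(max_frames, dim)  # encodes the frame index of each face
            self.size = nn.Embedding(size_bins, dim)       # encodes face area relative to the frame
            self.size_bins = size_bins

        def forward(self, face_feats, frame_idx, face_area_ratio):
            # face_feats: (B, T, dim) CNN features of face crops
            # frame_idx: (B, T) frame index per face; face_area_ratio: (B, T) values in [0, 1]
            size_bin = (face_area_ratio.clamp(0, 1) * (self.size_bins - 1)).round().long()
            return face_feats + self.temporal(frame_idx) + self.size(size_bin)

    def identity_attention_mask(identity_ids):
        # identity_ids: (B, T) identity label per face token; True entries are blocked,
        # so attention stays within each identity's own face sequence.
        return identity_ids.unsqueeze(1) != identity_ids.unsqueeze(2)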
The 2023 video similarity dataset and challenge
- Authors: Pizzi, E., Georgios Kordopatis-Zilos, Ph.D., Patel, H., Postelnicu, G., doc. Georgios Tolias, Ph.D.
- Publication: Computer Vision and Image Understanding. 2024, vol. 243. ISSN 1077-3142.
- Year: 2024
- DOI: 10.1016/j.cviu.2024.103997
- Link: https://doi.org/10.1016/j.cviu.2024.103997
- Department: Visual Recognition Group
Abstract:
This work introduces a dataset, benchmark, and challenge for the problem of video copy tracing. There are two related tasks: determining whether a query video shares content with a reference video ("detection") and temporally localizing the shared content within each video ("localization"). The benchmark is designed to evaluate methods on these two tasks. It simulates a realistic needle-in-haystack setting, where the majority of both query and reference videos are "distractors" containing no copied content. We propose an accuracy metric for both tasks. The associated challenge imposes computing resource restrictions that reflect real-world settings. We also analyze the results and methods of the top submissions to the challenge. The dataset, baseline methods, and evaluation code are publicly available and were discussed at the Visual Copy Detection Workshop (VCDW) at CVPR'23. We provide reference code for evaluation and baselines at: https://github.com/facebookresearch/vsc2022.
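The abstract does not spell out the proposed metric, so the sketch below is only an illustration of one way the needle-in-haystack detection task can be scored: all query-reference pairs are ranked by predicted confidence and a micro-averaged precision is computed over that ranking (an assumption on my part; see the linked repository for the official evaluation code).

    import numpy as np

    def micro_average_precision(scores, labels):
        # scores: (N,) predicted confidence for each query-reference pair
        # labels: (N,) 1 if the pair truly shares copied content, 0 for distractor pairs
        scores, labels = np.asarray(scores, dtype=float), np.asarray(labels, dtype=float)
        order = np.argsort(-scores)                    # rank all pairs across all queries jointly
        labels = labels[order]
        cumulative_tp = np.cumsum(labels)
        precision = cumulative_tp / (np.arange(len(labels)) + 1)
        recall_step = labels / max(labels.sum(), 1.0)  # each true pair contributes one recall step
        return float((precision * recall_step).sum())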
The Visual Saliency Transformer Goes Temporal: TempVST for Video Saliency Prediction
- Authors: Lazaridis, N., Georgiadis, K., Kalaganis, F., Georgios Kordopatis-Zilos, Ph.D.
- Publication: IEEE Access. 2024, vol. 12, p. 129705-129716. ISSN 2169-3536.
- Year: 2024
- DOI: 10.1109/ACCESS.2024.3436585
- Link: https://doi.org/10.1109/ACCESS.2024.3436585
- Department: Visual Recognition Group
Abstract:
The Transformer revolutionized Natural Language Processing and Computer Vision by effectively capturing contextual relationships in sequential data through its attention mechanism. While Transformers have been explored sufficiently in traditional computer vision tasks such as image classification, their application to more intricate tasks, such as Video Saliency Prediction (VSP), remains limited. Video saliency prediction is the task of identifying the most visually salient regions in a video, which are likely to capture a viewer's attention. In this study, we propose a pure transformer architecture named Temporal Visual Saliency Transformer (TempVST) for the VSP task. Our model leverages the Visual Saliency Transformer (VST) as a backbone, with the addition of a Transformer-based temporal module that can seamlessly transition diverse architectural frameworks from the image to the video domain through the incorporation of temporal recurrences. Moreover, we demonstrate that transfer learning is viable in the context of VSP through Transformer architectures and reduces the duration of the training phase by 41% and 45% on two different datasets. Our experiments were conducted on two benchmark datasets, DHF1K and LEDOV, and our results show that our network can compete with all other state-of-the-art models.
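To sketch the idea of the temporal module, here is a hedged, minimal PyTorch example (not the TempVST implementation; shapes and sizes are assumptions): per-frame tokens produced by an image backbone are rearranged so a transformer attends across time at each spatial location.

    import torch
    import torch.nn as nn

    class TemporalModule(nn.Module):
        # Hypothetical temporal transformer placed on top of per-frame backbone tokens.
        def __init__(self, dim=384, heads=6, layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

        def forward(self, frame_tokens):
            # frame_tokens: (B, T, N, dim) with N spatial tokens per frame from the image backbone
            b, t, n, d = frame_tokens.shape
            x = frame_tokens.permute(0, 2, 1, 3).reshape(b * n, t, d)  # sequences over the T frames
            x = self.encoder(x)                                        # temporal attention only
            return x.reshape(b, n, t, d).permute(0, 2, 1, 3)           # back to (B, T, N, dim)

Because the spatial backbone is left untouched, image-pretrained weights can be reused, which is the transfer-learning aspect the abstract highlights.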
Improving Synthetically Generated Image Detection in Cross-Concept Settings
- Authors: Dogoulis, P., Georgios Kordopatis-Zilos, Ph.D., Kompatsiaris, I., Papadopoulos, S.
- Publication: 2nd ACM International Workshop on Multimedia AI against Disinformation (MAD '23). New York: ACM, 2023. p. 28-35. ISBN 979-8-4007-0178-8.
- Year: 2023
- DOI: 10.1145/3592572.3592846
- Link: https://doi.org/10.1145/3592572.3592846
- Department: Visual Recognition Group
Abstract:
New advancements for the detection of synthetic images are critical for fighting disinformation, as the capabilities of generative AI models continuously evolve and can lead to hyper-realistic synthetic imagery at unprecedented scale and speed. In this paper, we focus on the challenge of generalizing across different concept classes, e.g., when training a detector on human faces and testing on synthetic animal images - highlighting the ineffectiveness of existing approaches that randomly sample generated images to train their models. By contrast, we propose an approach based on the premise that the robustness of the detector can be enhanced by training it on realistic synthetic images that are selected based on their quality scores according to a probabilistic quality estimation model. We demonstrate the effectiveness of the proposed approach by conducting experiments with generated images from two seminal architectures, StyleGAN2 and Latent Diffusion, and using three different concepts for each, so as to measure the cross-concept generalization ability. Our results show that our quality-based sampling method leads to higher detection performance for nearly all concepts, improving the overall effectiveness of the synthetic image detectors.
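As a hedged sketch of the quality-based sampling step (not the paper's exact pipeline; the quality estimator and the keep ratio are placeholders), the idea is to score a pool of generated images and keep only the most realistic ones for detector training:

    import torch

    def select_training_images(images, quality_model, keep_ratio=0.5):
        # images: (N, 3, H, W) pool of generated images from one generator
        # quality_model: any callable returning a per-image realism/quality score
        with torch.no_grad():
            scores = quality_model(images).flatten()   # higher score = more realistic sample
        k = max(1, int(keep_ratio * images.size(0)))
        top = torch.topk(scores, k).indices            # keep the highest-quality images
        return images[top], scores[top]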
MAD '23 Workshop: Multimedia AI against Disinformation
- Authors: Cuccovillo, L., Ionescu, B., Georgios Kordopatis-Zilos, Ph.D., Papadopoulos, S.
- Publication: 2nd ACM International Workshop on Multimedia AI against Disinformation (MAD '23). New York: ACM, 2023. p. 676-677. ISBN 979-8-4007-0178-8.
- Year: 2023
- DOI: 10.1145/3591106.3592303
- Link: https://doi.org/10.1145/3591106.3592303
- Department: Visual Recognition Group
Abstract:
With recent advancements in synthetic media manipulation and generation, verifying multimedia content posted online has become increasingly difficult. Additionally, the malicious exploitation of AI technologies by actors to disseminate disinformation on social media, and more generally the Web, at an alarming pace poses significant threats to society and democracy. Therefore, the development of AI-powered tools that facilitate media verification is urgently needed. The MAD '23 workshop aims to bring together individuals working on the wider topic of detecting disinformation in multimedia to exchange their experiences and discuss innovative ideas, attracting people with varying backgrounds and expertise. The research areas of interest include identifying manipulated and synthetic content in multimedia, as well as examining the dissemination of disinformation and its impact on society. The multimedia aspect is very important since content most often contains a mix of modalities and their joint analysis can boost the performance of verification methods.
Self-Supervised Video Similarity Learning
- Authors: Georgios Kordopatis-Zilos, Ph.D., doc. Georgios Tolias, Ph.D., Tzelepis, C., Kompatsiaris, I., Patras, I., Papadopoulos, S.
- Publication: Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). USA: IEEE Computer Society, 2023. p. 4756-4766. ISSN 2160-7516. ISBN 979-8-3503-0249-3.
- Year: 2023
- DOI: 10.1109/CVPRW59228.2023.00504
- Link: https://doi.org/10.1109/CVPRW59228.2023.00504
- Department: Visual Recognition Group
Abstract:
We introduce S2VS, a video similarity learning approach with self-supervision. Self-Supervised Learning (SSL) is typically used to train deep models on a proxy task so as to have strong transferability on target tasks after fine-tuning. Here, in contrast to prior work, SSL is used to perform video similarity learning and address multiple retrieval and detection tasks at once with no use of labeled data. This is achieved by learning via instance-discrimination with task-tailored augmentations and the widely used InfoNCE loss together with an additional loss operating jointly on self-similarity and hard-negative similarity. We benchmark our method on tasks where video relevance is defined with varying granularity, ranging from video copies to videos depicting the same incident or event. We learn a single universal model that achieves state-of-the-art performance on all tasks, surpassing previously proposed methods that use labeled data. The code and pretrained models are publicly available at: https://github.com/gkordo/s2vs
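To illustrate the instance-discrimination objective mentioned above, here is a minimal PyTorch sketch of the InfoNCE part (not the released S2VS code; the additional loss on self-similarity and hard-negative similarity is omitted): two augmented views of each video form the positive pair, and the other videos in the batch act as negatives.

    import torch
    import torch.nn.functional as F

    def info_nce(emb_a, emb_b, temperature=0.07):
        # emb_a, emb_b: (B, D) embeddings of two augmented views of the same B videos
        emb_a = F.normalize(emb_a, dim=1)
        emb_b = F.normalize(emb_b, dim=1)
        logits = emb_a @ emb_b.t() / temperature                     # (B, B) pairwise similarities
        targets = torch.arange(emb_a.size(0), device=emb_a.device)  # positives sit on the diagonal
        return F.cross_entropy(logits, targets)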
Test-time Training for Matching-based Video Object Segmentation
- Authors: Bertrand, J., Georgios Kordopatis-Zilos, Ph.D., Kalantidis, Y., doc. Georgios Tolias, Ph.D.
- Publication: Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Montreal: Neural Information Processing Society, 2023. vol. 36. ISSN 1049-5258.
- Year: 2023
- Department: Visual Recognition Group
Abstract:
The video object segmentation (VOS) task involves the segmentation of an object over time based on a single initial mask. Current state-of-the-art approaches use a memory of previously processed frames and rely on matching to estimate segmentation masks of subsequent frames. Lacking any adaptation mechanism, such methods are prone to test-time distribution shifts. This work focuses on matching-based VOS under distribution shifts such as video corruptions, stylization, and sim-to-real transfer. We explore test-time training strategies that are agnostic to the specific task as well as strategies that are designed specifically for VOS. This includes a variant based on mask cycle consistency tailored to matching-based VOS methods. The experimental results on common benchmarks demonstrate that the proposed test-time training yields significant improvements in performance. In particular for the sim-to-real scenario and despite using only a single test video, our approach manages to recover a substantial portion of the performance gain achieved through training on real videos. Additionally, we introduce DAVIS-C, an augmented version of the popular DAVIS test set, featuring extreme distribution shifts like image-/video-level corruptions and stylizations. Our results illustrate that test-time training enhances performance even in these challenging cases.
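As a hedged illustration of the mask cycle-consistency idea (the segmenter interface, loss, and optimizer below are assumptions, not the paper's exact formulation): the initial mask is propagated forward through the test video, propagated back to the first frame, and the model is updated so the cycled mask agrees with the original one.

    import torch
    import torch.nn.functional as F

    def cycle_consistency_step(segmenter, frames, init_mask, optimizer):
        # segmenter(frame, prev_mask) -> mask logits for that frame (hypothetical interface)
        # frames: list of (1, 3, H, W) tensors; init_mask: (1, 1, H, W) first-frame mask in [0, 1]
        mask = init_mask
        for frame in frames[1:]:
            mask = torch.sigmoid(segmenter(frame, mask))  # propagate forward to the last frame
        for frame in reversed(frames[:-1]):
            mask = torch.sigmoid(segmenter(frame, mask))  # propagate back to the first frame
        loss = F.binary_cross_entropy(mask, init_mask)    # cycled mask should match the initial mask
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()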