
Vasileios Psomas, Ph.D.

All publications

Composed Image Retrieval for Remote Sensing

  • DOI: 10.1109/IGARSS53475.2024.10642874
  • Link: https://doi.org/10.1109/IGARSS53475.2024.10642874
  • Department: Visual Recognition Group
  • Annotation:
    This work introduces composed image retrieval to remote sensing. It allows querying a large image archive by an example image modified by a textual description, enriching the descriptive power over unimodal queries, whether visual or textual. Various attributes can be modified by the textual part, such as shape, color, or context. A novel method fusing image-to-image and text-to-image similarity is introduced. We demonstrate that a vision-language model possesses sufficient descriptive power and that no further learning step or training data are necessary. We present a new evaluation benchmark focused on color, context, density, existence, quantity, and shape modifications. Our work not only sets the state of the art for this task, but also serves as a foundational step in addressing a gap in the field of remote sensing image retrieval. Code at: https://github.com/billpsomas/rscir.
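  • Illustrative sketch:
    A minimal sketch of the training-free fusion idea described above: combine image-to-image and text-to-image similarities obtained from any vision-language model (e.g. CLIP) and rank the archive. The fusion weight alpha, the function name, and the use of plain NumPy on precomputed, L2-normalized embeddings are illustrative assumptions, not the released implementation.

        import numpy as np

        def fused_retrieval(query_img_emb, query_txt_emb, db_img_embs, alpha=0.5, top_k=5):
            """Rank archive images by a convex combination of visual and textual similarity.

            query_img_emb : (d,)   embedding of the query image
            query_txt_emb : (d,)   embedding of the textual modification
            db_img_embs   : (n, d) embeddings of the archive images (rows L2-normalized)
            """
            sim_img = db_img_embs @ query_img_emb   # image-to-image similarity, shape (n,)
            sim_txt = db_img_embs @ query_txt_emb   # text-to-image similarity, shape (n,)
            scores = alpha * sim_img + (1.0 - alpha) * sim_txt
            return np.argsort(-scores)[:top_k]      # indices of the best matches

        # toy usage with random unit vectors standing in for real embeddings
        rng = np.random.default_rng(0)
        db = rng.normal(size=(1000, 512))
        db /= np.linalg.norm(db, axis=1, keepdims=True)
        txt = rng.normal(size=512)
        print(fused_retrieval(db[42], txt / np.linalg.norm(txt), db))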

Evaluation of Resource-Efficient Crater Detectors on Embedded Systems

  • Authors: Vellas, S., Vasileios Psomas, Ph.D., Karadima, K., Danopoulos, D., Paterakis, A., Lentaris, G., Soudris, D., Karantzalos, K.
  • Publication: IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium Proceedings. Piscataway: IEEE, 2024. ISSN 2153-6996. ISBN 979-8-3503-6033-2.
  • Year: 2024
  • DOI: 10.1109/IGARSS53475.2024.10642518
  • Link: https://doi.org/10.1109/IGARSS53475.2024.10642518
  • Department: Visual Recognition Group
  • Annotation:
    Real-time analysis of Martian craters is crucial for mission-critical operations, including safe landings and geological exploration. This work leverages the latest breakthroughs for on-the-edge crater detection aboard spacecraft. We rigorously benchmark several YOLO networks using a Mars crater dataset, analyzing their performance on embedded systems with a focus on optimization for low-power devices. We optimize this process for a new wave of cost-effective smaller satellites built from commercial off-the-shelf components. Implementations on diverse platforms, including the Google Coral Edge TPU, AMD Versal SoC VCK190, NVIDIA Jetson Nano, and Jetson AGX Orin, undergo a detailed trade-off analysis. Our findings identify optimal network-device pairings, enhancing the feasibility of crater detection on resource-constrained hardware and setting a new precedent for efficient and resilient extraterrestrial imaging. Code at: https://github.com/billpsomas/mars_crater_detection.
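  • Illustrative sketch:
    A generic latency/throughput harness of the kind such a trade-off analysis relies on; `run_detector` stands in for any YOLO-style inference call, which in practice is framework- and device-specific (Edge TPU, Versal, Jetson). The warm-up and repetition counts are illustrative.

        import time
        import numpy as np

        def benchmark(run_detector, image, warmup=10, repeats=100):
            for _ in range(warmup):                    # let caches and clocks settle
                run_detector(image)
            t0 = time.perf_counter()
            for _ in range(repeats):
                run_detector(image)
            dt = (time.perf_counter() - t0) / repeats  # mean seconds per inference
            return {"latency_ms": 1e3 * dt, "fps": 1.0 / dt}

        # dummy stand-in: a "detector" that merely thresholds the image
        dummy_detector = lambda img: (img > 0.5).sum()
        print(benchmark(dummy_detector, np.random.rand(640, 640, 3)))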

Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?

  • Authors: Vasileios Psomas, Ph.D., Kakogeorgiou, I., Karantzalos, K., Avrithis, Y.
  • Publication: ICCV2023: Proceedings of the International Conference on Computer Vision. Piscataway: IEEE, 2023. p. 5327-5337. ISSN 1550-5499. ISBN 979-8-3503-0719-1.
  • Year: 2023
  • DOI: 10.1109/ICCV51070.2023.00493
  • Link: https://doi.org/10.1109/ICCV51070.2023.00493
  • Department: Visual Recognition Group
  • Annotation:
    Convolutional networks and vision transformers have different forms of pairwise interactions, pooling across layers and pooling at the end of the network. Does the latter really need to be different? As a by-product of pooling, vision transformers provide spatial attention for free, but this is most often of low quality unless self-supervised, which is not well studied. Is supervision really the problem? In this work, we develop a generic pooling framework and then formulate a number of existing methods as instantiations of it. By discussing the properties of each group of methods, we derive SimPool, a simple attention-based pooling mechanism that replaces the default one in both convolutional and transformer encoders. We find that, whether supervised or self-supervised, this improves performance on pre-training and downstream tasks and provides attention maps delineating object boundaries in all cases. One could thus call SimPool universal. To our knowledge, we are the first to obtain attention maps in supervised transformers of at least as good quality as in self-supervised ones, without explicit losses or modifying the architecture. Code at: https://github.com/billpsomas/simpool.
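  • Illustrative sketch:
    A minimal attention-based pooling layer in the spirit of the annotation: the global average of the patch features forms a query, and attention over the patches replaces plain average pooling. This is an illustrative simplification of the idea, not the exact SimPool module from the repository.

        import torch
        import torch.nn as nn

        class AttentionPool(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.q = nn.Linear(dim, dim, bias=False)   # query projection
                self.k = nn.Linear(dim, dim, bias=False)   # key projection
                self.scale = dim ** -0.5

            def forward(self, x):                          # x: (B, N, D) patch features
                q = self.q(x.mean(dim=1, keepdim=True))    # (B, 1, D) pooled query
                k = self.k(x)                              # (B, N, D)
                attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=-1)  # (B, 1, N)
                return (attn @ x).squeeze(1)               # (B, D) pooled representation

        feats = torch.randn(2, 196, 384)                   # e.g. ViT-S patch tokens
        print(AttentionPool(384)(feats).shape)             # torch.Size([2, 384])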

Deep learning for downscaling remote sensing images: Fusion and super-resolution

  • Authors: Sdraka, M., Papoutsis, I., Vasileios Psomas, Ph.D., Vlachos, K., Ioannidis, K., Karantzalos, K., Gialampoukidis, I., Vrochidis, S.
  • Publication: IEEE Geoscience and Remote Sensing Magazine. 2022, ISSN 2168-6831.
  • Year: 2022
  • DOI: 10.1109/MGRS.2022.3171836
  • Link: https://doi.org/10.1109/MGRS.2022.3171836
  • Department: Visual Recognition Group
  • Annotation:
    The past few years have seen an accelerating integration of deep learning (DL) techniques into various remote sensing (RS) applications, highlighting their ability to adapt and achieve unprecedented advances. In the present review, we provide an exhaustive exploration of the DL approaches proposed specifically for the spatial downscaling of RS imagery. A key contribution of our work is the presentation of the major architectural components, models, metrics, and data sets available for this task, as well as the construction of a compact taxonomy for navigating the various methods. Furthermore, we analyze the limitations of current modeling approaches and briefly discuss promising directions for image enhancement, following the paradigm of general computer vision (CV) practitioners and researchers as a source of inspiration and constructive insight.
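  • Illustrative sketch:
    A toy end-to-end super-resolution network of the kind the review surveys for spatial downscaling, roughly SRCNN-style with a residual refinement of a bicubically upsampled input. The band count, layer sizes, and 2x scale are illustrative assumptions, not a model from the paper.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class TinySR(nn.Module):
            def __init__(self, bands=4, scale=2):          # e.g. 4 spectral bands, 2x upscaling
                super().__init__()
                self.scale = scale
                self.refine = nn.Sequential(
                    nn.Conv2d(bands, 64, 9, padding=4), nn.ReLU(),
                    nn.Conv2d(64, 32, 1), nn.ReLU(),
                    nn.Conv2d(32, bands, 5, padding=2),    # reconstruct the high-resolution bands
                )

            def forward(self, lr):                         # lr: (B, bands, H, W) low-resolution tile
                up = F.interpolate(lr, scale_factor=self.scale, mode="bicubic", align_corners=False)
                return up + self.refine(up)                # residual correction of the upsampled tile

        print(TinySR()(torch.randn(1, 4, 32, 32)).shape)   # torch.Size([1, 4, 64, 64])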

It takes two to tango: Mixup for deep metric learning

  • Authors: Venkataramanan, S., Vasileios Psomas, Ph.D., Kijak, E., Amsaleg, L., Karantzalos, K., Avrithis, Y.
  • Publication: The Tenth International Conference on Learning Representations. Massachusetts: OpenReview.net / University of Massachusetts, 2022.
  • Year: 2022
  • Department: Visual Recognition Group
  • Annotation:
    Metric learning involves learning a discriminative representation such that embeddings of similar classes are encouraged to be close, while embeddings of dissimilar classes are pushed far apart. State-of-the-art methods focus mostly on sophisticated loss functions or mining strategies. On the one hand, metric learning losses consider two or more examples at a time. On the other hand, modern data augmentation methods for classification also consider two or more examples at a time. The combination of the two ideas is under-studied. In this work, we aim to bridge this gap and improve representations using mixup, a powerful data augmentation approach that interpolates two or more examples and their corresponding target labels at a time. This task is challenging because, unlike classification, the loss functions used in metric learning are not additive over examples, so the idea of interpolating target labels is not straightforward. To the best of our knowledge, we are the first to investigate mixing both examples and target labels for deep metric learning. We develop a generalized formulation that encompasses existing metric learning loss functions and modify it to accommodate mixup, introducing Metric Mix, or Metrix. We also introduce a new metric, utilization, to demonstrate that by mixing examples during training we explore areas of the embedding space beyond the training classes, thereby improving representations. To validate the effect of improved representations, we show that mixing inputs, intermediate representations or embeddings along with target labels significantly outperforms state-of-the-art metric learning methods on four benchmark deep metric learning datasets. Code at: https://tinyurl.com/metrix-iclr.
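  • Illustrative sketch:
    A minimal sketch of mixing embeddings together with interpolated target labels, in the spirit of Metrix: a mixed anchor counts as lambda-positive for one class and (1 - lambda)-positive for the other. The plain soft-label contrastive formulation below is illustrative, not the paper's exact loss.

        import torch
        import torch.nn.functional as F

        def mixed_contrastive_loss(anc_a, anc_b, pos_a, pos_b, lam=0.7, tau=0.1):
            """anc_a/anc_b, pos_a/pos_b: (B, D) embeddings from two different classes."""
            mixed = F.normalize(lam * anc_a + (1 - lam) * anc_b, dim=-1)    # mixed anchors
            sim_a = (mixed * F.normalize(pos_a, dim=-1)).sum(-1) / tau      # similarity to class-a positives
            sim_b = (mixed * F.normalize(pos_b, dim=-1)).sum(-1) / tau      # similarity to class-b positives
            logits = torch.stack([sim_a, sim_b], dim=-1)                    # (B, 2)
            targets = torch.tensor([lam, 1 - lam]).expand_as(logits)        # interpolated target labels
            return F.cross_entropy(logits, targets)                         # soft-label cross-entropy (PyTorch >= 1.10)

        emb = lambda: torch.randn(8, 128)
        print(mixed_contrastive_loss(emb(), emb(), emb(), emb()))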

OpenFilter: A Framework to Democratize Research Access to Social Media AR Filters

  • Authors: Riccio, P., Vasileios Psomas, Ph.D., Galati, F., Escolano, F., Hofmann, T., Oliver, N.
  • Publication: Advances in Neural Information Processing Systems 35 (NeurIPS 2022). Montreal: Neural Information Processing Society, 2022. p. 12491-12503. vol. 35. ISSN 1049-5258. ISBN 9781713871088.
  • Year: 2022
  • Department: Visual Recognition Group
  • Annotation:
    Augmented Reality (AR) filters on selfies have become very popular on social media platforms for a variety of applications, including marketing, entertainment and aesthetics. Given the wide adoption of AR face filters and the importance of faces in our social structures and relations, there is increased interest within the scientific community in analyzing the impact of such filters from a psychological, artistic and sociological perspective. However, there are few quantitative analyses in this area, mainly due to the lack of publicly available datasets of facial images with applied AR filters. The proprietary, closed nature of most social media platforms does not allow users, scientists and practitioners to access the code and details of the available AR face filters. Scraping faces from these platforms to collect data is ethically unacceptable and should therefore be avoided in research. In this paper, we present OpenFilter, a flexible framework to apply AR filters available on social media platforms to existing large collections of human faces. Moreover, we share FairBeauty and B-LFW, two beautified versions of the publicly available FairFace and LFW datasets, and we outline insights derived from the analysis of these beautified datasets.

What to Hide from Your Students: Attention-Guided Masked Image Modeling

  • Authors: Kakogeorgiou, I., Gidaris, S., Vasileios Psomas, Ph.D., Avrithis, Y., Bursuc, A., Karantzalos, K., Komodakis, N.
  • Publication: Computer Vision – ECCV 2022. Cham: Springer, 2022. p. 300-318. LNCS. vol. 13675. ISSN 1611-3349. ISBN 978-3-031-19784-0.
  • Year: 2022
  • DOI: 10.1007/978-3-031-20056-4_18
  • Link: https://doi.org/10.1007/978-3-031-20056-4_18
  • Department: Visual Recognition Group
  • Annotation:
    Transformers and masked language modeling are quickly being adopted and explored in computer vision as vision transformers and masked image modeling (MIM). In this work, we argue that image token masking differs from token masking in text, due to the amount and correlation of tokens in an image. In particular, to generate a challenging pretext task for MIM, we advocate a shift from random masking to informed masking. We develop and exhibit this idea in the context of distillation-based MIM, where a teacher transformer encoder generates an attention map, which we use to guide masking for the student. We thus introduce a novel masking strategy, called attention-guided masking (AttMask), and we demonstrate its effectiveness over random masking for dense distillation-based MIM as well as plain distillation-based self-supervised learning on classification tokens. We confirm that AttMask accelerates the learning process and improves the performance on a variety of downstream tasks. We provide the implementation code at https://github.com/gkakogeorgiou/attmask.
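  • Illustrative sketch:
    A minimal sketch of attention-guided masking: given the teacher's [CLS] attention over the patch tokens, hide the most highly attended patches from the student instead of a random subset. The masking ratio and function name are illustrative; the linked repository contains the actual AttMask implementation.

        import torch

        def attention_guided_mask(cls_attn, mask_ratio=0.45):
            """cls_attn: (B, N) teacher attention from [CLS] to the N patch tokens.
            Returns a boolean (B, N) mask, True for patches hidden from the student."""
            B, N = cls_attn.shape
            n_mask = int(mask_ratio * N)
            top = cls_attn.topk(n_mask, dim=1).indices       # most-attended patches
            mask = torch.zeros(B, N, dtype=torch.bool)
            mask.scatter_(1, top, True)
            return mask

        attn = torch.rand(2, 196).softmax(dim=-1)            # dummy teacher attention map
        m = attention_guided_mask(attn)
        print(m.shape, m.sum(dim=1))                         # 196 tokens, 88 masked per image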
