Bc. Petr Ryšavý, MSc., Ph.D.

An Algorithm to Calculate the p-Value of the Monge-Elkan Distance

Authors: Bc. Petr Ryšavý, MSc., Ph.D., prof. Ing. Filip Železný, Ph.D.
Publication: Journal of Computational Biology. 2025, 32(8), 797-812. ISSN 1557-8666.
Year: 2025

DOI: 10.1089/cmb.2024.0854
Link: https://doi.org/10.1089/cmb.2024.0854
Workplace: Intelligent Data Analysis
Abstract:
The Monge-Elkan distance is a straightforward yet popular distance measure used to estimate the mutual similarity of two sets of objects. It was initially proposed in the field of databases, and it found broad usage in other fields. Nowadays, it is especially relevant to the analysis of new-generation sequencing data as it represents a measure of dissimilarity between genomes of two distinct organisms, particularly when applied to unassembled reads. This article provides an algorithm to calculate the p-value associated with the Monge-Elkan distance. Given the object-level null distribution, that is, the distribution of distances between independently and identically sampled objects such as reads, the method yields the null distribution of the Monge-Elkan distance, which in turn allows for calculating the p-value. We also demonstrate an application on sequencing data, where individual reads are compared by the Levenshtein distance.
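To make the distance itself concrete, here is a minimal sketch of a Monge-Elkan-style distance between two read sets, with reads compared by the Levenshtein distance as the abstract mentions. The function names, the averaging of per-read minima, and the symmetrization by a simple mean are illustrative assumptions; the paper's exact definition and its p-value algorithm are not reproduced here.

```python
from typing import Callable, Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def monge_elkan_distance(
    A: Sequence[str],
    B: Sequence[str],
    dist: Callable[[str, str], int] = levenshtein,
) -> float:
    """Average, over objects in A, of the distance to the closest object in B.

    Assumption: both sets are non-empty. The measure is asymmetric.
    """
    return sum(min(dist(a, b) for b in B) for a in A) / len(A)


def symmetric_monge_elkan(A: Sequence[str], B: Sequence[str]) -> float:
    """Hypothetical symmetrization: the mean of the two directed distances."""
    return 0.5 * (monge_elkan_distance(A, B) + monge_elkan_distance(B, A))


reads_a = ["ACGT", "TTTT"]
reads_b = ["ACGA", "TTTT"]
print(symmetric_monge_elkan(reads_a, reads_b))
```

Each read is matched only to its nearest counterpart in the other set, which is why the measure can be evaluated on unassembled reads.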

Causal Learning in Biomedical Applications: Krebs Cycle as a Benchmark

Authors: Xiaoyu He, MSc., Bc. Petr Ryšavý, MSc., Ph.D., Mgr. Jakub Mareček, Ph.D.
Publication: F1000Research. 2025, ISSN 2046-1402.
Year: 2025

Workplace: Department of Computer Science, Artificial Intelligence Center, Intelligent Data Analysis

circGPAcorr: an integrative tool for functional annotation of circular RNAs using expression data

Authors: Bc. Petr Ryšavý, MSc., Ph.D., Anuarbekov, A., Dostálová Merkerová, M., doc. Ing. Jiří Kléma, Ph.D.
Publication: BioData Mining. 2025, 18(1), ISSN 1756-0381.
Year: 2025

DOI: 10.1186/s13040-025-00468-3
Link: https://doi.org/10.1186/s13040-025-00468-3
Workplace: Intelligent Data Analysis
Abstract:
Circular RNAs play a crucial role in cell development and serve as biomarkers in many diseases. Nevertheless, the function of many circular RNAs remains unknown. This function can be inferred from sponging and silencing interactions with micro RNAs and messenger RNAs. We recently proposed a network-based circRNA functional annotation tool, circGPA. However, validation data for RNA interactions are often sparse and predicted interactions contain many false positives. To address this issue, we propose an extended algorithm named circGPAcorr, which uses expression data to weight the interactions, resulting in more precise functional annotation. To assess the significance of the results, the p-value is calculated using reduction to circGPA, a generating-polynomial-based method. We show that the problem is #P-hard, and thus computationally difficult. The circGPAcorr algorithm is tested on publicly available myelodysplastic syndromes expression data, providing gene ontology annotations that align with the literature on myelodysplastic syndromes. At the same time, we demonstrate its performance in the circRNA-disease annotation task.

Joint Problems in Learning Multiple Dynamical Systems

Authors: Niu, M., Xiaoyu He, MSc., Bc. Petr Ryšavý, MSc., Ph.D., Zhou, Q., Mgr. Jakub Mareček, Ph.D.
Publication: Joint Problems in Learning Multiple Dynamical Systems. Praha: CTU FEE. Department of Computer Science, 2025.
Year: 2025

Workplace: Department of Computer Science, Artificial Intelligence Center, Intelligent Data Analysis

GPACDA – circRNA-Disease Association Prediction with Generating Polynomials

Authors: Bc. Petr Ryšavý, MSc., Ph.D., doc. Ing. Jiří Kléma, Ph.D., Merkerová, M.D.
Publication: Bioinformatics and Biomedical Engineering. Springer, Cham, 2024. p. 33-48. Lecture Notes in Computer Science. vol. 14848 LNBI. ISSN 0302-9743. ISBN 978-3-031-64629-4.
Year: 2024

DOI: 10.1007/978-3-031-64629-4_3
Link: https://doi.org/10.1007/978-3-031-64629-4_3
Workplace: Intelligent Data Analysis
Abstract:
Circular RNA, a molecule with partially understood functions, has been implicated in various diseases. Therefore, there is a vast effort to predict associations between circular RNAs and diseases. In our recent study, we introduced circGPA, an algorithm that enables the annotation of circular RNAs with gene ontology terms through interactions with miRNAs and mRNAs. Recognizing the analytical similarity in predicting disease associations, we developed GPACDA, an extension of circGPA tailored for disease associations. The benefits of our methods include explainability, as the outputs are based on known interactions and associations, as well as the rigorous calculation of the p-value, which the circGPA algorithm can compute. We compared our method with two other tools, NCPCDA and DWNCPCDA, using a subset of the CDASOR dataset and showed that GPACDA overcomes its competitors in terms of true association ranks. Our method’s code and predictions are publicly accessible.

Expression of Circular RNAs in Myelodysplastic Neoplasms and their Association with Mutations in the Splicing Factor Gene SF3B1

Authors: Trsova, I., Hrustincova, A., Krejcik, Z., Kundrat, D., Holoubek, A., Staflova, K., Janstova, L., Vanikova, S., Szikszai, K., doc. Ing. Jiří Kléma, Ph.D., Bc. Petr Ryšavý, MSc., Ph.D., Belickova, M., Kaisrlikova, M., Vesela, J., Cermak, J., Jonasova, A., Dostal, J., Fric, J., Musil, J., Dostalova Merkerova, M.
Publication: Molecular Oncology. 2023, 17(12), 2565-2583. ISSN 1574-7891.
Year: 2023

DOI: 10.1002/1878-0261.13486
Link: https://doi.org/10.1002/1878-0261.13486
Workplace: Department of Computer Science, Intelligent Data Analysis
Abstract:
Mutations in the splicing factor 3b subunit 1 (SF3B1) gene are frequent in myelodysplastic neoplasms (MDS). Because the splicing process is involved in the production of circular RNAs (circRNAs), we investigated the impact of SF3B1 mutations on circRNA processing. Using RNA sequencing, we measured circRNA expression in CD34+ bone marrow MDS cells. We defined circRNAs deregulated in a heterogeneous group of MDS patients and described increased circRNA formation in higher-risk MDS. We showed that the presence of SF3B1 mutations did not affect the global production of circRNAs; however, deregulation of specific circRNAs was observed. In particular, we demonstrated strong upregulation of circRNAs processed from the zinc finger E-box binding homeobox 1 (ZEB1) transcription factor; this upregulation was exclusive to SF3B1-mutated patients and was not observed in those with mutations in other splicing factors or other recurrently mutated genes, or with other clinical variables. Furthermore, we focused on the most upregulated ZEB1-circRNA, hsa_circ_0000228, and, by its knockdown, we demonstrated that its expression is related to mitochondrial activity. Using microRNA analyses, we proposed miR-1248 as a direct target of hsa_circ_0000228. To conclude, we demonstrated that mutated SF3B1 leads to deregulation of ZEB1-circRNAs, potentially contributing to the defects in mitochondrial metabolism observed in SF3B1-mutated MDS.

Reference-free phylogeny from sequencing data

Authors: Bc. Petr Ryšavý, MSc., Ph.D., prof. Ing. Filip Železný, Ph.D.
Publication: BioData Mining. 2023, 2023(16), ISSN 1756-0381.
Year: 2023

DOI: 10.1186/s13040-023-00329-x
Link: https://doi.org/10.1186/s13040-023-00329-x
Workplace: Intelligent Data Analysis
Abstract:
Motivation: Clustering of genetic sequences is one of the key parts of bioinformatics analyses. Resulting phylogenetic trees are beneficial for solving many research questions, including tracing the history of species, studying migration in the past, or tracing a source of a virus outbreak. At the same time, biologists provide more data in the raw form of reads or only on contig-level assembly. Therefore, tools that are able to process those data without supervision need to be developed. Results: In this paper, we present a tool for reference-free phylogeny capable of handling data where no mature-level assembly is available. The tool allows distance calculation for raw reads, contigs, and the combination of the latter. The tool provides an estimation of the Levenshtein distance between the sequences, which in turn estimates the number of mutations between the organisms. Compared to the previous research, the novelty of the method lies in a newly proposed combination of the read and contig measures, a new method for read-contig mapping, and an efficient embedding of contigs.

circGPA: circRNA functional annotation based on probability-generating functions

Authors: Bc. Petr Ryšavý, MSc., Ph.D., doc. Ing. Jiří Kléma, Ph.D., Merkerová, M.
Publication: BMC Bioinformatics. 2022, 2022(23), 392-414. ISSN 1471-2105.
Year: 2022

DOI: 10.1186/s12859-022-04957-8
Link: https://doi.org/10.1186/s12859-022-04957-8
Workplace: Intelligent Data Analysis
Abstract:
Recent research has already shown that circular RNAs (circRNAs) are functional in gene expression regulation and potentially related to diseases. Due to their stability, circRNAs can also be used as biomarkers for diagnosis. However, the function of most circRNAs remains unknown, and it is expensive and time-consuming to discover it through biological experiments. In this paper, we predict circRNA annotations from the knowledge of their interaction with miRNAs and subsequent miRNA–mRNA interactions. First, we construct an interaction network for a target circRNA, and second, we spread the information from the network nodes with known function to the root circRNA node. This idea itself is not new; our main contribution lies in proposing an efficient and exact deterministic procedure based on the principle of probability-generating functions to calculate the p-value of the association test between a circRNA and an annotation term. We show that our publicly available algorithm is both more effective and more efficient than the commonly used Monte-Carlo sampling approach, which may suffer from difficult quantification of sampling convergence and subsequent sampling inefficiency. We experimentally demonstrate that the new approach is two orders of magnitude faster than Monte-Carlo sampling, which makes summary annotation of large circRNA files feasible; this includes their reannotation after periodic interaction network updates, for example. We provide a summary annotation of a current circRNA database as one of our outputs. The proposed algorithm could be generalized to other types of RNA in a straightforward way.
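The general principle of replacing sampling with probability-generating functions can be illustrated on a toy problem. The sketch below is not the circGPA algorithm; it only shows, under the assumption of independent non-negative integer scores, that multiplying generating polynomials (convolving their coefficient lists) gives the exact null distribution of a sum, from which an upper-tail p-value follows without Monte-Carlo sampling. All function names are hypothetical.

```python
from typing import List, Sequence


def convolve(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Multiply two polynomials given as coefficient lists (index = exponent)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def sum_distribution(pmfs: Sequence[Sequence[float]]) -> List[float]:
    """Exact PMF of a sum of independent non-negative integer variables.

    Each pmf[k] is P(X = k); the PGF of the sum is the product of the PGFs.
    """
    total = [1.0]  # PGF of the constant 0
    for pmf in pmfs:
        total = convolve(total, pmf)
    return total


def upper_tail_p_value(pmfs: Sequence[Sequence[float]], observed: int) -> float:
    """P(sum >= observed) under the null, computed exactly."""
    dist = sum_distribution(pmfs)
    return sum(dist[max(observed, 0):])


# Three fair coin flips: probability of observing at least two successes.
print(upper_tail_p_value([[0.5, 0.5]] * 3, observed=2))
```

Because the distribution is computed exactly, there is no sampling convergence to monitor, which is the kind of advantage over Monte-Carlo estimation that the abstract describes.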

Approximate search in genomic data

Authors: Bc. Petr Ryšavý, MSc., Ph.D.
Publication: Proceedings of the International Student Scientific Conference Poster – 23/2019. Praha: ČVUT FEL, Středisko vědecko-technických informací, 2019. p. 155-159. 1. vol. 1. ISBN 978-80-01-06581-5.
Year: 2019

Workplace: Department of Computer Science, Intelligent Data Analysis
Abstract:
Genomic data are becoming some of the most voluminous data in the world. Therefore, there is a need for efficient manipulation and mining of those data. However, with the advances in technology, researchers often publish sequencing data in the raw form of reads. In this paper, we evaluate the effects of replacing a sequence with read data on a search under the Levenshtein distance. Namely, we provide several experiments that show how well the search works for read data under various conditions.

Estimating sequence similarity from read sets for clustering next-generation sequencing data

Authors: Bc. Petr Ryšavý, MSc., Ph.D., prof. Ing. Filip Železný, Ph.D.
Publication: Data Mining and Knowledge Discovery. 2019, 33(1), 1-23. ISSN 1384-5810.
Year: 2019

DOI: 10.1007/s10618-018-0584-8
Link: https://doi.org/10.1007/s10618-018-0584-8
Workplace: Department of Computer Science, Intelligent Data Analysis
Abstract:
Computing mutual similarity of biological sequences such as DNA molecules is essential for significant biological tasks such as hierarchical clustering of genomes. Current sequencing technologies do not provide the content of entire biological sequences; rather they identify a large number of small substrings called reads, sampled at random places of the target sequence. To estimate similarity of two sequences from their read-set representations, one may try to reconstruct each one first from its read set, and then employ conventional (dis)similarity measures such as the edit distance on the assembled sequences. Due to the nature of data, sequence assembly often cannot provide a single putative sequence that matches the true DNA. Therefore, we propose instead to estimate the similarities directly from the read sets. Our approach is based on an adaptation of the Monge-Elkan similarity known from the field of databases, avoiding the sequence assembly step. For low-coverage (i.e. small) read set samples, it yields a better approximation of the true sequence similarities. This in turn results in better clustering in comparison to the first-assemble-then-cluster approach. Put differently, for a fixed estimation accuracy, our approach requires smaller read sets and thus entails reduced wet-lab costs.

On Sequence Overlaps Minimizing Post-normalized Edit Distance

Authors: Bc. Petr Ryšavý, MSc., Ph.D.
Publication: Proceedings of the International Student Scientific Conference Poster – 22/2018. Praha: Czech Technical University in Prague, 2018. p. 1-5. ISBN 978-80-01-06428-3.
Year: 2018

Workplace: Department of Computer Science, Intelligent Data Analysis
Abstract:
Computational biology deals with processing of genetic sequences. Next-generation sequencing machines produce sets of short substrings of a DNA sequence, called reads, which are further assembled into contigs. A method was recently proposed for measuring the distance between two sequences using only a contig-level assembly. The original paper used a heuristic for finding a suitable overlap of two contigs. The heuristic was, however, not evaluated on its own in the original paper. Here, we address this issue, and our results show that the relative error of the heuristic is small and that the heuristic provides the exact result in more than two-thirds of the cases.

Estimating Sequence Similarity from Contig Sets

Authors: Bc. Petr Ryšavý, MSc., Ph.D., prof. Ing. Filip Železný, Ph.D.
Publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Düsseldorf: Springer VDI Verlag, 2017. p. 272-283. ISSN 0302-9743. ISBN 978-3-319-68764-3.
Year: 2017

DOI: 10.1007/978-3-319-68765-0_23
Link: https://doi.org/10.1007/978-3-319-68765-0_23
Workplace: Department of Computer Science
Abstract:
A key task in computational biology is to determine mutual similarity of two genomic sequences. Current bio-technologies are usually not able to determine the full sequential content of a genome from biological material, and rather produce a set of large substrings (contigs) whose order and relative mutual positions within the genome are unknown. Here we design a function estimating the sequential similarity (in terms of the inverse Levenshtein distance) of two genomes, given their respective contig-sets. Our approach consists of two steps, based respectively on an adaptation of the tractable Smith-Waterman local alignment algorithm, and a problem reduction to the weighted interval scheduling problem soluble efficiently with dynamic programming. In hierarchical-clustering experiments with Influenza and Hepatitis genomes, our approach outperforms the standard baseline where only the longest contigs are compared. For high-coverage settings, it also outperforms estimates produced by the recent method [8] that avoids contig construction completely.
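The weighted interval scheduling problem named in the abstract has a textbook dynamic-programming solution, sketched below. This is the generic algorithm only; how the paper maps contig alignments to intervals and weights is not shown, and the tuple layout `(start, end, weight)` with half-open intervals is an assumption made for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

Interval = Tuple[float, float, float]  # (start, end, weight), half-open [start, end)


def max_weight_schedule(intervals: List[Interval]) -> float:
    """Maximum total weight of a set of pairwise non-overlapping intervals."""
    ordered = sorted(intervals, key=lambda iv: iv[1])  # sort by end point
    ends = [iv[1] for iv in ordered]
    # best[i] = optimum using only the first i intervals in end-point order
    best = [0.0] * (len(ordered) + 1)
    for i, (start, _end, weight) in enumerate(ordered, start=1):
        # k = how many of the earlier intervals finish at or before `start`
        k = bisect_right(ends, start, 0, i - 1)
        # Either skip interval i, or take it together with the best of the first k
        best[i] = max(best[i - 1], best[k] + weight)
    return best[-1]


print(max_weight_schedule([(0, 3, 5), (2, 5, 6), (4, 7, 5), (6, 9, 4)]))
```

Sorting plus one binary search per interval gives O(n log n) time, which is what makes such a reduction efficiently solvable.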

Using Tries for Evaluating Monge-Elkan Distance on Genomic Sequences

Authors: Bc. Petr Ryšavý, MSc., Ph.D.
Publication: Proceedings of the International Student Scientific Conference Poster – 21/2017. Praha: Czech Technical University in Prague, 2017. p. 1-5. ISBN 978-80-01-06153-4.
Year: 2017

Workplace: Department of Computer Science
Abstract:
Processing of genetic sequences has become a popular task in recent years. Next-generation sequencing machines produce sets of short substrings of a DNA sequence, called reads. A method was recently proposed to measure the distance between read sets using the Monge-Elkan similarity known from the field of databases. In this paper, we study the applicability of tries to the Monge-Elkan similarity. Tries have been used successfully to improve the runtime of dictionary search and similarity joins, which are problems similar to evaluating the Monge-Elkan similarity. We show that our approach outperforms the straightforward evaluation of the Monge-Elkan similarity for very short reads. However, the speedup achieved is smaller than for dictionary search and similarity joins.

Estimating Sequence Similarity from Read Sets for Clustering Sequencing Data

Authors: Bc. Petr Ryšavý, MSc., Ph.D., prof. Ing. Filip Železný, Ph.D.
Publication: ADVANCES IN INTELLIGENT DATA ANALYSIS XV - Lecture Notes in Computer Science. Wien: Springer, 2016. pp. 204-214. ISSN 0302-9743. ISBN 978-3-319-46348-3.
Year: 2016

DOI: 10.1007/978-3-319-46349-0_18
Link: https://doi.org/10.1007/978-3-319-46349-0_18
Workplace: Department of Computer Science
Abstract:
Clustering biological sequences is a central task in bioinformatics. The typical result of new-generation sequencers is a set of short substrings (“reads”) of a target sequence, rather than the sequence itself. To cluster sequences given only their read-set representations, one may try to reconstruct each one from the corresponding read set, and then employ conventional (dis)similarity measures such as the edit distance on the assembled sequences. This approach is however problematic and we propose instead to estimate the similarities directly from the read sets. Our approach is based on an adaptation of the Monge-Elkan similarity known from the field of databases. It avoids the NP-hard problem of sequence assembly and in empirical experiments it results in a better approximation of the true sequence similarities and consequently in better clustering, in comparison to the first-assemble-then-cluster approach.
