People

Bc. Petr Ryšavý, MSc.

All publications

circGPA: circRNA functional annotation based on probability-generating functions

  • DOI: 10.1186/s12859-022-04957-8
  • Link: https://doi.org/10.1186/s12859-022-04957-8
  • Affiliation: Intelligent Data Analysis
  • Abstract:
    Recent research has shown that circular RNAs (circRNAs) are functional in gene expression regulation and potentially related to diseases. Due to their stability, circRNAs can also be used as biomarkers for diagnosis. However, the function of most circRNAs remains unknown, and it is expensive and time-consuming to discover it through biological experiments. In this paper, we predict circRNA annotations from the knowledge of their interaction with miRNAs and subsequent miRNA–mRNA interactions. First, we construct an interaction network for a target circRNA; second, we spread the information from the network nodes with known function to the root circRNA node. This idea itself is not new; our main contribution lies in proposing an efficient and exact deterministic procedure based on the principle of probability-generating functions to calculate the p-value of the association test between a circRNA and an annotation term. We show that our publicly available algorithm is both more effective and more efficient than the commonly used Monte-Carlo sampling approach, which may suffer from difficult quantification of sampling convergence and subsequent sampling inefficiency. We experimentally demonstrate that the new approach is two orders of magnitude faster than Monte-Carlo sampling, which makes summary annotation of large circRNA files feasible; this includes, for example, their reannotation after periodic interaction network updates. We provide a summary annotation of a current circRNA database as one of our outputs. The proposed algorithm could be generalized to other types of RNA in a straightforward way.
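The contrast between an exact computation based on probability-generating functions and Monte-Carlo sampling can be illustrated with a toy model. The sketch below is not the circGPA procedure; it assumes a hypothetical test statistic that is a sum of independent Bernoulli indicators with probabilities `probs`. Under that assumption the probability-generating function is the polynomial product of the factors (1 - p_i + p_i z), its coefficients are the exact distribution of the sum, and a tail sum gives an exact p-value without any sampling. The function names are illustrative only.

```python
def pgf_coefficients(probs):
    """Expand prod_i (1 - p_i + p_i * z); coeffs[k] equals P(S = k).

    Assumption for this sketch: S is a sum of independent Bernoulli
    indicators with success probabilities `probs`.
    """
    coeffs = [1.0]  # PGF of the empty sum is the constant polynomial 1
    for p in probs:
        nxt = [0.0] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            nxt[k] += c * (1.0 - p)   # indicator is 0: degree unchanged
            nxt[k + 1] += c * p       # indicator is 1: degree grows by one
        coeffs = nxt
    return coeffs


def exact_p_value(probs, observed):
    """Exact upper-tail probability P(S >= observed)."""
    return sum(pgf_coefficients(probs)[observed:])
```

For two fair indicators, `exact_p_value([0.5, 0.5], 1)` evaluates to 0.75, the probability that at least one of them is 1.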

Approximate search in genomic data

  • Authors: Bc. Petr Ryšavý, MSc.
  • Publication: Proceedings of the International Student Scientific Conference Poster – 23/2019. Praha: ČVUT FEL, Středisko vědecko-technických informací, 2019. p. 155-159. 1. vol. 1. ISBN 978-80-01-06581-5.
  • Year: 2019
  • Affiliation: Department of Computer Science, Intelligent Data Analysis
  • Abstract:
    Genomic data are becoming some of the most voluminous data in the world. Therefore, there is a need for efficient manipulation and mining of those data. However, with the advances in technology, researchers often publish sequencing data in the raw form of reads. In this paper, we evaluate the effects of replacing a sequence with read data on a search under the Levenshtein distance. Namely, we provide several experiments that show how well the search works for read data under various conditions.
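As a point of reference for the search described above, here is a minimal sketch of an approximate search under the Levenshtein distance over whole sequences. It is a plain baseline, not the read-based search evaluated in the paper; the function names and the `max_distance` parameter are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (unit-cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def approximate_search(query, database, max_distance):
    """Return all database sequences within max_distance edits of the query."""
    return [s for s in database if levenshtein(query, s) <= max_distance]
```

A linear scan like this costs one full dynamic-programming table per database entry, so it only serves to make the search criterion concrete.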

Estimating sequence similarity from read sets for clustering next-generation sequencing data

  • DOI: 10.1007/s10618-018-0584-8
  • Link: https://doi.org/10.1007/s10618-018-0584-8
  • Affiliation: Department of Computer Science, Intelligent Data Analysis
  • Abstract:
    Computing mutual similarity of biological sequences such as DNA molecules is essential for significant biological tasks such as hierarchical clustering of genomes. Current sequencing technologies do not provide the content of entire biological sequences; rather they identify a large number of small substrings called reads, sampled at random places of the target sequence. To estimate similarity of two sequences from their read-set representations, one may try to reconstruct each one first from its read set, and then employ conventional (dis)similarity measures such as the edit distance on the assembled sequences. Due to the nature of data, sequence assembly often cannot provide a single putative sequence that matches the true DNA. Therefore, we propose instead to estimate the similarities directly from the read sets. Our approach is based on an adaptation of the Monge-Elkan similarity known from the field of databases, avoiding the sequence assembly step. For low-coverage (i.e. small) read set samples, it yields a better approximation of the true sequence similarities. This in turn results in better clustering in comparison to the first-assemble-then-cluster approach. Put differently, for a fixed estimation accuracy, our approach requires smaller read sets and thus entails reduced wet-lab costs.
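The abstract builds on the Monge-Elkan similarity. The sketch below shows only the classic, asymmetric form from the database literature, applied to two read sets: every read in the first set is matched with its most similar read in the second set, and the best scores are averaged. The paper uses an adaptation of this measure whose details are not given in the abstract, so the inner similarity used here (one minus the edit distance divided by the longer read length) is an assumption for illustration.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def read_similarity(a, b):
    """Assumed inner similarity in [0, 1]: 1 - edit distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def monge_elkan(reads_a, reads_b):
    """Classic Monge-Elkan: mean over reads_a of the best match in reads_b."""
    if not reads_a or not reads_b:
        raise ValueError("both read sets must be non-empty")
    return sum(max(read_similarity(a, b) for b in reads_b)
               for a in reads_a) / len(reads_a)
```

Note that this form is not symmetric: `monge_elkan(A, B)` and `monge_elkan(B, A)` can differ, and no assembly step is involved at any point.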

On Sequence Overlaps Minimizing Post-normalized Edit Distance

  • Authors: Bc. Petr Ryšavý, MSc.
  • Publication: Proceedings of the International Student Scientific Conference Poster – 22/2018. Praha: Czech Technical University in Prague, 2018. p. 1-5. ISBN 978-80-01-06428-3.
  • Year: 2018
  • Affiliation: Department of Computer Science, Intelligent Data Analysis
  • Abstract:
    Computational biology deals with the processing of genetic sequences. Next-generation sequencing machines produce sets of short substrings of a DNA sequence, called reads, which are further assembled into contigs. A method was recently proposed for measuring the distance between two sequences using only an assembly at the level of contigs. The original paper used a heuristic for finding a suitable overlap of two contigs. The heuristic was, however, not evaluated on its own in the original paper. Here, we address this issue, and our results show that the relative error of the heuristic is small and that the heuristic provides the exact result in more than two-thirds of the cases.

Estimating Sequence Similarity from Contig Sets

  • Authors: Bc. Petr Ryšavý, MSc., prof. Ing. Filip Železný, Ph.D.
  • Publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Düsseldorf: Springer VDI Verlag, 2017. p. 272-283. ISSN 0302-9743. ISBN 978-3-319-68764-3.
  • Year: 2017
  • DOI: 10.1007/978-3-319-68765-0_23
  • Link: https://doi.org/10.1007/978-3-319-68765-0_23
  • Affiliation: Department of Computer Science
  • Abstract:
    A key task in computational biology is to determine the mutual similarity of two genomic sequences. Current biotechnologies are usually not able to determine the full sequential content of a genome from biological material, and rather produce a set of large substrings (contigs) whose order and relative mutual positions within the genome are unknown. Here we design a function estimating the sequential similarity (in terms of the inverse Levenshtein distance) of two genomes, given their respective contig sets. Our approach consists of two steps, based respectively on an adaptation of the tractable Smith-Waterman local alignment algorithm, and on a reduction to the weighted interval scheduling problem, which is efficiently solvable with dynamic programming. In hierarchical-clustering experiments with Influenza and Hepatitis genomes, our approach outperforms the standard baseline where only the longest contigs are compared. For high-coverage settings, it also outperforms estimates produced by the recent method [8] that avoids contig construction completely.
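The second step mentioned above reduces to weighted interval scheduling. The following is a minimal sketch of the standard dynamic program for that problem. How contig alignments are turned into intervals and weights is not stated in the abstract, so the `(start, end, weight)` triples here are purely hypothetical inputs, and intervals are treated as half-open so that one may start exactly where another ends.

```python
from bisect import bisect_right


def weighted_interval_scheduling(intervals):
    """Maximum total weight of pairwise non-overlapping intervals.

    `intervals` is a list of (start, end, weight) with half-open [start, end).
    """
    intervals = sorted(intervals, key=lambda iv: iv[1])  # order by end point
    ends = [iv[1] for iv in intervals]
    best = [0] * (len(intervals) + 1)  # best[i]: optimum over the first i intervals
    for i, (start, _end, weight) in enumerate(intervals, start=1):
        # p = how many of the first i-1 intervals end at or before `start`
        p = bisect_right(ends, start, 0, i - 1)
        # either skip interval i, or take it together with the optimum over the first p
        best[i] = max(best[i - 1], best[p] + weight)
    return best[-1]
```

After sorting, each interval needs one binary search, so the whole procedure runs in O(n log n) time for n intervals.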

Using Tries for Evaluating Monge-Elkan Distance on Genomic Sequences

  • Authors: Bc. Petr Ryšavý, MSc.
  • Publication: Proceedings of the International Student Scientific Conference Poster – 21/2017. Praha: Czech Technical University in Prague, 2017. p. 1-5. ISBN 978-80-01-06153-4.
  • Year: 2017
  • Affiliation: Department of Computer Science
  • Abstract:
    Processing of genetic sequences has become a popular task in recent years. Next-generation sequencing machines produce sets of short substrings of a DNA sequence, called reads. A method was recently proposed for measuring the distance between read sets using the Monge-Elkan similarity known from the field of databases. In this paper, we study the applicability of tries to the Monge-Elkan similarity. Tries have been used with success for improving the runtime of dictionary search and similarity joins, which are problems similar to evaluating the Monge-Elkan similarity. We show that our approach outperforms the straightforward evaluation of the Monge-Elkan similarity for very short reads. However, the speedup achieved is smaller than for dictionary search and similarity joins.
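The dictionary-search use of tries mentioned above can be sketched as follows. This is the generic trie-pruned edit-distance search, not the paper's Monge-Elkan evaluation: reads sharing a prefix share the corresponding rows of the edit-distance table, and a subtree is skipped once every entry in the current row exceeds the threshold. The nested-dict trie layout, the `"$"` end marker, and the function names are assumptions made for this illustration.

```python
def build_trie(words):
    """Nested-dict trie; the key "$" stores the complete word at its end node.

    Assumes "$" never occurs inside a word (true for DNA reads).
    """
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root


def trie_search(root, query, max_distance):
    """Return sorted (word, distance) pairs with edit distance <= max_distance."""
    results = []

    def visit(node, row):
        # row[j] = edit distance between the current trie prefix and query[:j]
        if "$" in node and row[-1] <= max_distance:
            results.append((node["$"], row[-1]))
        for ch, child in node.items():
            if ch == "$":
                continue
            new_row = [row[0] + 1]
            for j, qc in enumerate(query, start=1):
                new_row.append(min(new_row[j - 1] + 1,       # skip a query character
                                   row[j] + 1,               # skip the trie character
                                   row[j - 1] + (qc != ch)))  # substitute or match
            if min(new_row) <= max_distance:  # otherwise no extension can succeed
                visit(child, new_row)

    visit(root, list(range(len(query) + 1)))
    return sorted(results)
```

The pruning is valid because the minimum of a row never decreases as the trie prefix grows, so a row whose minimum is above the threshold cannot lead to a match further down.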

Estimating Sequence Similarity from Read Sets for Clustering Sequencing Data

  • DOI: 10.1007/978-3-319-46349-0_18
  • Link: https://doi.org/10.1007/978-3-319-46349-0_18
  • Affiliation: Department of Computer Science
  • Abstract:
    Clustering biological sequences is a central task in bioinformatics. The typical result of new-generation sequencers is a set of short substrings (“reads”) of a target sequence, rather than the sequence itself. To cluster sequences given only their read-set representations, one may try to reconstruct each one from the corresponding read set, and then employ conventional (dis)similarity measures such as the edit distance on the assembled sequences. This approach is, however, problematic, and we propose instead to estimate the similarities directly from the read sets. Our approach is based on an adaptation of the Monge-Elkan similarity known from the field of databases. It avoids the NP-hard problem of sequence assembly, and in empirical experiments it results in a better approximation of the true sequence similarities and consequently in better clustering than the first-assemble-then-cluster approach.

Responsible for this page: Ing. Mgr. Radovan Suk