Persons

prof. Ing. Filip Železný, Ph.D.

All publications

Automatic Conjecturing of P-Recursions Using Lifted Inference

  • DOI: 10.1007/978-3-030-97454-1_2
  • Link: https://doi.org/10.1007/978-3-030-97454-1_2
  • Department: Intelligent Data Analysis
  • Annotation:
    Recent progress in lifted inference algorithms has made it possible to solve many non-trivial counting tasks from enumerative combinatorics in an automated fashion, by casting them as first-order model counting problems. Algorithms for this problem typically output a single number, which is the number of models of the first-order logic sentence in question on a given domain. However, in the combinatorics setting, we are more interested in obtaining a mathematical formula that holds for any given structure size. In this paper, we show that one can use lifted inference algorithms to conjecture linear recurrences with polynomial coefficients, one such class of formulas of interest.

Forty years of score-based soccer match outcome prediction: an experimental review

  • DOI: 10.1093/imaman/dpab029
  • Link: https://doi.org/10.1093/imaman/dpab029
  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    We investigate the state-of-the-art in score-based soccer match outcome modelling to identify the top-performing methods across diverse classes of existing approaches to the problem. Namely, we bring together various statistical methods based on Poisson and Weibull distributions and several general ranking algorithms (Elo, Steph ratings, Gaussian-OD ratings) as well as domain-specific rating systems (Berrar ratings, pi-ratings). We review, reimplement and experimentally compare these diverse competitors altogether on the largest database of soccer results available to identify true leaders. Our results reveal that the individual predictions, as well as the overall performances, are very similar across the top models tested, likely suggesting the limits of this generic approach to score-based match outcome modelling. No study of a similar scale has previously been done.

Learning to Generate Molecules From Small Datasets Using Neural Markov Logic Networks

  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    Neural Markov Logic networks are a statistical relational model capable of generating relational structures. In this paper, we investigate how this particular model behaves in the setup of few-shot learning and show that Neural Markov Logic Networks are able to learn to generate small molecules from a handful of training examples without any pre-training.

Learning with Molecules beyond Graph Neural Networks

  • Department: Intelligent Data Analysis
  • Annotation:
    We demonstrate a deep learning framework which is inherently based in the highly expressive language of relational logic, enabling to, among other things, capture arbitrarily complex graph structures. We show how Graph Neural Networks and similar models can be easily covered in the framework by specifying the underlying propagation rules in the relational logic. The declarative nature of the used language then allows to easily modify and extend the existing propagation schemes into more complex structures, such as atom rings in molecules, which we choose for a short demonstration in this work

Beyond graph neural networks with lifted relational neural networks

  • DOI: 10.1007/s10994-021-06017-3
  • Link: https://doi.org/10.1007/s10994-021-06017-3
  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    We introduce a declarative differentiable programming framework, based on the language of Lifted Relational Neural Networks, where small parameterized logic programs are used to encode deep relational learning scenarios through the underlying symmetries. When presented with relational data, such as various forms of graphs, the logic program interpreter dynamically unfolds differentiable computation graphs to be used for the program parameter optimization by standard means. Following from the declarative, relational logic-based encoding, this results into a unified representation of a wide range of neural models in the form of compact and elegant learning programs, in contrast to the existing procedural approaches operating directly on the computational graph level. We illustrate how this idea can be used for a concise encoding of existing advanced neural architectures, with the main focus on Graph Neural Networks (GNNs). Importantly, using the framework, we also show how the contemporary GNN models can be easily extended towards higher expressiveness in various ways. In the experiments, we demonstrate correctness and computation efficiency through comparison against specialized GNN frameworks, while shedding some light on the learning performance of the existing GNN models.

Lossless Compression of Structured Convolutional Models via Lifting

  • Department: Intelligent Data Analysis
  • Annotation:
    Lifting is an efficient technique to scale up graphical models generalized to relational domains by exploiting the underlying symmetries. Concurrently, neural models are continuously expanding from grid-like tensor data into structured representations, such as various attributed graphs and relational databases. To address the irregular structure of the data, the models typically extrapolate on the idea of convolution, effectively introducing parameter sharing in their, dynamically unfolded, computation graphs. The computation graphs themselves then reflect the symmetries of the underlying data, similarly to the lifted graphical models. Inspired by lifting, we introduce a simple and efficient technique to detect the symmetries and compress the neural models without loss of any information. We demonstrate through experiments that such compression can lead to significant speedups of structured convolutional models, such as various Graph Neural Networks, across various tasks, such as molecule classification and knowledge-base completion.

Optimal sports betting strategies in practice: an experimental review

  • DOI: 10.1093/imaman/dpaa029
  • Link: https://doi.org/10.1093/imaman/dpaa029
  • Department: Intelligent Data Analysis
  • Annotation:
    We investigate the most popular approaches to the problem of sports betting investment based on modern portfolio theory and the Kelly criterion. We define the problem setting, the formal investment strategies and review their common modifications used in practice. The underlying purpose of the reviewed modifications is to mitigate the additional risk stemming from the unrealistic mathematical assumptions of the formal strategies. We test the resulting methods using a unified evaluation protocol for three sports: horse racing, basketball and soccer. The results show the practical necessity of the additional risk-control methods and demonstrate their individual benefits. Particularly, an adaptive variant of the popular ‘fractional Kelly’ method is a very suitable choice across a wide range of settings.

Finding Semantic Patterns in Omics Data Using Concept Rule Learning with an Ontology-based Refinement Operator

  • DOI: 10.1186/s13040-020-00219-6
  • Link: https://doi.org/10.1186/s13040-020-00219-6
  • Department: Intelligent Data Analysis
  • Annotation:
    Identification of non-trivial and meaningful patterns in omics data is one of the most important biological tasks. The patterns help to better understand biological systems and interpret experimental outcomes. A well-established method serving to explain such biological data is Gene Set Enrichment Analysis. However, this type of analysis is restricted to a specific type of evaluation. Abstracting from details, the analyst provides a sorted list of genes and ontological annotations of the individual genes; the method outputs a subset of ontological terms enriched in the gene list. Here, in contrary to enrichment analysis, we introduce a new tool/framework that allows for the induction of more complex patterns of 2-dimensional binary omics data. This extension allows to discover and describe semantically coherent biclusters.

Learning with Molecules beyond Graph Neural Networks

  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    In this paper we demonstrate a deep learning framework which is inherently based in the highly expressive language of relational logic, enabling to, among other things, capture arbitrarily complex graph structures. We show how GNNs and similar models can be easily covered in the framework by specifying the underlying propagation rules in the relational logic. The declarative nature of the used language then allows to easily modify and extend the propagation schemes into complex structures, such as the molecular rings which we choose for a short demonstration in this paper.

Deep Learning from Spatial Relations for Soccer Pass Prediction

  • DOI: 10.1007/978-3-030-17274-9_14
  • Link: https://doi.org/10.1007/978-3-030-17274-9_14
  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    We propose a convolutional architecture for learning representations over spatial relations in the game of soccer, with the goal to predict individual passes between players, as a submission to the prediction challenge organized for the 5th Workshop on Machine Learning and Data Mining for Sports Analytics. The goal of the challenge was to predict the receiver of a pass given location of the sender and all other players. From each soccer situation, we extract spatial relations between the players and a few key locations on the field, which are then hierarchically aggregated within the neural architecture designed to extract possibly complex gameplay patterns stemming from these simple relations. The use of convolutions then allows to efficiently capture the various regularities that are inherent to the game. In the experiments, we show very promising performance of the method.

Efficient Extraction of Network Event Types from NetFlows

  • DOI: 10.1155/2019/8954914
  • Link: https://doi.org/10.1155/2019/8954914
  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    To perform sophisticated traffic analysis, such as intrusion detection, network monitoring tools firstly need to extract higher-level information from lower-level data by reconstructing events and activities from as primitive information as individual network packets or traffic flows. Aggregating communication data into meaningful entities is an open problem and existing, typically clustering-based, solutions are often highly suboptimal, producing results that may misinterpret the extracted information and consequently miss many network events. We propose a novel method for the extraction of various predefined types of network events from raw network flow data. The new method is based on analysis of computational properties of the event types as prescribed by their attributes in a given descriptive language. The corresponding events are then extracted with a supreme recall as compared to a respective event extraction part of an in-production intrusion detection system Camnep.

Estimating sequence similarity from read sets for clustering next-generation sequencing data

  • DOI: 10.1007/s10618-018-0584-8
  • Link: https://doi.org/10.1007/s10618-018-0584-8
  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    Computing mutual similarity of biological sequences such as DNA molecules is essential for significant biological tasks such as hierarchical clustering of genomes. Current sequencing technologies do not provide the content of entire biological sequences; rather they identify a large number of small substrings called reads, sampled at random places of the target sequence. To estimate similarity of two sequences from their read-set representations, one may try to reconstruct each one first from its read set, and then employ conventional (dis)similarity measures such as the edit distance on the assembled sequences. Due to the nature of data, sequence assembly often cannot provide a single putative sequence that matches the true DNA. Therefore, we propose instead to estimate the similarities directly from the read sets. Our approach is based on an adaptation of the Monge-Elkan similarity known from the field of databases, avoiding the sequence assembly step. For low-coverage (i.e. small) read set samples, it yields a better approximation of the true sequence similarities. This in turn results in better clustering in comparison to the first-assemble-then-cluster approach. Put differently, for a fixed estimation accuracy, our approach requires smaller read sets and thus entails reduced wet-lab costs.

Evaluating Model-free Directional Dependency Methods onSingle-cell RNA Sequencing Data with Severe Dropout

  • Department: Intelligent Data Analysis
  • Annotation:
    As severe dropout in single-cell RNA sequencing (scRNA-seq) degrades data quality, current methods for network in-ference face increased uncertainty from such data. To exa-mine how dropout influences directional dependency infe-rence from scRNA-seq data, we thus studied four methodsbased on discrete data that are model-free without paramet-ric model assumptions. They include two established me-thods: conditional entropy and Kruskal-Wallis test, and tworecent methods: causal inference by stochastic complexityand function index. We also included three non-directionalmethods for a contrast. On simulated data, function indexperformed most favorably at varying dropout rates, samplesizes, and discrete levels. On an scRNA-seq dataset from de-veloping mouse cerebella, function index and Kruskal-Wallistest performed favorably over other methods in detectingexpression of developmental genes as a function of time.Overall among the four methods, function index is mostresistant to dropout for both directional and dependencyinference. The next best choice, Kruskal-Wallis test, carriesa directional bias towards a uniformly distributed variable.We conclude that a method robust to marginal distributi-ons with a sufficiently large sample size can reap benefitsof single-cell over bulk RNA sequencing in understandingmolecular mechanisms at the cellular resolution.

Exploiting sports-betting market using machine learning

  • DOI: 10.1016/j.ijforecast.2019.01.001
  • Link: https://doi.org/10.1016/j.ijforecast.2019.01.001
  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    We introduce a forecasting system designed to profit from sports-betting market using machine learning. We contribute three main novel ingredients. First, previous attempts to learn models for match-outcome prediction maximized the model's predictive accuracy as the single criterion. Unlike these approaches, we also reduce the model's correlation with the bookmaker's predictions available through the published odds. We show that such an optimized model allows for better profit generation, and the approach is thus a way to `exploit' the bookmaker. The second novelty is in the application of convolutional neural networks for match outcome prediction. The convolution layer enables to leverage a vast number of player-related statistics on its input. Thirdly, we adopt elements of the modern portfolio theory to design a strategy for bet distribution according to the odds and model predictions, trading off profit expectation and variance optimally. These three ingredients combine towards a betting method yielding positive cumulative profits in experiments with NBA data from seasons 2007--2014 systematically, as opposed to alternative methods tested.

Learning to predict soccer results from relational data with gradient boosted trees

  • DOI: 10.1007/s10994-018-5704-6
  • Link: https://doi.org/10.1007/s10994-018-5704-6
  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    We describe our winning solution to the 2017’s Soccer Prediction Challenge organized in conjunction with the MLJ’s special issue on Machine Learning for Soccer. The goal of the challenge was to predict outcomes of future matches within a selected time-frame from different leagues over the world. A dataset of over 200,000 past match outcomes was provided to the contestants. We experimented with both relational and feature-based methods to learn predictive models from the provided data. We employed relevant latent variables computable from the data, namely so called pi-ratings and also a rating based on the PageRank method. A method based on manually constructed features and the gradient boosted tree algorithm performed best on both the validation set and the challenge test set. We also discuss the validity of the assumption that probability predictions on the three ordinal match outcomes should be monotone, underlying the RPS measure of prediction quality.

Score-based Soccer Match Outcome Modeling - an Experimental Review

  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    In this experimental work, we propose to investigate the state-of-the-art in score-basedsoccer match outcome prediction modeling to identify the top-performing methods acrossthe diverse classes of existing approaches to the problem. Namely, we bring together sta-tistical methods based on Poisson distribution, a general ranking algorithm (Elo), domain-specific rating system (pi-ratings) and a graph-based approach to the problem (PageRank).We experimentally compare these diverse competitors altogether on a large database ofsoccer results to identify the true leaders in the domain.

Sports betting strategies: an experimental review

  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    We investigate the problem of optimal wealth allocation over predictive sports mar-ket’s opportunities. We analyze the problem across diverse settings, utility targets, andthe notion of optimality itself. We review existing literature to identify the most prominentapproaches coming from the diverse sport and economic views on the problem, and providesome practical perspectives. Namely, we focus on the provably optimal geometric meanpolicy, typically referred to as the Kelly criterion, and Modern Portfolio Theory based ap-proaches leveraging utility theory. From the joint perspective of decision theory, we discusstheir unique properties, assumptions and, importantly, investigate effective heuristics andpractical techniques to tackle their key common challenges, particularly the problem ofuncertainty in the outcome probability estimates. Finally, we verify our findings on a largedataset of soccer records.

Genomic single rule learning with an ontology-based refinement operator

  • Department: Intelligent Data Analysis
  • Annotation:
    Rule learning is a kind of machine learning method that induces a set of classification rules from a given set of training examples. As a well-known representative of this learners, we can adduce CN2, RIPPER, or PRIM. All of them use if-then statement for corresponding hypothesis formulation where the antecedent is in the form of a conjunction of logical terms, and the consequent is a class label. From a bioinformatician point of view, these learners are suitable especially for their easy and clear interpretation of hypothesis on the contrary of a neural network, for example. The other thing that can help biologists interpret their data in a more natural way is a background knowledge. Nowadays, the most popular form of background knowledge in the field of bioinformatics are ontologies, especially Gene Ontology or Disease Ontology. There are other types of structured databases such as KEGG, that can also be interpreted as an ontology or a taxonomy. In our work, we combine these two concepts, rule learning and ontologies/taxonomies, together

Lifted Relational Neural Networks: Efficient Learning of Latent Relational Structures

  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    We propose a method to combine the interpretability and expressive power of firstorder logic with the effectiveness of neural network learning. In particular, we introduce a lifted framework in which first-order rules are used to describe the structure of a given problem setting. These rules are then used as a template for constructing a number of neural networks, one for each training and testing example. As the different networks corresponding to different examples share their weights, these weights can be efficiently learned using stochastic gradient descent. Our framework provides a flexible way for implementing and combining a wide variety of modelling constructs. In particular, the use of first-order logic allows for a declarative specification of latent relational structures, which can then be efficiently discovered in a given data set using neural network learning. Experiments on 78 relational learning benchmarks clearly demonstrate the effectiveness of the framework.

Lifted Relational Team Embeddings for Predictive Sports Analytics

  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    We investigate the use of relational learning in domain of predictive sports analytics, for which we propose a team embedding concept expressed in the language of Lifted relational neural networks, a framework for learning of latent relational structures. On a large dataset of soccer results, we compare different relational learners against strong current methods from the domain to show some very promising results of the relational approach when combined with embedding learning

Pruning Hypothesis Spaces Using Learned Domain Theories

  • DOI: 10.1007/978-3-319-78090-0_11
  • Link: https://doi.org/10.1007/978-3-319-78090-0_11
  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    We present a method to prune hypothesis spaces in the con- text of inductive logic programming. The main strategy of our method consists in removing hypotheses that are equivalent to already consid- ered hypotheses. The distinguishing feature of our method is that we use learned domain theories to check for equivalence, in contrast to existing approaches which only prune isomorphic hypotheses. Specifically, we use such learned domain theories to saturate hypotheses and then check if these saturations are isomorphic. While conceptually simple, we exper- imentally show that the resulting pruning strategy can be surprisingly effective in reducing both computation time and memory consumption when searching for long clauses, compared to approaches that only con- sider isomorphism.

Stacked Structure Learning for Lifted Relational Neural Networks

  • DOI: 10.1007/978-3-319-78090-0_10
  • Link: https://doi.org/10.1007/978-3-319-78090-0_10
  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    Lifted Relational Neural Networks (LRNNs) describe relational domains using weighted first-order rules which act as templates for constructing feed-forward neural networks. While previous work has shown that using LRNNs can lead to state-of-the-art results in various ILP tasks, these results depended on hand-crafted rules. In this paper, we extend the framework of LRNNs with structure learning, thus enabling a fully automated learning process. Similarly to many ILP methods, our structure learning algorithm proceeds in an iterative fashion by top-down searching through the hypothesis space of all possible Horn clauses, considering the predicates that occur in the training examples as well as invented soft concepts entailed by the best weighted rules found so far. In the experiments, we demonstrate the ability to automatically induce useful hierarchical soft concepts leading to deep LRNNs with a competitive predictive power.

A First-Order Axiomatization for Transition Learning with RichConstraints

  • Authors: Barvínek, J., prof. Ing. Filip Železný, Ph.D.,
  • Publication: Late Breaking Papers of the 27th International Conference on Inductive Logic Programming. Tilburg: CEUR Workshop Proceedings, 2017. p. 1-5. ISSN 1613-0073.
  • Year: 2017
  • Department: Intelligent Data Analysis
  • Annotation:
    We address the task of learning a dynamic Boolean network model from dataabout its state transitions, and constraints regarding the known attractors ofthe system. To this end, we propose a learning strategy where such priorknowledge is compiled into a first-order theory along with axioms describ-ing the Boolean transitions functions and the language bias for their repre-sentation. The learning task is then posed as a model-finding task, in that theBoolean network is obtained as a model of the input first-order theory. Withthis framework, we support experimentally the hypothesis that attractor con-straints reduce the number of state transition examples needed to identify thetarget model.

Estimating Sequence Similarity from Contig Sets

  • Authors: Bc. Petr Ryšavý, MSc., Ph.D., prof. Ing. Filip Železný, Ph.D.,
  • Publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Düsseldorf: Springer VDI Verlag, 2017. p. 272-283. ISSN 0302-9743. ISBN 978-3-319-68764-3.
  • Year: 2017
  • DOI: 10.1007/978-3-319-68765-0_23
  • Link: https://doi.org/10.1007/978-3-319-68765-0_23
  • Department: Department of Computer Science
  • Annotation:
    A key task in computational biology is to determine mutual similarity of two genomic sequences. Current bio-technologies are usually not able to determine the full sequential content of a genome from biological material, and rather produce a set of large substrings (contigs) whose order and relative mutual positions within the genome are unknown. Here we design a function estimating the sequential similarity (in terms of the inverse Levenshtein distance) of two genomes, given their respective contig-sets. Our approach consists of two steps, based respectively on an adaptation of the tractable Smith-Waterman local alignment algorithm, and a problem reduction to the weighted interval scheduling problem soluble efficiently with dynamic programming. In hierarchical-clustering experiments with Influenza and Hepatitis genomes, our approach outperforms the standard baseline where only the longest contigs are compared. For high-coverage settings, it also outperforms estimates produced by the recent method [8] that avoids contig construction completely.

Learning Predictive Categories Using Lifted Relational Neural Networks

Semantic biclustering for finding local, interpretable and predictive expression patterns

  • DOI: 10.1186/s12864-017-4132-5
  • Link: https://doi.org/10.1186/s12864-017-4132-5
  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    Background: One of the major challenges in the analysis of gene expression data is to identify local patterns composed of genes showing coherent expression across subsets of experimental conditions. Such patterns may provide an understanding of underlying biological processes related to these conditions. This understanding can further be improved by providing concise characterizations of the genes and situations delimiting the pattern. Results: We propose a method called semantic biclustering with the aim to detect interpretable rectangular patterns in binary data matrices. As usual in biclustering, we seek homogeneous submatrices, however, we also require that the included elements can be jointly described in terms of semantic annotations pertaining to both rows (genes) and columns (samples). To find such interpretable biclusters, we explore two strategies. The first endows an existing biclustering algorithm with the semantic ingredients. The other is based on rule and tree learning known from machine learning. Conclusions: The two alternatives are tested in experiments with two Drosophila melanogaster gene expression datasets. Both strategies are shown to detect sets of compact biclusters with semantic descriptions that also remain largely valid for unseen (testing) data. This desirable generalization aspect is more emphasized in the strategy stemming from conventional biclustering although this is traded off by the complexity of the descriptions (number of ontology terms employed), which, on the other hand, is lower for the alternative strategy.

Estimating Sequence Similarity from Read Sets for Clustering Sequencing Data

  • DOI: 10.1007/978-3-319-46349-0_18
  • Link: https://doi.org/10.1007/978-3-319-46349-0_18
  • Department: Department of Computer Science
  • Annotation:
    Clustering biological sequences is a central task in bioinformatics. The typical result of new-generation sequencers is a set of short substrings (“reads”) of a target sequence, rather than the sequence itself. To cluster sequences given only their read-set representations, one may try to reconstruct each one from the corresponding read set, and then employ conventional (dis)similarity measures such as the edit distance on the assembled sequences. This approach is however problematic and we propose instead to estimate the similarities directly from the read sets. Our approach is based on an adaptation of the Monge-Elkan similarity known from the field of databases. It avoids the NP-hard problem of sequence assembly and in empirical experiments it results in a better approximation of the true sequence similarities and consequently in better clustering, in comparison to the first-assemble-then-cluster approach.

Lifted Relational Neural Networks

  • Department: Department of Computer Science
  • Annotation:
    We propose a method combining relational-logic representations with neural network learning. A general lifted architecture, possibly reflecting some background domain knowledge, is described through relational rules which may be handcrafted or learned. The relational rule-set serves as a template for unfolding possibly deep neural networks whose structures also reflect the structures of given training or testing relational examples. Different networks corresponding to different examples share their weights, which co-evolve during training by stochastic gradient descent algorithm. The framework allows for hierarchical relational modeling constructs and learning of latent relational concepts through shared hidden layers weights corresponding to the rules. Discovery of notable relational concepts and experiments on 78 relational learning benchmarks demonstrate favorable performance of the method.

Polynomial and Extensible Solutions in Lock-Chart Solving

  • DOI: 10.1080/08839514.2017.1279110
  • Link: https://doi.org/10.1080/08839514.2017.1279110
  • Department: Department of Computer Science
  • Annotation:
    Lock-chart (also known as master-key system) solving is a process for designing mechanical keys and locks so that every key can open and be blocked in a user-defined set of locks. We present several algorithms as a result of our 3 years industrial collaboration project, chosen by their asymptotic time complexity and extensibility, which is the ability to add new keys and locks and satisfy the previously unforeseen blocking requirements. Our strongest result is a linear-time algorithm to find the largest diagonal lock-chart with a condition on mechanical properties of keys. Without these restrictions, but given a solution for master keys, the problem is translated into a maximum independent set problem and a greedy approximation is proposed. We evaluate the algorithms on a generated dataset using a non-backtracking procedure to simulate minimal extensions. Both approaches are suggested as heuristics or subroutines for a backtracking constraint satisfaction problem-based (CSP) solver.

Semantic Biclustering Analysis of Gene Expression Data

  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    Biclustering is a well-known technique for seeking local patterns in a gene matrix. This matrix is analysed simultaneously in two dimensions that contain genes (rows) and samples (columns). Biclustering algorithm identifies subsets of genes in subsets of samples. Typically, semantic clustering of gene expression matrix is a clustering of genes based on their expression values where each cluster is identified by Gene ontology terms.

Semantic Biclustering: A New Way to Analyze and Interpret Gene Expression Data

  • DOI: 10.1007/978-3-319-38782-6
  • Link: https://doi.org/10.1007/978-3-319-38782-6
  • Department: Department of Computer Science, Intelligent Data Analysis
  • Annotation:
    We motivate and define the task of semantic biclustering. In an input gene expression matrix, the task is to discover homogeneous biclusters allowing joint characterization of the contained elements in terms of knowledge pertaining to both the rows (e.g. genes) and the columns (e.g. situations). We propose two approaches to solve the task, based on adaptations of current biclustering, enrichment, and rule and tree learning methods. We compare the approaches in experiments with Drosophila ovary gene expression data. Our findings indicate that both the proposed methods induce compact bicluster sets whose description is applicable to unseen data. The bicluster enrichment method achieves the best performance in terms of the area under the ROC curve, at the price of employing a large number of ontology terms to describe the biclusters.

Ensemble Learning of Runtime Prediction Models for Gene-expression Analysis Workflows

  • Authors: Monge, D.A., Holec, M., prof. Ing. Filip Železný, Ph.D., Garino, C.G.
  • Publication: Cluster Computing. 2015, 18(4), 1317-1329. ISSN 1386-7857.
  • Year: 2015
  • DOI: 10.1007/s10586-015-0481-5
  • Link: https://doi.org/10.1007/s10586-015-0481-5
  • Department: Department of Computer Science
  • Annotation:
    The adequate management of scientific workflow applications strongly depends on the availability of accurate performance models of sub-tasks. Numerous approaches use machine learning to generate such models autonomously, thus alleviating the human effort associated to this process. However, these standalone models may lack robustness, leading to a decay on the quality of information provided to workflow systems on top. This paper presents a novel approach for learning ensemble prediction models of tasks runtime. The ensemble-learning method entitled bootstrap aggregating (bagging) is used to produce robust ensembles of M5P regression trees of better predictive performance than could be achieved by standalone models. Our approach has been tested on gene expression analysis workflows. The results show that the ensemble method leads to significant prediction-error reductions when compared with learned standalone models. This is the first initiative using ensemble learning for generating performance prediction models. These promising results encourage further research in this direction.

Learning Running-time Prediction Models for Gene-Expression Analysis Workflows

  • Authors: Monge, D.A., Holec, M., prof. Ing. Filip Železný, Ph.D., Garino, CG
  • Publication: IEEE LATIN AMERICA TRANSACTIONS. 2015, 13(9), 3088-3095. ISSN 1548-0992.
  • Year: 2015
  • Department: Department of Computer Science
  • Annotation:
    One of the central issues for the efficient management of Scientific workflow applications is the prediction of tasks performance. This paper proposes a novel approach for constructing performance models for tasks in data-intensive scientific workflows in an autonomous way. Ensemble Machine Learning techniques are used to produce robust combined models with high predictive accuracy. Information derived from workflow systems and the characteristics and provenance of the data are exploited to guarantee the accuracy of the models. A gene-expression analysis workflow application was used as case study over homogeneous and heterogeneous computing environments. Experimental results evidence noticeable improvements while using ensemble models in comparison with single/standalone prediction models. Ensemble learning techniques made it possible to reduce the prediction error with respect to the strategies of a single-model with values ranging from 14.47% to 28.36% for the homogeneous case, and from 8.34% to 17.18% for the heterogeneous case.

Learning to detect network intrusion from a few labeled events and background traffic

  • DOI: 10.1007/978-3-319-20034-7_9
  • Link: https://doi.org/10.1007/978-3-319-20034-7_9
  • Department: Department of Computer Science
  • Annotation:
    Intrusion detection systems (IDS) analyse network traffic data with the goal to reveal malicious activities and incidents. A general problem with learning within this domain is a lack of relevant ground truth data, i.e. real attacks, capturing malicious behaviors in their full variety. Most of existing solutions thus, up to a certain level, rely on rules designed by network domain experts. Although there are advantages to the use of rules, they lack the basic ability of adapting to traffic data. As a result, we propose an ensemble tree bagging classifier, capable of learning from an extremely small number of true attack representatives, and demonstrate that, incorporating a general background traffic, we are able to generalize from those few representatives to achieve competitive results to the expert designed rules used in existing IDS Camnep.

Novel Gene Sets Improve Set-level Classification of Prokaryotic Gene Expression Data

  • DOI: 10.1186/s12859-015-0786-7
  • Link: https://doi.org/10.1186/s12859-015-0786-7
  • Department: Department of Computer Science
  • Annotation:
    Set-level classification of gene expression data has received significant attention recently. In this setting, high-dimensional vectors of features corresponding to genes are converted into lower-dimensional vectors of features corresponding to biologically interpretable gene sets. The dimensionality reduction brings the promise of a decreased risk of overfitting, potentially resulting in improved accuracy of the learned classifiers. However, recent empirical research has not confirmed this expectation. Here we hypothesize that the reported unfavorable classification results in the set-level framework were due to the adoption of unsuitable gene sets defined typically on the basis of the Gene ontology and the KEGG database of metabolic networks. We explore an alternative approach to defining gene sets, based on regulatory interactions, which we expect to collect genes with more correlated expression. We hypothesize that such more correlated gene sets will enable to learn more accurate classifiers.

A Method for Reduction of Examples in Relational Learning

Automatic Analysis of Spatial Gene Expression Patterns

  • Department: Department of Cybernetics, Department of Computer Science
  • Annotation:
    Gene functionality is of paramount importance in basic biological research with possible therapeutic applications in medicine. Genes turn on in specific cells and at a specific time-point (gene expression pattern).

Binary Classification of Metagenomic Samples Using Discriminative DNA Superstrings

  • Authors: Jalovec, K., prof. Ing. Filip Železný, Ph.D.,
  • Publication: Proceedings of the Eighth International Workshop on Machine Learning in Systems Biology. Paris: Université Paris V, 2014, pp. 44-47.
  • Year: 2014
  • Department: Department of Computer Science
  • Annotation:
    Increasing amount of data obtained by the NGS technologies increases the urge of effective analysis of this data. This work presents a tool for binary classification of metagenomic samples. Metagenomic samples consist of a large amount of short DNA strings (also called reads), which belong to different organisms present in an environ- ment from which the sample was taken. Behavior of an environment can be affected by the contamination by the organisms, which originaly do not belong in this envi- ronment. The goal of this work is to develop a classification method based on DNA superstrings that can accurately classify metagenomic samples. Classifiers obtained by this method can be used for determining whether newly obtained metagenomic samples are contaminated (positive) or clean (negative) without the need of identifi- cation of particular organisms present in the sample. We want to achieve this goal by establishing a modified sequence assembly task for finding the most discriminatory DNA superstrings. We assume that standard a approach for this kind of analysis would be to assemble all the samples and try to find the most discriminatory motifs. Both tasks are very com- putationally demanding. Our method should solve both these tasks simultaneously.

Ensemble Learning of Run-Time Prediction Models for Data-Intensive Scientific Workflows

  • Authors: Monge, D A, Holec, M., prof. Ing. Filip Železný, Ph.D., Garino, C G
  • Publication: High Performance Computing. Heidelberg: Springer, 2014. pp. 83-97. Communications in Computer and Information Science. ISSN 1865-0929. ISBN 978-3-662-45482-4.
  • Year: 2014
  • DOI: 10.1007/978-3-662-45483-1_7
  • Link: https://doi.org/10.1007/978-3-662-45483-1_7
  • Department: Department of Computer Science
  • Annotation:
    This paper proposes a novel approach that enables the construction models for predicting task’s running-times of data-intensive scientific workflows. Ensemble Machine Learning techniques are used to produce robust combined models with high predictive accuracy. Information derived from workflow systems and the characteristics and provenance of the data are exploited to guarantee the accuracy of the models. The proposed approach has been tested on Bioinformatics workflows for Gene Expressions Analysis over homogeneous and heterogeneous computing environments.

Predicting Top-k Trends on Twitter using Graphlets and Time Features

  • Department: Department of Computer Science
  • Annotation:
    We introduce a novel method for predicting trending keywords on Twitter. This new method exploits topology of the studied parts of the social network. It is based on a combination of graphlet spectra and so-called time features. We show experimentally that using graphlets and time features is bene cial for the accuracy of prediction.

Satisfiability Machines

  • Authors: prof. Ing. Filip Železný, Ph.D.,
  • Publication: Latest Advances in Inductive Logic Programming. London: Imperial College Press, 2014. p. 113-120. ISBN 978-1-78326-508-4.
  • Year: 2014
  • Department: Department of Computer Science
  • Annotation:
    We propose a probabilistic model for formulas, in which a formula’s probability decreases with the number of pre-defined con- straints it contradicts. Probability of ground conjunctions can be com- puted by a series of subsumption checks. The probability is equiv alent (up to a multiplicative constant) to that computed by a Markov Logic Network for conjunctions that fully describe a possible world. An ex per- iment indicates that the two quantities correlate also for other conju nc- tions, with less variance for larger conjunctions. The proposed framewor k suggests a simple classification principle.

Tubular atrophy and low netrin-1 expression leves are risk factors associated with delayed kidney allograft function

  • Authors: Wohlfahrtova, M., Brabcova, I., prof. Ing. Filip Železný, Ph.D., Balaz, P., Janousek, L., Honsova, E., Lodererova, A., Wohlfahrt, P., Viklicky, O.
  • Publication: Transplantation. 2014, 97(2), 176-183. ISSN 0041-1337.
  • Year: 2014
  • DOI: 10.1097/TP.0b013e3182a95d04
  • Link: https://doi.org/10.1097/TP.0b013e3182a95d04
  • Department: Department of Computer Science
  • Annotation:
    Poor baseline tubular cell quality (defined by a higher rate of tubular atrophy) combined with the reduced potential of apoptotic survival factors represented by decreased Netrin-1 gene expression were associated with delayed kidney graft function.

Bounded Least General Generalization

  • Authors: Ing. Ondřej Kuželka, Ph.D., Szabóová, A., prof. Ing. Filip Železný, Ph.D.,
  • Publication: Inductive Logic Programming, 22nd International Conference, ILP 2012, Dubrovnik, Croatia, September 17-19, 2012, Revised Selected Papers. Berlin: Springer, 2013, pp. 116-129. Lecture Notes in Computer Science. ISSN 0302-9743. ISBN 978-3-642-38811-8. Available from: http://link.springer.com/chapter/10.1007/978-3-642-38812-5_9
  • Year: 2013
  • DOI: 10.1007/978-3-642-38812-5_9
  • Link: https://doi.org/10.1007/978-3-642-38812-5_9
  • Department: Department of Computer Science
  • Annotation:
    We study a generalization of Plotkin’s least general generalization. We introduce a novel concept called bounded least general generalization w.r.t. a set of clauses and show an instance of it for which polynomial-time reduction procedures exist. We demonstrate the practical utility of our approach in experiments on several relational learning datasets.

Exploiting knowledge in knowledge discovery

  • Department: Department of Computer Science
  • Annotation:
    The chapter deals with machine learning methods for exploiting knowledge in the process of knowledge discovery.

Formulating the template ILP consistency problem as a constraint satisfaction problem

Reducing Examples in Relational Learning with Bounded-Treewidth Hypotheses

  • Authors: Ing. Ondřej Kuželka, Ph.D., Szabóová, A., prof. Ing. Filip Železný, Ph.D.,
  • Publication: Proceedings of the Workshop on New Frontiers In Mining Complex Patterns. Naples: ICAR - National Research Council, 2013, pp. 17-32. ISSN 0302-9743. ISBN 978-3-642-37381-7. Available from: http://www.di.uniba.it/~ceci/micFiles/papers/NFMCP.pdf
  • Year: 2013
  • DOI: 10.1007/978-3-642-37382-4_2
  • Link: https://doi.org/10.1007/978-3-642-37382-4_2
  • Department: Department of Cybernetics
  • Annotation:
    We study reducibility of learning examples in the learning from entailment setting. We start with an existing reduction method and improve it for the case when learned hypotheses are restricted to have bounded treewidth. We show that in such cases there is a polynomial-time reduction algorithm which reduces the learning examples at least as much as the existing exponential-time algorithm despite the fact that the examples which are reduced can have arbitrarily high treewidth.

Comparative Evaluation of Set-level Techniques in Predictive Classification of Gene Expression Samples

  • DOI: 10.1186/1471-2105-13-S10-S15
  • Link: https://doi.org/10.1186/1471-2105-13-S10-S15
  • Department: Department of Cybernetics
  • Annotation:
    Background: Analysis of gene expression data in terms of a priori defined gene sets has recently received significant attention as this approach typically yields more compact and interpretable results than those produced by traditional methods that rely on individual genes. The set level strategy can also be adopted with similar benefits in predictive classification tasks accomplished with machine learning algorithms. Initial studies into the predictive performance of set level classifiers have yielded rather controversial results. The goal of this study is to provide a more conclusive evaluation by testing various components of the set level framework within a large collection of machine learning experiments. Results: Genuine curated gene sets constitute better features for classification than sets assembled without biological relevance. For identifying the best gene sets for classification, the Global test outperforms the gene set methods GSEA and SAM GS as well as two generic feature selection methods. To aggregate expressions of genes into a feature value, the singular value decomposition (SVD) method as well as the SetSig technique improve on simple arithmetic averaging. Set level classifiers learned with 10 features constituted by the Global test slightly outperform baseline gene level classifiers learned with all original data features although they are slightly less accurate than gene level classifiers learned with a prior feature selection step. Conclusion: Set level classifiers do not boost predictive accuracy, however, they do achieve competitive accuracy if learned with the right combination of ingredients. Availability: Open-source, publicly available software was used for classifier learning and testing. The gene expression datasets and the gene set database used are also publicly available. The full tabulation of experimental results is available at http://ida.felk.cvut.cz/CESLT.

Constraint Satisfaction for Learning Hypotheses in Inductive Logic Programming

  • Department: Department of Cybernetics
  • Annotation:
    The chapter is an introductory overview of constraint-satisfaction techniques for learning hypotheses in inductive logic programming.

Differential Regulation of the Nuclear Factor-kappa B Pathway by Rabbit Antithymocyte Globulins in Kidney Transplantation

  • Authors: Urbanová, M., Brabcová, I., Girmanová, E., prof. Ing. Filip Železný, Ph.D., Viklický, O.
  • Publication: Transplantation. 2012, 93(6), 589-596. ISSN 0041-1337.
  • Year: 2012
  • DOI: 10.1097/TP.0b013e31824491aa
  • Link: https://doi.org/10.1097/TP.0b013e31824491aa
  • Department: Department of Cybernetics
  • Annotation:
    Background. Induction therapy is associated with excellent short-term kidney graft outcome. The aim of this study was to evaluate differences in the intragraft transcriptome after successful induction therapy using two rabbit antithymocyte globulins. Methods. The expression of 376 target genes involved in tolerance, inflammation, T- and B-cell immune response, and apoptosis was evaluated using the quantitative real-time reverse-transcriptase polymerase chain reaction (2-ΔΔCt) method in kidney graft biopsies with normal histological findings and stable renal function, 3 months posttransplantation after induction therapy with Thymoglobulin, ATG-Fresenius S (ATG-F), and a control group without induction therapy. Results. The transcriptional pattern induced by Thymoglobulin differed from ATG-F in 18 differentially expressed genes. Down-regulation of genes involved in the nuclear factor-κB pathway (TLR4, MYD88, and CD209), costimulation (CD80 and CTLA4), apoptosis (NLRP1), chemoattraction (CCR10), and dendritic cell function (CLEC4C) was observed in the biopsies from patients treated with Thymoglobulin. A hierarchical clustering analysis clearly separated the Thymoglobulin group from the ATG-F group, while the control group had a similar profile as the Thymoglobulin group. Conclusions. Despite normal morphology in graft biopsy taken 3 months posttransplantation, the intrarenal transcriptome differed in patients treated with induction therapy using different rATGs. In the Thymoglobulin high-risk group, the transcriptome profile was identical to the low-risk group. Therefore, the down-regulation of the nuclear factor-κB pathway after Thymoglobulin induction in vivo is likely to explain the clinical success of this biologic.

Extending the Ball-Histogram Method with Continuous Distributions and an Application to Prediction of DNA-Binding Proteins

  • DOI: 10.1109/BIBM.2012.6392699
  • Link: https://doi.org/10.1109/BIBM.2012.6392699
  • Department: Department of Cybernetics
  • Annotation:
    We introduce a novel method for prediction of DNA-binding propensity of proteins which extends our recently introduced ball-histogram method (Szabóová et al. 2012). Unlike the original ball-histogram method, it allows handling of continuous properties of protein regions. In experiments on four datasets of proteins, we show that the method improves upon the original ball-histogram method as well as other existing methods in terms of predictive accuracy.

Inductive Logic Programming

  • Authors: prof. Ing. Filip Železný, Ph.D.,
  • Publication: Encyclopedia of the Sciences of Learning. Dordrecht: Springer, 2012. p. 1537-1539. ISBN 978-1-4419-1427-9.
  • Year: 2012

Planning to Learn: Recent Developments and Future Directions

  • Authors: prof. Ing. Filip Železný, Ph.D.,
  • Publication: International Workshop on Planning to Learn 2012. Aachen: RWTH Aachen University, 2012. p. 1-3. CEUR workshop proceedings. ISSN 1613-0073.
  • Year: 2012
  • Department: Department of Computer Science
  • Annotation:
    The talk will cover my lab's recent research concerning planning to learn and discuss its relationships to relevant work of other researchers

Prediction of Antimicrobial Activity of Peptides using Relational Machine Learning

  • Department: Department of Cybernetics
  • Annotation:
    We apply relational machine learning techniques to predict antimicrobial activity of peptides. We follow our successful strategy (Szabóová et al., MLSB 2010) to prediction of DNA-binding propensity of proteins from structural features. We exploit structure prediction methods to obtain peptides' spatial structures, then we construct the structural relational features. We use these relational features as attributes in a regression model. We apply this methodology to antimicrobial activity prediction of peptides achieving better predictive accuracies than a state-of-the-art approach.

Prediction of DNA-Binding Propensity of Proteins by the Ball-Histogram Method Using Automatic Template Search

  • DOI: 10.1186/1471-2105-13-S10-S3
  • Link: https://doi.org/10.1186/1471-2105-13-S10-S3
  • Department: Department of Cybernetics
  • Annotation:
    We contribute a novel, ball-histogram approach to DNA-binding propensity prediction of proteins. Unlike state-of-the-art methods based on constructing an ad-hoc set of features describing physicochemical properties of the proteins, the ball-histogram technique enables a systematic, Monte-Carlo exploration of the spatial distribution of amino acids complying with automatically selected properties. This exploration yields a model for the prediction of DNA binding propensity. We validate our method in prediction experiments, improving on state-of-the-art accuracies. Moreover, our method also provides interpretable features involving spatial distributions of selected amino acids.

Prediction of DNA-Binding Proteins from Relational Features

  • DOI: 10.1186/1477-5956-10-66
  • Link: https://doi.org/10.1186/1477-5956-10-66
  • Department: Department of Cybernetics
  • Annotation:
    The process of protein-DNA binding has an essential role in the biological processing of genetic information. We use relational machine learning to predict DNA-binding propensity of proteins from their structures. Automatically discovered structural features are able to capture some characteristic spatial configurations of amino acids in proteins. Prediction based only on structural relational features already achieves competitive results to existing methods based on physicochemical properties on several protein datasets. Predictive performance is further improved when structural features are combined with physicochemical features. Moreover, the structural features provide some insights not revealed by physicochemical features. Our method is able to detect common spatial substructures. We demonstrate this in experiments with zinc finger proteins.

Relational Learning with Polynomials

  • DOI: 10.1109/ICTAI.2012.163
  • Link: https://doi.org/10.1109/ICTAI.2012.163
  • Department: Department of Cybernetics
  • Annotation:
    We describe a conceptually simple framework for transformation-based learning in hybrid relational domains. The proposed approach is related to hybrid Markov logic and to Gaussian logic framework. We evaluate the approach in three domains and show that it can achieve state-of-the-art performance while using only limited amount of information.

Subgroup Discovery through Bump Hunting on Relational Histograms

  • Authors: Černoch, R., prof. Ing. Filip Železný, Ph.D.,
  • Publication: Inductive Logic Programming. Heidelberg: Springer, 2012. pp. 76-90. ISSN 0302-9743. ISBN 978-3-642-31950-1.
  • Year: 2012
  • DOI: 10.1007/978-3-642-31951-8_11
  • Link: https://doi.org/10.1007/978-3-642-31951-8_11
  • Department: Department of Computer Science
  • Annotation:
    We propose an approach to subgroup discovery in relational databases containing numerical attributes. The approach is based on detecting bumps in histograms constructed from substitution sets resulting from matching a rst-order query against the input relational database. The approach is evaluated on seven data sets, discovering interpretable subgroups. The subgroups' rate of survival from the training split to the testing split varies among the experimental data sets, but at least on three of them it is very high.

A Performance Prediction Module for Workflow Scheduling

  • Authors: Monge, D., Bělohradský, J., Garino, C.G., prof. Ing. Filip Železný, Ph.D.,
  • Publication: Proceedings of the Fourth Symposium on High-Performance Computing. Buenos Aires: AADECA, 2011, pp. 1-15. ISSN 1851-9326. Available from: http://ida.felk.cvut.cz/cgi-bin/docarc/public.pl/document/186/hpc2011-monge.pdf
  • Year: 2011
  • Department: Department of Cybernetics
  • Annotation:
    Through the years, scientific applications have demanded more powerful and sophisticated computing environments and manage- ment techniques. Workflows facilitated the design and management of scientific applications. The complexity of today's workflows demand a high amount of resources and mechanisms for provisioning them. The execution of scientific workflow applications is a complex task and de- pends on how the resources are assigned. Scheduling is the name given to the process that assigns computing resources to the tasks comprised in a workflow. This work presents a scheduling algorithm (PPSA) for workflows tightly coupled to a performance prediction module (PEM). A set of experiments was developed for measuring the performance of the algorithm using the information provided by the proposed perfor- mance module. The proposed algorithm is compared with an algorithm included in the well-known workflow middlewares Condor DAGMan and ASKALON.

An Experimental Evaluation of Lifted Gene Sets

  • Authors: Ing. Ondřej Kuželka, Ph.D., prof. Ing. Filip Železný, Ph.D.,
  • Publication: Workshop on Mining Complex Entities from Network and Biomedical Data. Aldo Moro: Department of Computer Science (DIB) University of Bari, 2011, pp. 14-23. Available from: http://www.di.uniba.it/~loglisci/MIND2011/ECML_PKDD_MIND_2011_Proceedings.pdf
  • Year: 2011
  • Department: Department of Cybernetics
  • Annotation:
    We introduce so-called lifted gene sets which are relational descriptions of sets of genes. We show that lifted gene sets are, to some extent, able to capture common structures in correlation matrices. In experiments with real-life microarray data, we show that, in fact, lifted gene sets can provide slightly better estimates of correlation than ground gene sets.

An Experimental Test of Occam's Razor in Classification

Automating Knowledge Discovery Workflow Composition Through Ontology-Based Planning

  • DOI: 10.1109/TASE.2010.2070838
  • Link: https://doi.org/10.1109/TASE.2010.2070838
  • Department: Department of Cybernetics
  • Annotation:
    The problem addressed in this paper is the challenge of automated construction of knowledge discovery workflows, given the types of inputs and the required outputs of the knowl- edge discovery process. Our methodology consists of two main ingredients. The first one is defining a formal conceptualization of knowledge types and data mining algorithms by means of knowl- edge discovery ontology. The second one is workflow composition formalized as a planning task using the ontology of domain and task descriptions. Two versions of a forward chaining planning algorithm were developed. The baseline version demonstrates suitability of the knowledge discovery ontology for planning and uses Planning Domain Definition Language (PDDL) descriptions of algorithms; to this end, a procedure for converting data mining algorithm descriptions into PDDL was developed. The second directly queries the ontology using a reasoner.

Block-Wise Construction of Tree-like Relational Features with Monotone Reducibility and Redundancy

Comparative Evaluation of Set-Level Techniques in Microarray Classification

  • DOI: 10.1007/978-3-642-21260-4_27
  • Link: https://doi.org/10.1007/978-3-642-21260-4_27
  • Department: Department of Cybernetics
  • Annotation:
    Analysis of gene expression data in terms of a priori-defined gene sets typically yields more compact and interpretable results than those produced by traditional methods that rely on individual genes. The set-level strategy can also be adopted in predictive classification tasks accomplished with machine learning algorithms. Here, sample features originally corresponding to genes are replaced by a much smaller number of features, each corresponding to a gene set and aggregating expressions of its members into a single real value. Classifiers learned from such transformed features promise better interpretability in that they derive class predictions from overall expressions of selected gene sets (e.g. corresponding to pathways) rather than expressions of specific genes. In a large collection of experiments we test how accurate such classifiers are compared to traditional classifiers based on genes.

Gaussian Logic and Its Applications in Bioinformatics

  • DOI: 10.1145/2147805.2147881
  • Link: https://doi.org/10.1145/2147805.2147881
  • Department: Department of Cybernetics
  • Annotation:
    We describe a novel statistical relational learning framework capable to work efficiently with combinations of relational and numerical data which is especially valuable in bioinformatics applications. We show how this model can be applied to modelling of gene expression data and to problems from proteomics.

Gaussian Logic for Predictive Classification

  • Authors: Ing. Ondřej Kuželka, Ph.D., Szabóová, A., Holec, M., prof. Ing. Filip Železný, Ph.D.,
  • Publication: Machine Learning and Knowledge Discovery in Databases. Berlin: Springer, 2011, pp. 277-292. ISSN 0302-9743. ISBN 978-3-642-23782-9. Available from: http://www.springerlink.com/content/527048g350795uh0
  • Year: 2011
  • DOI: 10.1007/978-3-642-23783-6_18
  • Link: https://doi.org/10.1007/978-3-642-23783-6_18
  • Department: Department of Cybernetics
  • Annotation:
    We describe a statistical relational learning framework called Gaussian Logic capable to work efficiently with combinations of relational and numerical data. The framework assumes that, for a fixed relational structure, the numerical data can be modelled by a multivariate normal distribution. We demonstrate how the Gaussian Logic framework can be applied to predictive classification problems. In experiments, we first show an application of the framework for the prediction of DNA-binding propensity of proteins. Next, we show how the Gaussian Logic framework can be used to find motifs describing highly correlated gene groups in gene-expression data which are then used in a set-level-based classification method.

Gaussian Logic for Proteomics and Genomics

  • Department: Department of Cybernetics
  • Annotation:
    We describe a statistical relational learning framework called Gaussian Logic capable to work efficiently with combinations of relational and numerical data. The framework assumes that, for a fixed relational structure, the numerical data can be modelled by a multivariate normal distribution. We show how the Gaussian Logic framework can be used to predict DNA-binding propensity of proteins and to find motifs describing novel gene sets which are then used in set-level classification of gene expression sample.

Prediction of DNA-Binding Propensity of Proteins by the Ball Histogram Method

  • Department: Department of Cybernetics
  • Annotation:
    We contribute a novel, ball-histogram approach to DNA-binding propensity prediction of proteins. Unlike state-of-the-art methods based on constructing an ad-hoc set of features describing the charged patches of the proteins, the ball-histogram technique enables a systematic, Monte-Carlo exploration of the spatial distribution of charged amino acids, capturing joint probabilities of specified amino acids occurring in certain distances from each other. This exploration yields a model for the prediction of DNA binding propensity. We validate our method in prediction experiments, achieving favorable accuracies. Moreover, our method also provides interpretable features involving spatial distributions of selected amino acids.

Template-Based Semi-Automatic Workflow Construction For Gene Expression Data Mining

  • Authors: Bělohradský, J., Monge, D., prof. Ing. Filip Železný, Ph.D., Garino, C.G., Holec, M.
  • Publication: Proceedings of the 24th International Symposium on Computer-Based Medical Systems. Piscataway: Institute of Electrical and Electronic Engineers, 2011, pp. 1-6. ISBN 978-1-4577-1190-9. Available from: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05999046
  • Year: 2011
  • DOI: 10.1109/CBMS.2011.5999046
  • Link: https://doi.org/10.1109/CBMS.2011.5999046
  • Department: Department of Cybernetics
  • Annotation:
    We propose a technique for semi-automatic construction of gene expression data analysis workflows by grammar-like inference based on predefined workflow templates. The templates represent routinely used sequences of procedures such as normalization, data transformation, classifier learning, etc. Variations of such workflows (such as different instantiations to specific algorithms) may entail significant variance in the quality of the analysis results and our formalism enables to automatically explore such variations. Adhering to proven templates helps preserve the sanity of explored workflows and prevents the combinatorial explosion encountered by fully automatic workflow planners. Here we propose the basic principles of template-based workflow construction and demonstrate their working in the publicly available tool XGENE.ORG for multiplatform gene expression analysis.

A Comparative Evaluation of Gene Set Analysis Techniques in Predictive Classification of Expression Samples

  • Authors: Holec, M., prof. Ing. Filip Železný, Ph.D., doc. Ing. Jiří Kléma, Ph.D., Tolar, J.
  • Publication: International Conference on Bioinformatics, Computational Biology, Genomics and Chemoinformatics. Orlando: International Society for Research in Science and Technology (ISRST), 2010, pp. 7-11. ISBN 978-1-60651-017-9. Available from: http://ida.felk.cvut.cz/cgi-bin/docarc/public.pl/document/148/bcbgc.pdf
  • Year: 2010
  • Department: Department of Cybernetics
  • Annotation:
    We demonstrate how some recently developed techniques of set level gene expression data analysis may be exploited in the context of predictive classification of gene expression samples for the tasks of attribute selection and extraction. With four benchmark gene expression datasets, we empirically test the influence of these method on the predictive accuracy of constructed classification models in a comparative setting. Our results mainly indicate that gene set selection methods (SAM GS and the global test) can boost the predictive accuracy if used with caution.

Mining Frequent Spatial Docking Patterns in Zinc Finger DNA Complexes

  • Authors: Szabóová, A., Ing. Ondřej Kuželka, Ph.D., prof. Ing. Filip Železný, Ph.D., Tolar, J.
  • Publication: Proceedings of the 7th European Symposium on Biomedical Engineering. University of Patras, 2010, pp. !!!. ISBN 978-960-85715-5-6. Available from: http://www.medicon2010.org/MEDICON_2010_20100413_SK.htm
  • Year: 2010
  • Department: Department of Cybernetics
  • Annotation:
    We use techniques of logic based relational machine learning to automatically detect spatial patterns common to 21 previously described examples of zinc finger DNA complexes. We demonstrate that such patterns can be found and thus the proposed methodology may potentially serve to achieve better understanding of zinc finger DNA binding.

Prediction of DNA Binding Proteins from Structural Features

  • Department: Department of Cybernetics
  • Annotation:
    We use logic-based machine learning to distinguish DNA- binding proteins from non-binding proteins. We combine previously suggested coarse-grained features (such as the dipole moment) with automatically constructed structural (spatial) features. Prediction based only on structural features already improves on the state-of-the-art predictive accuracies achieved in previous work with coarse-grained features. Accuracies are further improved when the combination of both feature categories is used. An important factor contributing to accurate prediction is that structural features are not Boolean but rather interpreted by counting the number of their occurences in a learning example.

Seeing the World through Homomorphism: An Experimental Study on Reducibility of Examples

  • Department: Department of Cybernetics
  • Annotation:
    We study reducibility of examples in several typical inductive logic programming benchmarks. The notion of reducibility that we use is related to theta-reduction, commonly used to reduce hypotheses in ILP. Whereas examples are usually not reducible on their own, they often become implicitly reducible when language for constructing hypotheses is fixed. We show that number of ground facts in a dataset can be almost halved for some real-world molecular datasets. Furthermore, we study the impact this has on a popular ILP system Aleph.

Shrinking Covariance Matrices using Biological Background Knowledge

  • Department: Department of Cybernetics
  • Annotation:
    We propose a novel method for covariance matrix estimation based on shrinkage with a target inferred from biological background knowledge using methods of inductive logic programming. We show that our novel method improves on the state of the art when sample sets are small and some background knowledge expressed in a subset of first-order logic is available. As we show in the experiments with genetic data, this background knowledge can be even only indirectly relevant to the modeling problem at hand.

Solving the Shortest Common Supersequence Problem Using Iterative Optimization with Evolved Hypermutations

  • Authors: Kubalík, J., prof. Ing. Filip Železný, Ph.D.,
  • Publication: Proceedings of the Twentieth European Meeting on Cybernetics and Systems Research. Vienna: Austrian Society for Cybernetics Studies, 2010. p. 579-585. ISBN 978-3-85206-178-8.
  • Year: 2010
  • DOI: 10.1145/1569901.1569944
  • Link: https://doi.org/10.1145/1569901.1569944
  • Department: Department of Cybernetics
  • Annotation:
    This paper presents successful application of the POEMS algorithm to the Shortest Common Supersequence problem (SCSP). Series of experiments with POEMS on both the artificial as well as the real biological data sets have been carried out. Results presented in this paper show that POEMS is competitive with standard state-of-the-art heuristic algorithms for solving the SCS problem

Speeding Up Planning through Minimal Generalizations of Partially Ordered Plans

  • Authors: Černoch, R., prof. Ing. Filip Železný, Ph.D.,
  • Publication: Inductive Logic Programming. Heidelberg: Springer, 2010. pp. 269-276. ISSN 0302-9743. ISBN 978-3-642-21294-9.
  • Year: 2010
  • DOI: 10.1007/978-3-642-21295-6_29
  • Link: https://doi.org/10.1007/978-3-642-21295-6_29
  • Department: Department of Cybernetics
  • Annotation:
    We present a novel strategy enabling to exploit existing plans in solving new similar planning tasks by finding a common generalized core of the existing plans. For this purpose we develop an operator yielding a minimal joint generalization of two partially ordered plans. In three planning domains we show a substantial speed-up of planning achieved when the planner starts its search space exploration from the learned common generalized core, rather than from scratch.

Taming the Complexity of Inductive Logic Programming (invited talk)

Using Constraint Satisfaction for Learning Hypotheses in Inductive Logic Programming

  • Department: Department of Cybernetics
  • Annotation:
    Inductive logic programming (ILP) is a subfield of machine learning which uses first-order logic as a uniform representation for examples, background knowledge and hypotheses (Muggleton and De Raedt, 1994). In this paper we deal with a so called template consistency problem, which is one of essential tasks in ILP (Gottlob et al 1999). In particular, given learning examples and template T, we are looking for a substitution s making Ts consistent with the examples using methods from the field of constraint satisfaction.

Advancing Data Mining Workflow Construction: A Framework and Cases using the Orange Toolkit

  • Authors: Žáková, M., Podpečan, V., prof. Ing. Filip Železný, Ph.D., Lavrač, N.
  • Publication: Third Generation Data Mining: Towards Service-Oriented Knowledge Discovery. Ljubljana: Institute Jozef Stefan, 2009. p. 39-51.
  • Year: 2009
  • Department: Department of Cybernetics
  • Annotation:
    The paper presents a framework for automatic construction of data mining workflows based on input and output specification of the data mining task. An ontology of data mining components is used for the formalization of the data mining task, knowledge types and algorithms. The ontology is then used for automated workflow construction using an algorithm that combines planning and ontological reasoning. The paper demonstrates how enhancing the classical planning with ontological reasoning can address some of the challenges of data mining workflow construction, including complex objects passed between the algorithms, constraints formulation and workflow presentation. The proposed methodology was tested in a case study annotating algorithms available in the Orange data mining toolkit and extending Orange to enable the execution of the generated data mining workflows consisting of algorithms available in Orange.

Block-Wise Construction of Acyclic Relational Features with Monotone Irreducibility and Relevancy Properties

  • Department: Department of Cybernetics
  • Annotation:
    We describe an algorithm for constructing a set of acyclic conjunctive relational features by combining smaller conjunctive blocks. Unlike traditional level-wise approaches which preserve the monotonicity of frequency, our block-wise approach preserves a form of monotonicity of the irreducibility and relevancy feature properties, which are important in propositionalization employed in the context of classification learning.

Cross-Genome Knowledge-Based Expression Data Fusion

  • Authors: Holec, M., doc. Ing. Jiří Kléma, Ph.D., prof. Ing. Filip Železný, Ph.D., Bělohradský, J., Tolar, J.
  • Publication: International Conference on Bioinformatics, Computational Biology, Genomics and Chemoinformatics (BCBGC-09). Orlando: International Society for Research in Science and Technology (ISRST), 2009, pp. 43-50. ISBN 978-1-60651-009-4.
  • Year: 2009
  • Department: Department of Cybernetics
  • Annotation:
    This paper presents the web tool XGENE.ORG which facilitates the integration of gene expression measurements with background genomic information, in particular the gene ontology and KEGG pathways. The novelty of the proposed data fusion is in the introduction of working units at different levels of generality acting as sample features, replacing the commonly used gene units, consequently al-lowing for cross-genome (multi-platform) expression data analysis. The integration of different microarray platforms contributes to the robustness of knowledge extracted when single-platform samples are rare and facilitates inference of biological knowledge not constrained to single organisms.

Gene Expression Mining Guided by Background Knowledge

  • DOI: 10.4018/978-1-60566-218-3.ch013
  • Link: https://doi.org/10.4018/978-1-60566-218-3.ch013
  • Department: Department of Cybernetics
  • Annotation:
    This chapter points out the role of genomic background knowledge in gene expression data mining. The authors demonstrate its application in several tasks such as relational descriptive analysis, constraint-based knowledge discovery, feature selection and construction or quantitative association rule mining. The chapter also accentuates diversity of background knowledge. In genomics, it can be stored in formats such as free texts, ontologies, pathways, links among biological entities and many others. The authors hope that understanding of automated integration of heterogeneous data sources helps researchers to reach compact and transparent as well as biologically valid and plausible results of their gene-expression data analysis.

Guest editors' introduction: Special issue on Inductive Logic Programming (ILP 2008)

  • DOI: 10.1007/s10994-009-5120-z
  • Link: https://doi.org/10.1007/s10994-009-5120-z
  • Department: Department of Cybernetics
  • Annotation:
    The 18th International Conference on Inductive Logic Programming was held in Prague, September 10-12, 2008. Apart from four invited talks and a tutorial, the presented peerreviewed papers consisted of 20 submissions accepted as long papers for the main technical track, and 21 papers accepted for the work-in-progress track. The long papers can be found in Volume 5194 of Springer's Lecture Notes in Artificial Intelligence series. Based on the reviews of the papers as well as their presentations at the conference, the ILP-2008 PC chairs invited the authors of 7 selected papers to submit a revised and significantly extended version for this special issue. As a result of an additional peer-review adopting the journal criteria, 5 of these papers were finally accepted.

Integrating Multiple Platform Expression Data through Gene Set Features

  • Authors: Holec, M., prof. Ing. Filip Železný, Ph.D., doc. Ing. Jiří Kléma, Ph.D., Tolar, J.
  • Publication: Bioinformatics Research and Applications. Heidelberg: Springer, 2009, pp. 5-17. Lecture Notes in Computer Science/Lecture Notes in Bioinformatics. ISSN 0302-9743. ISBN 978-3-642-01550-2. Available from: http://www.springerlink.com/content/e1wg776482u11q5m/?p=f43ab731d3d5432a887645b4f674a591&pi=1
  • Year: 2009

A Restarted Strategy for Efficient Subsumption Testing

  • Department: Department of Cybernetics
  • Annotation:
    We design a randomized restarted subsumption testing algorithm ReSumEr2 and test it on generated graph data as well as on the predictive toxicology challenge (PTC) data set.

Fast Estimation of First Order Clause Coverage through Randomization and Maximum Likelihood

  • Department: Department of Cybernetics
  • Annotation:
    In relational learning, {$\theta$}-subsumption and OI-subsumption are widely used coverage tests. Unfortunately, testing both {$\theta$}-subsumption and OI-subsumption is NP-complete, which is one of the reasons for poor performance of most relational learners. In this paper, we aim at a possible exploitation of randomized restarted search strategies for coverage estimation. We devise an algorithm that can, under a distributional assumption, estimate coverage between a given hypothesis and a set of examples without having to decide the subsumption test for all examples. We implement this algorithm in programs {\sc ReCovEr} and {\sc ReCovErOI}. On artificial graph data and real-world datasets, we show that these algorithms can provide reasonably accurate estimates while achieving favorable runtimes as compared to state-of-the-art algorithms.

Gene Expression Data Mining Guided by Genomic Background Knowledge

  • Department: Department of Cybernetics
  • Annotation:
    This paper deals with various method of gene expression data mining guided by genomic background knowledge, namely by the gene ontology.

HiFi: Tractable Propositionalization through Hierarchical Feature Construction

  • Department: Department of Cybernetics
  • Annotation:
    We present a novel propositionalization algorithm HiFi based on constructing first order features with hierarchical structure.

Learning Relational Descriptions of Differentially Expressed Gene Groups

  • Authors: Trajkovski, I., prof. Ing. Filip Železný, Ph.D., Lavrač, N., Tolar, J.
  • Publication: IEEE Transactions on Systems, Man, and Cybernetics: Part C. 2008, 38(1), 16-25. ISSN 1094-6977.
  • Year: 2008

Learning to Plan and Planning to Learn via Merging Relational Machine Learning with Constraint Satisfaction (extended abstract)

  • Department: Department of Cybernetics
  • Annotation:
    The purpose of this submission is to present our newly started research project entitled "LeCoS: Meeting Machine Learning with Constraint Satisfaction" concerned with exploring how to exploit relational machina learning (primarily inductive logic programming, ILP) to solve constraint satisfaction problems (promařily planning tasks), and vice versa.

Mining the Strongest Patterns in Medical Sequential Data

  • Department: Department of Cybernetics
  • Annotation:
    Sequential data represent an important source of automatically mined and potentially new medical knowledge. They can originate in various ways. Within the presented domain they come from a longitudinal preventive study of atherosclerosis - the data consists of series of long-term observations recording the development of risk factors and associated conditions. The intention is to identify frequent sequential patterns having any relation to an onset of any of the observed cardiovascular diseases. This paper focuses on application of inductive logic programming. The prospective patterns are based on first-order features automatically extracted from the sequential data. The features are further grouped in order to reach final complex patterns expressed as rules. The presented approach is also compared with the approaches published earlier (windowing, episode rules).

Planning to Learn with a Knowledge Discovery Ontology

  • Department: Department of Cybernetics
  • Annotation:
    This paper addresses the problem of semi automatic design of workflows for complex knowledge discovery tasks. Assembly of optimized knowledge discovery workflows requires awareness of and extensive knowledge about the principles and mutual relations between diverse data processing and mining algorithms. We aim at alleviating this burden by automatically proposing workflows for the given type of inputs and required outputs of the discovery process. The methodology adopted in this study is to define a formal conceptualization of knowledge types and data mining algorithms and design a planning algorithm, which extracts constraints from this conceptualization for the given user's input output requirements. We demonstrate our approach in two use cases, one from scientific discovery in genomics and another from advanced engineering

Predicting Gene Coexpression from Pathway Relations

  • Authors: Moulík, K., prof. Ing. Filip Železný, Ph.D.,
  • Publication: Inductive Logic Programming, 18th International Conference, Late Breaking Papers. Praha: Agentura Action M, 2008, pp. 81-86. ISBN 978-80-86742-26-7.
  • Year: 2008
  • Department: Department of Cybernetics
  • Annotation:
    We address the problem of learning to predict the coexpression of gene pairs, given background knowledge consisting of the descriptions of biological pathways in which the genes are involved.

Sequential Data Mining: A Comparative Case Study in Development of Atherosclerosis Risk Factors

  • DOI: 10.1109/TSMCC.2007.906055
  • Link: https://doi.org/10.1109/TSMCC.2007.906055
  • Department: Department of Cybernetics
  • Annotation:
    Sequential data represent an important source of potentially new medical knowledge. However, this type of data is rarely provided in a format suitable for immediate application of conventional mining algorithms. This paper summarizes and compares three different sequential mining approaches, based respectively on windowing, episode rules and inductive logic programming. Windowing is one of the essential methods of data preprocessing, episode rules represent general sequential mining while inductive logic programming extracts first order features whose structure is determined by background knowledge. The three approaches are demonstrated and evaluated in terms of a case study STULONG. It is a longitudinal preventive study of atherosclerosis where the data consist of series of longterm observations recording the development of risk factors and associated conditions.

Using Bio-Pathways in Relational Learning

  • Department: Department of Cybernetics
  • Annotation:
    We compiled expression and pathway data related to a specific biological classification problem, into a relational database in Prolog, providing benchmarking material for ILP experimentation.

Using Ontological Reasoning and Planning for Data Mining Workflow Composition

  • Department: Department of Cybernetics
  • Annotation:
    This paper addresses the problem of semi automatic design of workflows for complex knowledge discovery tasks.

Exploiting Term, Predicate, and Feature Taxonomies in Propositionalization and Propositional Rule Learning

  • Authors: Žáková, M., prof. Ing. Filip Železný, Ph.D.,
  • Publication: Machine Learning 2007. Heidelberg: Springer, 2007. p. 798-805. ISSN 0302-9743. ISBN 978-3-540-74957-8.
  • Year: 2007
  • Department: Department of Cybernetics
  • Annotation:
    Knowledge representations using semantic web technologies often provide information which translates to explicit term and predicate taxonomies in relational learning. Here we show how to speed up the process of propositionalization of relational data by orders of magnitude, by exploiting such ontologies through a novel refinement operator used in the construction of conjunctive relational features. Moreover, we accelerate the subsequent search conducted by a propositional learning algorithm by providing it with information on feature generality taxonomy, determined from the initial term and predicate taxonomies but also accounting for traditional $\theta$ subsumption between features. This information enables the propositional rule learner to prevent the exploration of useless conjunctions containing a feature together with any of its subsumees and to specialize a rule by replacing a feature by its subsumee.

ILP Through Propositionalization and Stochastic k-Term DNF Learning

  • Authors: Paes, A., prof. Ing. Filip Železný, Ph.D., Zaverucha, G., Page, D., Srinivasan, A.
  • Publication: Inductive Logic Programming - 16th International Conference, ILP 2006, Santiago de Compostela, Spain, August 24-27, 2006, Revised Selected Papers. Heidelberg: Springer, 2007. p. 379-393. ISSN 0302-9743. ISBN 978-3-540-73846-6.
  • Year: 2007

Relational Data Mining Applied to Virtual Engineering of Product Designs

  • Authors: Žáková, M., prof. Ing. Filip Železný, Ph.D., Garcia-Sedano, J., Massia Tisot, C., Lavrač, N., Ing. Petr Křemen, Ph.D., Molina, J.
  • Publication: Inductive Logic Programming - 16th International Conference, ILP 2006, Santiago de Compostela, Spain, August 24-27, 2006, Revised Selected Papers. Heidelberg: Springer, 2007, pp. 439-453. ISSN 0302-9743. ISBN 978-3-540-73846-6.
  • Year: 2007

Relational Data Mining through Propositionalization and Subsequent Propositional Search for Semantic Virtual Engineering

  • Authors: Žáková, M., prof. Ing. Filip Železný, Ph.D., Ing. Petr Křemen, Ph.D., Tissot, C.M., Lavrač, N.
  • Publication: KCAP 2007 Workshop on Knowledge Management and Semantic Web for Engineering Design (KW4ED'07). New York: ACM, 2007, pp. 49-58. ISBN 978-1-59593-643-1.
  • Year: 2007
  • Department: Department of Cybernetics
  • Annotation:
    Contemporary product design based on 3D CAD tools aims at improved effciency using integrated engineering environments with access to databases of existing designs, associated documents and enterprize resource planning. The project SEVENPRO addresses this issue by developing a framework for integration of the different information sources through a layer of semantic annotations and tools for enhancing management and reuse of engineering knowledge. The discovery of implicit knowledge by means of relational data mining is expected to be an important facility for reusing of engineering knowledge. Therefore we are developing a system for relational data mining, which could be seamlessly integrated into the semantic engineering environment.

Relational Pattern and Subgroup Discovery in CAD Documents

  • Authors: Žáková, M., Ing. Petr Křemen, Ph.D., prof. Ing. Filip Železný, Ph.D., Sedano, J.A., Tisot, C.M., Lavrač, N., Molina, J.
  • Publication: Znalosti 2007 Sborník příspěvků 6. ročníku konference. Ostrava: VŠB - Technical University of Ostrava, 2007, pp. 276-287. ISBN 978-80-248-1279-3.
  • Year: 2007
  • Department: Department of Cybernetics
  • Annotation:
    This paper applies the methodology of subgroup discovery which first seeks frequent structural patterns in CAD designs and then discovers interesting (large and class-biased) design subgroups defined in terms of the frequent patterns its members contain. Besides the two mentioned kinds of data mining results, further categories of patterns will be required to mine in the present application domain. The large amount of generated results of each kind calls for establishing an efficient pattern management and search functionality. For this purpose we elaborate an ontology of relational data mining results.

ILP through Propositionalization and Stochastic k-term DNF Learning

  • Authors: Paes, A., prof. Ing. Filip Železný, Ph.D., Zaveruch, G., Page, D., Srinivasan, A.
  • Publication: 16th International Conference on Inductive Logic Programming. Corunna: University of Corunna, 2006. p. 164-166. ISBN 84-9749-206-4.
  • Year: 2006

Propositionalization based relational subgroup discovery with RSD

  • DOI: 10.1007/s10994-006-5834-0
  • Link: https://doi.org/10.1007/s10994-006-5834-0
  • Department: Department of Cybernetics
  • Annotation:
    Relational rule learning algorithms are typically designed to construct classification and prediction rules. However, relational rule learning can be adapted also to subgroup discovery. This paper proposes a propositionalization approach to relational subgroup discovery, achieved through appropriately adapting rule learning and first-order feature construction. The proposed approach was successfully applied to standard ILP problems (East-West trains, King-Rook-King chess endgame and mutagenicity prediction) and two real-life problems (analysis of telephone calls and traffic accident analysis).

Randomised Restarted Search in ILP

  • DOI: 10.1007/s10994-006-7733-9
  • Link: https://doi.org/10.1007/s10994-006-7733-9
  • Department: Department of Cybernetics
  • Annotation:
    Recent statistical performance studies of search algorithms in difficult combinatorial problems have demonstrated the benefits of randomising and restarting the search procedure. Specifically, it has been found that if the search cost distribution of the non-restarted randomised search exhibits a slower-than-exponential decay (that is, a ``heavy tail''), restarts can reduce the search cost expectation. We report on an empirical study of randomised restarted search in ILP. Our experiments conducted on a high-performance distributed computing platform provide an extensive statistical performance sample of five search algorithms operating on two principally different classes of ILP problems, one represented by an artificially generated graph problem and the other by three traditional classification benchmarks (mutagenicity, carcinogenicity, finite element mesh design).

Relational Data Mining Applied to Virtual Engineering of Product Designs

  • Authors: Žáková, M., prof. Ing. Filip Železný, Ph.D., Garcia-Sedano, J., Masia Tissot, C., Lavrac, N., Ing. Petr Křemen, Ph.D., Molina, J.
  • Publication: 16th International Conference on Inductive Logic Programming. Corunna: University of Corunna, 2006, pp. 231-233. ISBN 84-9749-206-4.
  • Year: 2006
  • Department: Department of Cybernetics
  • Annotation:
    Contemporary product design based on 3D CAD tools aims at improved efficiency using integrated engineering environments with access to databases of existing designs, associated documents and enterprise resource planning (ERP). One of the goals of the SEVENPRO project is to achieve design process improvements through the utilization of relational data mining (RDM), utilizing past designs and commonly agreed design ontologies. In this paper we discuss the applicability of state-of-the-art ILP systems to the RDM tasks in this application domain as well as the implied challenges to ILP research and system implementations.

Relational Descriptive Analysis of Gene Expression Data

  • Authors: Trajkovski, I., prof. Ing. Filip Železný, Ph.D., Tolar, J., Lavrač, N.
  • Publication: STAIRS 2006 - Proceedings of the Third Starting AI Researchers' Symposium. Oxford: IOS Press, 2006. p. 184-194. ISBN 1-58603-645-9.
  • Year: 2006
  • Department: Department of Cybernetics
  • Annotation:
    This paper presents a method that uses gene ontologies together with the paradigm of relational subgroup discovery, to help find description of groups of genes differentially expressed in specific cancers. The descriptions are represented by means of relational features, extracted from publicly available gene ontology information, and are straightforwardly interpretable by the medical experts. We applied the proposed method to two known data sets: (i) acute lymphoblastic leukemia (ALL) vs. acute myeloid leukemia (AML) and (ii) classification of fourteen types of cancer. Significant number of discovered groups of genes had a description, confirmed by the medical expert, which highlighted the underlying biological process that is responsible for distinguishing one class from the other classes.

Relational Subgroup Discovery for Descriptive Analysis of Microarray Data

  • Authors: Trajkovski, I., prof. Ing. Filip Železný, Ph.D., Tolar, J., Lavrac, N.
  • Publication: Computational Life Sciences II. Heidelberg: Springer, 2006. p. 86-96. ISSN 0302-9743. ISBN 3-540-45767-4.
  • Year: 2006
  • Department: Department of Cybernetics
  • Annotation:
    This paper presents a method that uses gene ontologies, together with the paradigm of relational subgroup discovery, to help find description of groups of genes differentially expressed in specific cancers.

Summarizing gene-expression-based classifiers by meta-mining comprehensible relational patterns

  • Authors: prof. Ing. Filip Železný, Ph.D., Štěpánková, O., Tolar, J., Lavrač, N.
  • Publication: Proceedings of the Fourth IASTED International Conference on BIOMEDICAL ENGINEERING. Zürich: Acta Press, 2006. p. 19-24. ISBN 0-88986-576-0.
  • Year: 2006
  • Department: Department of Cybernetics
  • Annotation:
    We propose a methodology for predictive classification from gene expression data, able to combine the robustness of high-dimensional statistical classification methods with the comprehensibility and interpretability of simple logic-based models. We first construct a robust classifier combining contributions of a large number of gene expression values, and then (meta)-mine the classifier for compact summarizations of subgroups among genes associated with a given class therein. The subgroups are described by means of relational logic features extracted from publicly available gene ontology information. The curse of dimensionality pertaining to the gene expression based classification problem due to the large number of attributes (genes) is turned into an advantage in the secondary, meta-mining task as here the original attributes become learning examples.

Towards Adaptive Problem Solves

An Abstract Architecture for Computational Reflection in Multi-Agent Systems

  • Department: Department of Cybernetics
  • Annotation:
    We propose a general framework for computational reflection in multi-agent systems and address some technical issues related to its implementation. Potentials of computational models of cognition and reflection in a multi-agent system are detailed, and using such models, an abstract architecture of a reflective agent is designed. We also propose important characteristics of reflective multi-agent systems build upon the presented architecture.

Analysis of Genomic Data By Machine Learning Algorithms

Artificial intelligence based inference of knowledge from gene expression microarray data

  • Department: Department of Cybernetics
  • Annotation:
    We address the potentials and state-of-the-art achievements in applying artificial intelligence methods to automatically infer useful knowledge from data produced by the gene expression microarray technology

Artificial Intelligence Methods for Gene Expression Data Analysis

  • Authors: prof. Ing. Filip Železný, Ph.D.,
  • Publication: Analytická cytometrie III. Brno: Česká společnost pro analytickou cytologii, 2005, pp. 29-30. ISBN 80-239-5155-6.
  • Year: 2005
  • Department: Department of Cybernetics
  • Annotation:
    A review of artificial intelligence methods for gene expression data analysis

Efficient Construction of Relational Features

  • Authors: prof. Ing. Filip Železný, Ph.D.,
  • Publication: 4th International Conference on Machine Learning and Applications - Proceedings. Los Alamitos: IEEE Computer Society Press, 2005. p. 259-264. ISBN 0-7695-2495-8.
  • Year: 2005
  • Department: Department of Cybernetics
  • Annotation:
    Devising algorithms for learning from multi-relational data is currently considered an important challenge. The wealth of traditional single-relational machine learning tools, on the other hand, calls for methods of `propositionalization', ie. Conversion of multi-relational data into single-relational representations. A major stream of propositionalization algorithms is based on the construction of truth-valued features (first-order logic atom conjunctions), which capture relational properties of data and play the role of binary attributes in the resulting single-table representation. Such algorithms typically use backtrack depth first search for the syntactic construction of features complying to user's mode/type declarations. As such they incur a complexity factor exponential in the maximum allowed feature size.

Efficient Sampling in Relational Feature Spaces

  • Authors: prof. Ing. Filip Železný, Ph.D.,
  • Publication: Inductive Logic Programming. Berlin: Springer, 2005. p. 397-413. ISSN 0302-9743. ISBN 3-540-28177-0.
  • Year: 2005
  • Department: Department of Cybernetics
  • Annotation:
    State-of-the-art algorithms implementing the `extended transformation approach' to propositionalization use backtrack depth first search for the construction of relational features (first order atom conjunctions) complying to user's mode/type declarations and a few basic syntactic conditions. As such they incur a complexity factor exponential in the maximum allowed feature size. Here I present an alternative based on an efficient reduction of the feature construction problem on the propositional satisfiability (SAT) problem, such that the latter involves only Horn clauses and is therefore tractable: a model to a propositional Horn theory can be found without backtracking in time linear in the number of literals contained.

How Computers Discover How Computers Discover (a review of algorithmic meta-discovery)

  • Authors: prof. Ing. Filip Železný, Ph.D.,
  • Publication: Interdisciplinary Aspects of Human-Machine Co-existence and Co-operation. Praha: Vydavatelství ČVUT, 2005, pp. 175-189. ISBN 80-01-03275-2.
  • Year: 2005
  • Department: Department of Cybernetics
  • Annotation:
    In traditional exerimental sciences one forms hypotheses explaining an empirical sample of some natural phenomenon. The hypothesis discovery process can in certain cases be automated using machine learning algorithms. I discuss some consequences of viewing scientific discovery, primarily computer based, itself as a natural phenomenon, whose empirical sample may be analyzed or hypothesized about. This viewpoint frames some recent findings about the statistical properties of computationally intensive discovery processes (phase transitions in complexity of learning, heavy-tailed runtime distributions of hypothesis search), empirical assessments of the Occam's razor principle, as well as the issue of meta learning (such as `discovering how to discover best'). Here I review these points and conclude with a few speculative thoughts.

Mining the Strongest Patterns in Medical Sequential Data

  • Department: Department of Cybernetics
  • Annotation:
    Sequential data represent an important source of automatically mined and potentially new medical knowledge. They can originate in various ways. Within the presented domain they come from a longitudinal preventive study of atherosclerosis - the data consist of series of long-term observations recording the development of risk factors and associated conditions. The intention is to identify frequent sequential patterns having any relation to an onset of any of the observed cardiovascular diseases. This paper focuses on application of inductive logic programming. The prospective patterns are based on first-order features automatically extracted from the sequential data. The features are further grouped in order to reach final complex patterns expressed as rules. The presented approach is also compared with the approaches published earlier (windowing, episode rules).

Modelling of Agents' Behavior with Semi-Collaborative Meta-Agents

  • Department: Department of Cybernetics
  • Annotation:
    An autonomous agent may largely benefit from its ability to reconstruct another agent's reasoning principles from records of past events and general knowledge about the world. In our approach, the meta-agent maintains a first-order logic theory, called the community model, yielding predictions about other agents' decisions. In this contribution we introduce a query-based collective reasoning process where the semi-collaborative meta-agents use active learning technique to improve their models. We provide empirical results that demonstrate the viability of the concept and show the benefits of collective meta-reasoning.

Relational Subgroup Discovery for Gene Expression Data Mining

  • Authors: prof. Ing. Filip Železný, Ph.D., Tolar, J., Lavrač, N., Štěpánková, O.
  • Publication: The 3rd European Medical and Biological Engineering Conference - EMBEC´05. Praha: Společnost biomedicínského inženýrství a lékařské informatiky ČLS JEP, 2005, ISSN 1727-1983.
  • Year: 2005
  • Department: Department of Cybernetics
  • Annotation:
    We propose a methodology for predictive classification from gene expression data, able to combine the robustness of high-dimensional statistical classification methods with the comprehensibility and interpretability of simple logic-based models. We first construct a robust classifier combining contributions of a large number of gene expression values, and then search for compact summarizations of subgroups among genes associated in the classifier with a given class. The subgroups are described by means of relational logic features extracted from publicly available gene annotations. The curse of dimensionality pertaining to the gene expression based classification problem due to the large number of attributes (genes) is turned into an advantage in the secondary subgroup discovery task, as here the original attributes become learning examples.

Tractable Construction of Relational Features

  • Department: Department of Cybernetics
  • Annotation:
    A popular technique for converting multi-relational data into a single-relational form is based on constructing truth-valued relational features of the data instances, where the features play the role of binary attributes in the resulting representation. Here I consider a simple relational feature language whose formulas correspond to conjunctions of first-order atoms where arguments are only variables and respect user-defined constraints on types and input/output modes. I show a sufficient condition for polynomial time construction of such formulas and give preliminary results on tractable enumeration of complete sets of such formulas.

A Monte Carlo Study of Randomised Restarted Search in ILP

  • Authors: prof. Ing. Filip Železný, Ph.D., Srinivasan, A., Page, D.
  • Publication: Inductive Logic Programming. Berlin: Springer, 2004. p. 341-358. ISSN 0302-9743. ISBN 3-540-22941-8.
  • Year: 2004
  • Department: Department of Cybernetics
  • Annotation:
    Recent statistical performance surveys of search algorithms in difficult combinatorial problems have demonstrated the benefits of randomising and restarting the search procedure. Specifically, it has been found that if the search cost distribution (SCD) of the non-restarted randomised search exhibits a slower-than-exponential decay (that is, a ``heavy tail''), restarts can reduce the search cost expectation. Recently, this heavy tail phenomenon was observed in the SCD's of benchmark ILP problems. % \cite{?}. Following on this work, we report on an empirical study of randomised restarted search in ILP. Our experiments, conducted over a cluster of a few hundred computers, provide an extensive statistical performance sample of five search algorithms operating on two principally different ILP problems (artificially generated graph data and the well-known ``mutagenesis'' problem).

Efficiency-conscious Propositionalization for Relational Learning

  • Department: Department of Cybernetics
  • Annotation:
    Systems aiming at discovering interesting knowledge in data, now commonly called data mining systems, are typically employed in finding patterns in a single relational table. Most of mainstream data mining tools are not applicable in the more challenging task of finding knowledge in structured data represented by a multi-relational database. Although a family of methods known as inductive logic programming have been developed to tackle that challenge by immediate means, the idea of adapting structured data into a simpler form digestible by the wealth of AVL systems has been always tempting to data miners. To this end, we present a method based on constructing first-order logic features that conducts this kind of conversion, also known as propositionalization. It incorporates some basic principles suggested in previous research and provides significant enhancements that lead to remarkable improvements in efficiency of the feature- construction process.

Induction of comprehensible models for gene expression datasets by subgroup discovery methodology

  • Authors: Gamberger, D., Lavrač, N., prof. Ing. Filip Železný, Ph.D., Tolar, J.
  • Publication: 2nd EUNITE Workshop on Intelligent Technologies for GeneExpression Based Individualized Medicine. Jena: BioControl Jena GmbH, 2004, pp. 269-284.
  • Year: 2004
  • Department: Department of Cybernetics
  • Annotation:
    Finding disease markers (classifiers) from gene expression data by machine learning algorithms is characterized by an especially high risk of overfitting the data due the abundance of attributes (simultaneously measured gene expression values) and shortage of available examples (observations). To avoid this pitfall and achieve predictor robustness, state-of-art approaches construct complex classifiers that combine relatively weak contributions of up to thousands of genes (attributes) to classify a disease. The complexity of such classifiers limits their transparency and consequently the biological insight they can provide. The goal of this study is to apply to this domain the methodology of constructing simple yet robust logic-based classifiers amenable to direct expert interpretation. On two well-known, publicly available gene expression classification problems, we show the feasibility of this approach, employing a recently developed subgroup discovery methodology.

Induction of Comprehensible Models for Gene Expression Datasets by Subgroup Discovery Methodology

  • Authors: Gamberger, D., Lavrač, N., prof. Ing. Filip Železný, Ph.D., Tolar, J.
  • Publication: Journal of Biomedical Informatics. 2004, 37(4), 269-284. ISSN 1532-0464.
  • Year: 2004
  • Department: Department of Cybernetics
  • Annotation:
    Finding disease markers (classifiers) from gene expression data by machine learning algorithms is characterized by an especially high risk of overfitting the data due the abundance of attributes (simultaneously measured gene expression values) and shortage of available examples (observations). To avoid this pitfall and achieve predictor robustness, state-of-art approaches construct complex classifiers that combine relatively weak contributions of up to thousands of genes (attributes) to classify a disease. The complexity of such classifiers limits their transparency and consequently the biological insight they can provide. The goal of this study is to apply to this domain the methodology of constructing simple yet robust logic-based classifiers amenable to direct expert interpretation. On two well-known, publicly available gene expression classification problems, we show the feasibility of this approach, employing a recently developed subgroup discovery methodology.

Local Patterns: Theory and Practice of Constraint-Based Relational Subgroup Discovery

  • Authors: Lavrač, N., prof. Ing. Filip Železný, Ph.D., Džeroski, S.
  • Publication: Local Pattern Detection. Berlin: Springer, 2004. p. 71-88. ISBN 3-540-26543-0.
  • Year: 2004
  • Department: Department of Cybernetics
  • Annotation:
    The paper deals with the theory and practice of constraint-based relational subgroup discovery

Comparative Evaluation of Approaches to Propositionalization

  • Authors: Krogel, M.A., Rawles, S., prof. Ing. Filip Železný, Ph.D., Flach, P., Lavrač, N., Wrobel, S.
  • Publication: Inductive Logic Programming, 13th International Conference (ILP 2003). Berlin: Springer, 2003. p. 197-214. ISBN 3-540-20144-0.
  • Year: 2003

Constraint-Based Relational Subgroup Discovery

  • Authors: prof. Ing. Filip Železný, Ph.D., Lavrač, N., Džeroski, S.
  • Publication: Proceedings of the 2nd International Workshop on Multi-Relational Data Mining (MRDM 2003). Washington, DC: ACM SIGKDD, 2003, pp. 135-150.
  • Year: 2003

Data-Mining and Decision-Support Systems Integration

  • Department: Department of Cybernetics
  • Annotation:
    The project tries to integrate two modern fields: data mining and decision support. The project aim is to propose and standardize formats for data exchange and sharing between DM and DS systems as well as to suggest and test tools supporting integration. The paper focuses on the two case studies giving practical examples of such an integration.

Efficient transformation of multi-relational databases into single-relational representations

  • Authors: prof. Ing. Filip Železný, Ph.D., Štěpánková, O.
  • Publication: Znalosti 2003 - sborník příspěvků 2. ročníku konference. Ostrava: VŠB-TUO, 2003, pp. 242-251. ISBN 80-248-0229-5.
  • Year: 2003
  • Department: Department of Cybernetics
  • Annotation:
    We describe an implementation of a system capable of producing a single-relational representation of a multi-relational database, thereby allowing for the exploitation of classical datamining algorithms even for multi-relational problems.

Lattice-Search Runtime Distributions May Be Heavy-Tailed

  • Authors: prof. Ing. Filip Železný, Ph.D., Srinivasan, A., Page, D.
  • Publication: Inductive Logic Programming. Düsseldorf: Springer VDI Verlag, 2003. p. 333-345. ISBN 3-540-00567-6.
  • Year: 2003
  • Department: Department of Cybernetics
  • Annotation:
    The paper shows that the runtime distribution of lattice searches conducted by an ILP algorithm when solving two example important learning problems are heavy-tailed, ie.they decay slower than exponentially. This observation is exploited by an adaptation of the method of rapid randomized restarts towards a significant reduction of the average runtime

Machine Learning Algorithms for Data Mining

  • Department: Department of Cybernetics
  • Annotation:
    The chapter describes machine learning algorithms often applied in predictive and descriptive data mining. It introduces data mining as a process and discusses its phases (data understanding and pre-processing, evaluation, etc.). Several specific data mining tasks are also described (time series, text mining, multi-relational data).

Mining Interesting Patterns

  • Department: Department of Cybernetics
  • Annotation:
    Data mining is an area that grew into recognizable scientific and engineering discipline through nineties. The article gives a general overview of the area and briefly describes selected mining techniques

RSD: Relational Subgroup Discovery through First-order Feature Construction

  • Authors: Lavrac, N., prof. Ing. Filip Železný, Ph.D., Flach, P.
  • Publication: Inductive Logic Programming. Düsseldorf: Springer VDI Verlag, 2003. p. 149-165. ISBN 3-540-00567-6.
  • Year: 2003
  • Department: Department of Cybernetics
  • Annotation:
    The paper presents a methodology for relational subgroup discovery. First a set of first-order logic features is constructed that is used as the data attribute set. Then a propositional rule learning algorithm is used, adapted to suit the subgroup discovery interests

Using Constraints in Relational Subgroup Discovery

  • Authors: prof. Ing. Filip Železný, Ph.D., Lavrač, N., Džeroski, S.
  • Publication: International Conference on Methodology and Statistics. University of Ljubljana, 2003, pp. 78-81. ISBN 961-235-128-7.
  • Year: 2003

A Learning System for Decision Support in Telecommunications

  • Authors: prof. Ing. Filip Železný, Ph.D., Štěpánková, O., Zídek, J.
  • Publication: Soft-Ware 2002: Computing in an Imperfect World. Berlin: Springer, 2002. p. 88-101. ISBN 3-540-43481-X.
  • Year: 2002
  • Department: Department of Cybernetics
  • Annotation:
    overview of a project on applications of intelligent methods for the optimization of a switchboard operation

Data mining and decision support in telecommunications

  • Authors: prof. Ing. Filip Železný, Ph.D., Zídek, J.
  • Publication: Intelligent Methods for Quality Improvement in Industrial Practice. Praha: ČVUT FEL, Katedra kybernetiky - Gerstnerova laboratoř, 2002, pp. 107-114. ISSN 1213-3000.
  • Year: 2002
  • Department: Department of Cybernetics
  • Annotation:
    overview of a project on applications of intelligent methods for the optimization of a switchboard operation

Data Mining for Decision Support in Marketing: A Case Study in Targeting a Marketing Campaign

  • Authors: Cestnik, B., Lavrač, N., prof. Ing. Filip Železný, Ph.D., Gamberger, D., Todorovski, L., Kline, M.
  • Publication: Workshop Proceedings ECML/PKDD 2002. Helsinki: Helsinki University of Technology, 2002, pp. 25-34. ISBN 952-10-0634-X.
  • Year: 2002

SumatraTT: Towards a Universal Data Preprocessor

  • Authors: Aubrecht, P., prof. Ing. Filip Železný, Ph.D., Mikšovský, P., Štěpánková, O.
  • Publication: Cybernetics and Systems 2002, volume II. Vienna: Austrian Society for Cybernetics Studies, 2002, pp. 818-823. ISBN 3-85206-160-1.
  • Year: 2002
  • Department: Department of Cybernetics
  • Annotation:
    A review on the current state-of-art and perspectives on the development of the Sumatra transformation tool

Lattice-Search Runtime Distributions May Be Heavy-Tailed

  • Authors: prof. Ing. Filip Železný, Ph.D., Srinivasan, A., Page, D.
  • Publication: Inductive Logic Programming. Berlin: Springer, 2001, pp. 333-345. ISBN 3-540-42538-1.
  • Year: 2001

Learning Functions from Imperfect Positive Data

Progress in SumatraTT: ILP Connectivity and More New Features

  • Authors: Aubrecht, P., prof. Ing. Filip Železný, Ph.D., Mikšovský, P., Štěpánková, O.
  • Publication: Information Society IS 2001. Ljubljana: Institut Jozef Stefan, 2001, pp. 107-110. ISBN 961-6303-34-1.
  • Year: 2001

Relational Subgroup Discovery through First-order Feature Construction

  • Authors: Lavrac, N., prof. Ing. Filip Železný, Ph.D., Flach, P.
  • Publication: Inductive Logic Programming. Berlin: Springer, 2001, pp. 149-165. ISBN 3-540-42538-1.
  • Year: 2001

The Scope and Limitations of Inductive Logic Programming

  • Authors: prof. Ing. Filip Železný, Ph.D., Mikšovský, P., Štěpánková, O., Zídek, J.
  • Publication: Proceedings of Workshop 2001. Praha: České vysoké učení technické v Praze, 2001, pp. 280-281. ISBN 80-01-02335-4.
  • Year: 2001

Big General Encyklopedia (a-bag)

Big General Encyklopedia (bah-bug)

Big General Encyklopedia (bur-doz)

ILP for Automated Telephony

  • Authors: prof. Ing. Filip Železný, Ph.D., Mikšovský, P., Štěpánková, O., Zídek, J.
  • Publication: Inductive Logic Programming Work-in-Progress Reports. London: Imperial College Press, 2000, pp. 276-286.
  • Year: 2000

KDD in Telecommunications

  • Authors: prof. Ing. Filip Železný, Ph.D., Mikšovský, P., Štěpánková, O., Zídek, J.
  • Publication: DDMI 2000 Workshop. Porto: Universidade de Porto, 2000, pp. 103-112.
  • Year: 2000

DNA Computing

Financial Data Challenge

DNA Computing

Responsible person Ing. Mgr. Radovan Suk