prof. Ing. Václav Šmídl, Ph.D.

Deep anomaly detection on set data: Survey and comparison

Authors: Ing. Michaela Mašková, Ing. Matěj Zorek, doc. Ing. Tomáš Pevný, Ph.D., prof. Ing. Václav Šmídl, Ph.D.
Publication: Pattern Recognition. 2024, 151. ISSN 0031-3203.
Year: 2024

DOI: 10.1016/j.patcog.2024.110381
Link: https://doi.org/10.1016/j.patcog.2024.110381
Affiliation: Department of Computer Science, Artificial Intelligence Center
Abstract:
Detecting anomalous samples in set data is a problem attracting increased interest due to novel modalities, such as point-cloud data produced by lidars. Novel methods including those based on deep neural networks are often tuned for a single purpose prohibiting intuition of how relevant they are for another purpose or application domains. The aim of this survey is to: (i) review elementary concepts of anomaly detection of set data, (ii) identify the building blocks of deep anomaly detectors, and (iii) analyze the impact of these blocks on performance. The impact is studied in a large experimental comparison on a variety of benchmark datasets. The results reveal that the main factor determining the performance is the type of anomalies in the dataset. While deep methods embedding the whole set to a single fixed vector perform well on point cloud data, the methods embedding each feature vector independently are better for datasets from multi-instance learning. Moreover, sophisticated methods utilizing transformer blocks are frequently inferior to simple models with properly optimized hyperparameters. An independent factor in performance is the cardinality of sets, the proper treatment of which remains an open problem, as the existing analytical solution was found to be inaccurate.

Malicious Internet Entity Detection Using Local Graph Inference

Authors: Mandlík, Š., doc. Ing. Tomáš Pevný, Ph.D., prof. Ing. Václav Šmídl, Ph.D., Bajer, L.
Publication: IEEE Transactions on Information Forensics and Security. 2024, 19, 3554-3566. ISSN 1556-6013.
Year: 2024

DOI: 10.1109/TIFS.2024.3360867
Link: https://doi.org/10.1109/TIFS.2024.3360867
Affiliation: Artificial Intelligence Center
Abstract:
Detection of malicious behavior in a large network is a challenging problem for machine learning in computer security, since it requires a model with high expressive power and scalable inference. Existing solutions struggle to achieve this feat—current cybersec-tailored approaches are still limited in expressivity, and methods successful in other domains do not scale well for large volumes of data, rendering frequent retraining impossible. This work proposes a new perspective for learning from graph data that is modeling network entity interactions as a large heterogeneous graph. High expressivity of the method is achieved with neural network architecture HMILnet that naturally models this type of data and provides theoretical guarantees. The scalability is achieved by pursuing local graph inference, i.e., classifying individual vertices and their neighborhood as independent samples. Our experiments exhibit improvement over the state-of-the-art Probabilistic Threat Propagation (PTP) algorithm, show a further threefold accuracy improvement when additional data is used, which is not possible with the PTP algorithm, and demonstrate the generalization capabilities of the method to new, previously unseen entities.

Sum-Product-Set Networks: Deep Tractable Models for Tree-Structured Graphs

Authors: Ing. Milan Papež, Ph.D., Ing. Martin Rektoris, prof. Ing. Václav Šmídl, Ph.D., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceeding The Twelfth International Conference on Learning Representations (ICLR 2024). Waset.org: World Academy of Science, Engineering and Technology, 2024. ISBN 9781713898658.
Year: 2024

Affiliation: Artificial Intelligence Center
Abstract:
Daily internet communication relies heavily on tree-structured graphs, embodied by popular data formats such as XML and JSON. However, many recent generative (probabilistic) models utilize neural networks to learn a probability distribution over undirected cyclic graphs. This assumption of a generic graph structure brings various computational challenges, and, more importantly, the presence of non-linearities in neural networks does not permit tractable probabilistic inference. We address these problems by proposing sum-product-set networks, an extension of probabilistic circuits from unstructured tensor data to tree-structured graph data. To this end, we use random finite sets to reflect a variable number of nodes and edges in the graph and to allow for exact and efficient inference. We demonstrate that our tractable model performs comparably to various intractable models based on neural networks.

Active Learning Efficiency Benchmark for Coreference Resolution Including Advanced Uncertainty Representations

Authors: Sahan, M., prof. Ing. Václav Šmídl, Ph.D., Watanabe, T., Ing. Radek Mařík, CSc.
Publication: 2023 2nd International Conference on Frontiers of Communications, Information System and Data Science. Los Alamitos: IEEE Computer Society, 2023. p. 40-47. ISBN 979-8-3503-8147-4.
Year: 2023

DOI: 10.1109/CISDS61173.2023.00016
Link: https://doi.org/10.1109/CISDS61173.2023.00016
Affiliation: Department of Telecommunication Engineering, Artificial Intelligence Center
Abstract:
Active learning is a powerful technique that accelerates model learning by iteratively expanding training data based on the model’s feedback. This approach has proven particularly relevant in natural language processing and other machine learning domains. While active learning has been extensively studied for conventional classification tasks, its application to more specialized tasks like neural coreference resolution has the potential for improvement. In our research, we present a significant advancement by applying active learning to the neural coreference problem, and setting a benchmark of 39% reduction in required annotations for training data. Simultaneously, it preserves performance compared to the original model trained on the full data. We compare various uncertainty sampling techniques along with Bayesian modifications of coreference resolution models, conducting a comprehensive analysis of annotation efforts. The results demonstrate that the best-performing techniques seek to maximize label annotation in previously chosen documents, showcasing their effectiveness and preserving performance.

Sum-Product-Set Networks

Authors: Ing. Milan Papež, Ph.D., Ing. Martin Rektoris, doc. Ing. Tomáš Pevný, Ph.D., prof. Ing. Václav Šmídl, Ph.D.
Publication: Proceedings of the 6th Workshop on Tractable Probabilistic Modeling. Massachusetts: OpenReview.net / University of Massachusetts, 2023.
Year: 2023

Affiliation: Artificial Intelligence Center
Abstract:
Daily internet communication relies heavily on tree-structured graphs, embodied by popular data formats such as XML and JSON. However, many recent generative (probabilistic) models utilize neural networks to learn a probability distribution over undirected cyclic graphs. This assumption of a generic graph structure brings various computational challenges, and, more importantly, the presence of non-linearities in neural networks does not permit tractable probabilistic inference. We address these problems by proposing sum-product-set networks, an extension of probabilistic circuits from unstructured tensor data to tree-structured graph data. To this end, we use random finite sets to reflect a variable number of nodes and edges in the graph and to allow for exact and efficient inference. We demonstrate that our tractable model performs comparably to various intractable models based on neural networks.

Batch Active Learning for Text Classification and Sentiment Analysis

Authors: Sahan, M., prof. Ing. Václav Šmídl, Ph.D., Ing. Radek Mařík, CSc.
Publication: CCRIS '22: Proceedings of the 2022 3rd International Conference on Control, Robotics and Intelligent System. New York: Association for Computing Machinery, 2022. p. 111-116. ISBN 978-1-4503-9685-1.
Year: 2022

DOI: 10.1145/3562007.3562028
Link: https://doi.org/10.1145/3562007.3562028
Affiliation: Department of Telecommunication Engineering, Artificial Intelligence Center
Abstract:
Supervised learning of classifiers for text classification and sentiment analysis relies on the availability of labels that may be either difficult or expensive to obtain. A standard procedure is to add labels to the training dataset sequentially by querying an annotator until the model reaches a satisfactory performance. Active learning is a process that optimizes the selection of unlabeled data records for which knowledge of the label would bring the highest discriminability of the dataset. Batch active learning generalizes single-instance active learning by selecting a batch of documents for labeling. This task is much more demanding because many different factors come into consideration (e.g., batch size and batch evaluation). In this paper, we provide a large-scale study by decomposing the existing algorithms into building blocks and systematically comparing meaningful combinations of these blocks with a subsequent evaluation on different text datasets. While each block is known (warm-start weight initialization, Dropout MC, entropy sampling, etc.), many of their combinations, such as Bayesian strategies with agglomerative clustering, are first proposed in our paper and show excellent performance. In particular, our extension of the warm-start method to batch active learning is among the top-performing strategies on all datasets. We studied the effect of this proposal by comparing the outcomes of varying distinct factors of an active learning algorithm. These factors include initialization of the algorithm, uncertainty representation, acquisition function, and batch selection strategy. Further, various combinations of these are tested on selected NLP problems with documents encoded using RoBERTa embeddings. Datasets cover context integrity (Gibberish Wackerow), fake news detection (Kaggle Fake News Detection), categorization of short texts by emotional context (Twitter Sentiment140), and sentiment classification (Amazon Reviews). Ultimately, we show that each of the active learning factors has advantages for certain datasets or experimental settings.

Comparison of Anomaly Detectors: Context Matters

Authors: Škvára, V., Franca, J., Ing. Matěj Zorek, prof. Ing. Václav Šmídl, Ph.D., doc. Ing. Tomáš Pevný, Ph.D.
Publication: IEEE Transactions on Neural Networks and Learning Systems. 2022, 33(6), 2494-2507. ISSN 2162-2388.
Year: 2022

DOI: 10.1109/TNNLS.2021.3116269
Link: https://doi.org/10.1109/TNNLS.2021.3116269
Affiliation: Department of Computer Science, Artificial Intelligence Center
Abstract:
Deep generative models are challenging the classical methods in the field of anomaly detection nowadays. Every newly published method provides evidence of outperforming its predecessors, sometimes with contradictory results. The objective of this article is twofold: to compare anomaly detection methods of various paradigms with a focus on deep generative models and identification of sources of variability that can yield different results. The methods were compared on popular tabular and image datasets. We identified that the main sources of variability are the experimental conditions: 1) the type of dataset (tabular or image) and the nature of anomalies (statistical or semantic) and 2) strategy of selection of hyperparameters, especially the number of available anomalies in the validation set. Methods perform differently in different contexts, i.e., under a different combination of experimental conditions together with computational time. This explains the variability of the previous results and highlights the importance of careful specification of the context in the publication of a new method. All our code and results are available for download.

General framework for binary classification on top samples

Authors: Adam, L., Mácha, V., prof. Ing. Václav Šmídl, Ph.D., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Optimization Methods and Software. 2022, 37(5), 1636-1667. ISSN 1029-4937.
Year: 2022

DOI: 10.1080/10556788.2021.1965601
Link: https://doi.org/10.1080/10556788.2021.1965601
Affiliation: Artificial Intelligence Center
Abstract:
Many binary classification problems minimize misclassification above (or below) a threshold. We show that instances of ranking problems, accuracy at the top, or hypothesis testing may be written in this form. We propose a general framework to handle these classes of problems and show which formulations (both known and newly proposed) fall into this framework. We provide a theoretical analysis of this framework and mention selected possible pitfalls the formulations may encounter. We show the convergence of the stochastic gradient descent for selected formulations even though the gradient estimate is inherently biased. We suggest several numerical improvements, including the implicit derivative and stochastic gradient descent. We provide an extensive numerical study.

Reducing the cost of fitting mixture models via stochastic sampling

Authors: Ing. Milan Papež, Ph.D., doc. Ing. Tomáš Pevný, Ph.D., prof. Ing. Václav Šmídl, Ph.D.
Publication: Proceedings of the 5th Workshop on Tractable Probabilistic Modeling. Eindhoven: Eindhoven University of Technology, 2022.
Year: 2022

Affiliation: Artificial Intelligence Center
Abstract:
Traditional methods for unsupervised learning of finite mixture models require evaluating the likelihood of all components of the mixture. This quickly becomes prohibitive when the components are abundant or expensive to compute. Therefore, we propose to apply a combination of the expectation-maximization and Metropolis-Hastings algorithms to evaluate only a small number of stochastically sampled components, thus substantially reducing the computational cost. The Markov chain of component assignments is sequentially generated across the algorithm's iterations, having a non-stationary target distribution whose parameters vary via a gradient-descent scheme. We put emphasis on the generality of our method, equipping it with the ability to train mixture models which involve complex, and possibly nonlinear, transformations. The performance of our method is illustrated on mixtures of normalizing flows.

Semi-supervised deep networks for plasma state identification

Authors: Ing. Matěj Zorek, Škvára, V., prof. Ing. Václav Šmídl, Ph.D., doc. Ing. Tomáš Pevný, Ph.D., Seidl, J., Grover, O.
Publication: Plasma Physics and Controlled Fusion. 2022, 64(12), 1-16. ISSN 1361-6587.
Year: 2022

DOI: 10.1088/1361-6587/ac9926
Link: https://doi.org/10.1088/1361-6587/ac9926
Affiliation: Department of Computer Science, Artificial Intelligence Center
Abstract:
Correct and timely detection of plasma confinement regimes and edge localized modes (ELMs) is important for improving the operation of tokamaks. Existing machine learning approaches detect these regimes as a form of post-processing of experimental data. Moreover, they are typically trained on a large dataset of tens of labeled discharges, which may be costly to build. We investigate the ability of current machine learning approaches to detect the confinement regime and ELMs with the smallest possible delay after the latest measurement. We also demonstrate that including unlabeled data into the training process can improve the results in a situation where only a limited set of reliable labels is available. All training and validation is performed on data from the COMPASS tokamak. The InceptionTime architecture trained using a semi-supervised approach was found to be the most accurate method based on the set of tested variants. It is able to achieve good overall accuracy of the regime classification at the time instant of 100 μs delayed behind the latest data record. We also evaluate the capability of the model to correctly predict class transitions. While ELM occurrence can be detected with a tolerance smaller than 50 μs, detection of the confinement regime transition is more demanding and it was successful with 2 ms tolerance. Sensitivity studies to different values of model parameters are provided. We believe that the achieved accuracy is acceptable in practice and the method could be used in real-time operation.

Active Learning for Text Classification and Fake News Detection

Authors: Sahan, M., prof. Ing. Václav Šmídl, Ph.D., Ing. Radek Mařík, CSc.
Publication: 2021 International Symposium on Computer Science and Intelligent Controls (ISCSIC). Los Alamitos: IEEE Computer Society, 2021. p. 87-94. ISBN 978-1-6654-1627-6.
Year: 2021

DOI: 10.1109/ISCSIC54682.2021.00027
Link: https://doi.org/10.1109/ISCSIC54682.2021.00027
Affiliation: Department of Telecommunication Engineering, Artificial Intelligence Center
Abstract:
Supervised classification of texts relies on the availability of reliable class labels for the training data. However, the process of collecting data labels can be complex and costly. A standard procedure is to add labels sequentially by querying an annotator until reaching satisfactory performance. Active learning is a process of selecting unlabeled data records for which the knowledge of the label would bring the highest discriminability of the dataset. In this paper, we provide a comparative study of various active learning strategies for different embeddings of the text on various datasets. We focus on Bayesian active learning methods that are used due to their ability to represent the uncertainty of the classification procedure. We compare three types of uncertainty representation: i) SGLD, ii) Dropout, and iii) deep ensembles. The latter two methods are tested in cold- and warm-start versions. The texts were embedded using the FastText, LASER, and RoBERTa encoding techniques. The methods are tested on two types of datasets: text categorization (Kaggle News Category and Twitter Sentiment140 datasets) and fake news detection (Kaggle Fake News and Fake News Detection datasets). We show that the conventional dropout Monte Carlo approach provides good results for the majority of the tasks. The ensemble methods provide a more accurate representation of uncertainty, which allows them to keep the pace of learning on a complicated problem as the number of requests grows, outperforming dropout in the long run. However, for the majority of the datasets, the active strategy using Dropout MC and Deep Ensembles achieved almost perfect performance even for a very low number of requests. The best results were obtained for the most recent embedding, RoBERTa.

Detection of Alfven Eigenmodes on COMPASS with Generative Neural Networks

Authors: Škvára, V., prof. Ing. Václav Šmídl, Ph.D., doc. Ing. Tomáš Pevný, Ph.D., Seidl, J., Havránek, A., Tskhakaya, D.
Publication: Fusion Science & Technology. 2020, 76(8), 962-971. ISSN 1536-1055.
Year: 2020

DOI: 10.1080/15361055.2020.1820805
Link: https://doi.org/10.1080/15361055.2020.1820805
Affiliation: Artificial Intelligence Center
Abstract:
Chirping Alfvén eigenmodes (AE) were observed at the COMPASS tokamak. They are believed to be driven by runaway electrons (RE) and as such, they provide a unique opportunity to study physics of non-linear interaction between RE and electromagnetic instabilities, including important topics of RE mitigation and losses. On COMPASS, they can be detected from spectrograms of certain magnetic probes. So far, their detection required a lot of manual effort since they occur rarely. We strive to automate this process using machine learning techniques based on generative neural networks. We present two different models that are trained using a smaller, manually labeled database and a larger unlabeled database from COMPASS experiments. On a number of experiments, we demonstrate that our approach is a viable option for automated detection of rare instabilities in tokamak plasma.

Neural Power Units

Authors: Heim, N., doc. Ing. Tomáš Pevný, Ph.D., prof. Ing. Václav Šmídl, Ph.D.
Publication: Advances in Neural Information Processing Systems 33 (NeurIPS 2020). Montreal: Neural Information Processing Society, 2020. ISSN 1049-5258.
Year: 2020

Affiliation: Artificial Intelligence Center
Abstract:
Conventional Neural Networks can approximate simple arithmetic operations, but fail to generalize beyond the range of numbers that were seen during training. Neural Arithmetic Units aim to overcome this difficulty, but current arithmetic units are either limited to operate on positive numbers or can only represent a subset of arithmetic operations. We introduce the Neural Power Unit (NPU) that operates on the full domain of real numbers and is capable of learning arbitrary power functions in a single layer. The NPU thus fixes the shortcomings of existing arithmetic units and extends their expressivity. We achieve this by using complex arithmetic without requiring a conversion of the network to complex numbers. A simplification of the unit to the RealNPU yields a highly transparent model. We show that the NPUs outperform their competitors in terms of accuracy and sparsity on artificial arithmetic datasets, and that the RealNPU can discover the governing equations of a dynamical system only from data.

Sum-Product-Transform Networks: Exploiting Symmetries using Invertible Transformations

Authors: doc. Ing. Tomáš Pevný, Ph.D., prof. Ing. Václav Šmídl, Ph.D., Trapp, M., Poláček, O., Oberhuber, T.
Publication: Proceedings of the 10th International Conference on Probabilistic Graphical Models. Proceedings of Machine Learning Research, 2020. p. 341-352. vol. 138. ISSN 2640-3498.
Year: 2020

Affiliation: Artificial Intelligence Center
Abstract:
We propose Sum-Product-Transform Networks (SPTN), an extension of sum-product networks that uses invertible transformations as additional internal nodes. The type and placement of transformations determine properties of the resulting SPTN, with many interesting special cases. Importantly, SPTNs with Gaussian leaves and affine transformations keep tractable the same inference tasks that can be computed efficiently in SPNs. We propose to store and optimize affine transformations in their SVD decompositions using an efficient parametrization of unitary matrices by a set of Givens rotations. Last but not least, we demonstrate that G-SPTNs push the state of the art on the density estimation task on the datasets used.

Rodent: Relevance determination in ODE

Authors: Heim, N., prof. Ing. Václav Šmídl, Ph.D., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of the Bayesian Deep Learning. Amsterdam: University of Amsterdam, 2019.
Year: 2019

Affiliation: Artificial Intelligence Center
Abstract:
From a set of observed trajectories of a partially observed system, we aim to learn its underlying (physical) process without having to make too many assumptions about the generating model. We start with a very general, over-parameterized ordinary differential equation (ODE) of order N and learn the minimal complexity of the model, by which we mean both the order of the ODE as well as the minimum number of non-zero parameters that are needed to solve the problem. The minimal complexity is found by combining the Variational Auto-Encoder (VAE) with Automatic Relevance Determination (ARD) to the problem of learning the parameters of an ODE which we call Rodent. We show that it is possible to learn not only one specific model for a single process, but a manifold of models representing harmonic signals in general.

Robust sparse linear regression for tokamak plasma boundary estimation using variational Bayes

Authors: Škvára, V., prof. Ing. Václav Šmídl, Ph.D., Urban, J.
Publication: Journal of Physics: Conference Series. Bristol: IOP Publishing Ltd, 2018. p. 2-13. vol. 1047. ISSN 1742-6596.
Year: 2018

DOI: 10.1088/1742-6596/1047/1/012015
Link: https://doi.org/10.1088/1742-6596/1047/1/012015
Affiliation: Artificial Intelligence Center
Abstract:
Precise control of the shape of plasma in a tokamak requires reliable reconstruction of the plasma boundary. The problem of boundary estimation can be reduced to a simple linear regression with a potentially infinite number of regressors. This regression problem poses some difficulties for classical methods. The selection of regressors significantly influences the reconstructed boundary. Also, the underlying model may not be valid during certain phases of the plasma discharge. A formal model structure estimation technique based on the automatic relevance principle yields a version of the sparse least squares estimator. In this contribution, we extend the previous method by relaxing the assumption of Gaussian noise and using Student's t-distribution instead. Such a model is less sensitive to potential outliers in the measurement. We show on simulations and real data that the proposed modification improves estimation of the plasma boundary in some stages of a plasma discharge. Performance of the resulting algorithm is evaluated with respect to a more detailed and computationally costly model which is considered to be the "ground truth". The results are also compared to those of Lasso and Tikhonov regularization techniques.
