doc. Ing. Tomáš Pevný, Ph.D.

Bias Detection via Maximum Subgroup Discrepancy

Authors: Ing. Jiří Němeček, Kozdoba, M., Kryvoviaz, I., doc. Ing. Tomáš Pevný, Ph.D., Mgr. Jakub Mareček, Ph.D.
Publication: KDD '25: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. New York: Association for Computing Machinery, 2025. p. 2174-2185. ISBN 979-8-4007-1454-2.
Year: 2025

DOI: 10.1145/3711896.3736857
Link: https://doi.org/10.1145/3711896.3736857
Department: Artificial Intelligence Center
Abstract:
Bias evaluation is fundamental to trustworthy AI, both in terms of checking data quality and in terms of checking the outputs of AI systems. In testing data quality, for example, one may study the distance of a given dataset, viewed as a distribution, to a given ground-truth reference dataset. However, classical metrics, such as the Total Variation and the Wasserstein distances, are known to have high sample complexities and, therefore, may fail to provide a meaningful distinction in many practical scenarios. In this paper, we propose a new notion of distance, the Maximum Subgroup Discrepancy (MSD). In this metric, two distributions are close if, roughly, discrepancies are low for all feature subgroups. While the number of subgroups may be exponential, we show that the sample complexity is linear in the number of features, thus making it feasible for practical applications. Moreover, we provide a practical algorithm for evaluating the distance based on Mixed-integer optimization (MIO). We also note that the proposed distance is easily interpretable, thus providing clearer paths to fixing the biases once they have been identified. Finally, we describe a natural general bias detection framework, termed MSDD distances, and show that MSD aligns well with this framework. We empirically evaluate MSD by comparing it with other metrics and by demonstrating the above properties of MSD on real-world datasets.
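
The idea of checking discrepancies over all feature subgroups can be illustrated with a brute-force sketch. It assumes, for illustration only, that a subgroup is a conjunction of feature values and that its discrepancy is the absolute difference of its relative frequency in two samples; the paper's exact definition and its MIO-based algorithm are not reproduced here, and the function name `max_subgroup_discrepancy` is hypothetical. The enumeration is exponential in the number of features, so it is only usable on toy data.

```python
from itertools import combinations, product


def max_subgroup_discrepancy(sample_p, sample_q):
    """Return (largest frequency gap, subgroup attaining it) by brute force.

    Assumption: a subgroup is a conjunction {feature index: value}.
    Samples are lists of equal-length tuples of categorical values.
    """
    n_features = len(sample_p[0])
    # Values observed for each feature in either sample.
    domains = [sorted({row[j] for row in sample_p + sample_q})
               for j in range(n_features)]
    best, best_subgroup = 0.0, {}
    for k in range(1, n_features + 1):
        for features in combinations(range(n_features), k):
            for values in product(*(domains[j] for j in features)):
                def covers(row):
                    return all(row[j] == v for j, v in zip(features, values))
                freq_p = sum(map(covers, sample_p)) / len(sample_p)
                freq_q = sum(map(covers, sample_q)) / len(sample_q)
                gap = abs(freq_p - freq_q)
                if gap > best:
                    best, best_subgroup = gap, dict(zip(features, values))
    return best, best_subgroup


reference = [(0, 0), (0, 1), (1, 0), (1, 1)]
observed = [(0, 0), (0, 0), (1, 0), (1, 1)]
gap, subgroup = max_subgroup_discrepancy(reference, observed)
print(gap, subgroup)
```

The returned subgroup is what makes such a distance interpretable: it names the feature values on which the two samples disagree most.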

Generating Likely Counterfactuals Using Sum-Product Networks

Authors: Ing. Jiří Němeček, doc. Ing. Tomáš Pevný, Ph.D., Mgr. Jakub Mareček, Ph.D.
Publication: The Thirteenth International Conference on Learning Representations (ICLR 2025). International Conference on Learning Representations, 2025. p. 74233-74264. ISBN 9798331320850.
Year: 2025

Department: Artificial Intelligence Center
Abstract:
The need to explain decisions made by AI systems is driven by both recent regulation and user demand. The decisions are often explainable only post hoc. In counterfactual explanations, one may ask what constitutes the best counterfactual explanation. Clearly, multiple criteria must be taken into account, although "distance from the sample" is a key criterion. Recent methods that consider the plausibility of a counterfactual seem to sacrifice this original objective. Here, we present a system that provides high-likelihood explanations that are, at the same time, close and sparse. We show that the search for the most likely explanations satisfying many common desiderata for counterfactual explanations can be modeled using Mixed-Integer Optimization (MIO). We use a Sum-Product Network (SPN) to estimate the likelihood of a counterfactual. To achieve that, we propose an MIO formulation of an SPN, which can be of independent interest. The source code with examples is available at https://github.com/Epanemu/LiCE.

State Encodings for GNN-Based Lifted Planners

Authors: Ing. Rostislav Horčík, Ph.D., Ing. Gustav Šír, Ph.D., Šimek, V., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of the 39th AAAI Conference on Artificial Intelligence. Menlo Park: AAAI Press, 2025. p. 26525-26533. vol. 39. ISSN 2159-5399.
Year: 2025

DOI: 10.1609/aaai.v39i25.34853
Link: https://doi.org/10.1609/aaai.v39i25.34853
Department: Artificial Intelligence Center, Intelligent Data Analysis
Abstract:
The application of graph neural networks (GNNs) to learn heuristic functions in classical planning is gaining traction. Despite the variety of methods proposed in the literature to encode classical planning tasks for GNNs, a comparative study evaluating their relative performances has been lacking. Moreover, some encodings have been assessed solely for their expressiveness rather than practical effectiveness in planning. This paper provides an extensive comparative analysis of existing encodings. Our results indicate that the smallest encoding based on Gaifman graphs, not yet applied in planning, outperforms the rest due to its fast evaluation times and the informativeness of the resulting heuristic. The overall coverage measured on the IPC almost reaches that of the state-of-the-art planner LAMA while exhibiting rather complementary strengths across different domains.

Classification with Costly Features in Hierarchical Deep Sets

Authors: Janisch, J., doc. Ing. Tomáš Pevný, Ph.D., doc. Mgr. Viliam Lisý, MSc., Ph.D.
Publication: Machine Learning. 2024, 113(7), 4487-4522. ISSN 0885-6125.
Year: 2024

DOI: 10.1007/s10994-024-06565-4
Link: https://doi.org/10.1007/s10994-024-06565-4
Department: Artificial Intelligence Center
Abstract:
Classification with costly features (CwCF) is a classification problem that includes the cost of features in the optimization criteria. Individually for each sample, its features are sequentially acquired to maximize accuracy while minimizing the acquired features' cost. However, existing approaches can only process data that can be expressed as vectors of fixed length. In real life, the data often possesses rich and complex structure, which can be more precisely described with formats such as XML or JSON. The data is hierarchical and often contains nested lists of objects. In this work, we extend an existing deep reinforcement learning-based algorithm with hierarchical deep sets and hierarchical softmax, so that it can directly process this data. The extended method has greater control over which features it can acquire and, in experiments with seven datasets, we show that this leads to superior performance. To showcase the real usage of the new method, we apply it to a real-life problem of classifying malicious web domains, using an online service.

Deep anomaly detection on set data: Survey and comparison

Authors: Ing. Michaela Mašková, Ing. Matěj Zorek, doc. Ing. Tomáš Pevný, Ph.D., prof. Ing. Václav Šmídl, Ph.D.
Publication: Pattern Recognition. 2024, 151. ISSN 0031-3203.
Year: 2024

DOI: 10.1016/j.patcog.2024.110381
Link: https://doi.org/10.1016/j.patcog.2024.110381
Department: Department of Computer Science, Artificial Intelligence Center
Abstract:
Detecting anomalous samples in set data is a problem attracting increased interest due to novel modalities, such as point-cloud data produced by lidars. Novel methods, including those based on deep neural networks, are often tuned for a single purpose, which makes it hard to judge how relevant they are for other purposes or application domains. The aim of this survey is to: (i) review elementary concepts of anomaly detection of set data, (ii) identify the building blocks of deep anomaly detectors, and (iii) analyze the impact of these blocks on performance. The impact is studied in a large experimental comparison on a variety of benchmark datasets. The results reveal that the main factor determining the performance is the type of anomalies in the dataset. While deep methods embedding the whole set to a single fixed vector perform well on point cloud data, the methods embedding each feature vector independently are better for datasets from multi-instance learning. Moreover, sophisticated methods utilizing transformer blocks are frequently inferior to simple models with properly optimized hyperparameters. An independent factor in performance is the cardinality of sets, the proper treatment of which remains an open problem, as the existing analytical solution was found to be inaccurate.

GraphSPNs: Sum-Product Networks Benefit From Canonical Orderings

Authors: Ing. Milan Papež, Ph.D., Ing. Martin Rektoris, prof. Ing. Václav Šmídl, Ph.D., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of the 7th Workshop on Tractable Probabilistic Modeling. Massachusetts: OpenReview.net / University of Massachusetts, 2024.
Year: 2024

Department: Department of Computer Science, Artificial Intelligence Center
Abstract:
Deep generative models have recently made remarkable progress in capturing complex probability distributions over graphs. However, they are intractable and thus unable to answer even the most basic probabilistic inference queries without resorting to approximations. Therefore, we propose graph sum-product networks (GraphSPNs), a tractable deep generative model which provides exact and efficient inference over (arbitrary parts of) graphs. We investigate different principles to make SPNs permutation invariant. We demonstrate that GraphSPNs are able to (conditionally) generate novel and chemically valid molecular graphs, being competitive with, and sometimes even better than, existing intractable models. We find that (Graph)SPNs benefit from ensuring the permutation invariance via canonical ordering.

Malicious Internet Entity Detection Using Local Graph Inference

Authors: Mandlík, Š., doc. Ing. Tomáš Pevný, Ph.D., prof. Ing. Václav Šmídl, Ph.D., Bajer, L.
Publication: IEEE Transactions on Information Forensics and Security. 2024, 19, 3554-3566. ISSN 1556-6013.
Year: 2024

DOI: 10.1109/TIFS.2024.3360867
Link: https://doi.org/10.1109/TIFS.2024.3360867
Department: Artificial Intelligence Center
Abstract:
Detection of malicious behavior in a large network is a challenging problem for machine learning in computer security, since it requires a model with high expressive power and scalable inference. Existing solutions struggle to achieve this: current cybersec-tailored approaches are still limited in expressivity, and methods successful in other domains do not scale well for large volumes of data, rendering frequent retraining impossible. This work proposes a new perspective for learning from graph data that models network entity interactions as a large heterogeneous graph. High expressivity of the method is achieved with the neural network architecture HMILnet, which naturally models this type of data and provides theoretical guarantees. The scalability is achieved by pursuing local graph inference, i.e., classifying individual vertices and their neighborhood as independent samples. Our experiments exhibit improvement over the state-of-the-art Probabilistic Threat Propagation (PTP) algorithm, show a further threefold accuracy improvement when additional data is used, which is not possible with the PTP algorithm, and demonstrate the generalization capabilities of the method to new, previously unseen entities.

NASimEmu: Network Attack Simulator & Emulator for Training Agents Generalizing to Novel Scenarios

Authors: Janisch, J., doc. Ing. Tomáš Pevný, Ph.D., doc. Mgr. Viliam Lisý, MSc., Ph.D.
Publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer Science and Business Media Deutschland GmbH, 2024. p. 589-608. vol. 14399. ISSN 0302-9743. ISBN 978-3-031-54128-5.
Year: 2024

DOI: 10.1007/978-3-031-54129-2_35
Link: https://doi.org/10.1007/978-3-031-54129-2_35
Department: Artificial Intelligence Center
Abstract:
Current frameworks for training offensive penetration testing agents with deep reinforcement learning struggle to produce agents that perform well in real-world scenarios, due to the reality gap in simulation-based frameworks and the lack of scalability in emulation-based frameworks. Additionally, existing frameworks often use an unrealistic metric that measures the agents' performance on the training data. NASimEmu, a new framework introduced in this paper, addresses these issues by providing both a simulator and an emulator with a shared interface. This approach allows agents to be trained in simulation and deployed in the emulator, thus verifying the realism of the used abstraction. Our framework promotes the development of general agents that can transfer to novel scenarios unseen during their training. For the simulation part, we adopt an existing simulator NASim and enhance its realism. The emulator is implemented with industry-level tools, such as Vagrant, VirtualBox, and Metasploit. Experiments demonstrate that a simulation-trained agent can be deployed in emulation, and we show how to use the framework to train a general agent that transfers into novel, structurally different scenarios. NASimEmu is available as open-source.

On the Economics of Adversarial Machine Learning

Authors: Merkle, F., Samsinger, M., Schottle, P., doc. Ing. Tomáš Pevný, Ph.D.
Publication: IEEE Transactions on Information Forensics and Security. 2024, 19, 4670-4685. ISSN 1556-6013.
Year: 2024

DOI: 10.1109/TIFS.2024.3379829
Link: https://doi.org/10.1109/TIFS.2024.3379829
Department: Artificial Intelligence Center
Abstract:
Given the widespread deployment of machine learning algorithms, the security of these algorithms and, thus, the field of adversarial machine learning have gained popularity in the research community. In this article, we loosen several unrealistic restrictions found in prior art and bring economically inspired adversarial machine learning one step closer to being applicable in the real world. First, we extend our own game-theoretical framework such that it allows an arbitrary number of actions for both actors, and analytically determine equilibrium strategies and conditions where mixed strategies are expected for the specific case in which both actors choose from any two arbitrary actions. Then, we pay special attention to an adversary's knowledge about the attacked system by modeling them as a white-, gray-, or black-box adversary. We conduct extensive experiments for three architectures, two training procedures, and four adversarial attacks in different variations as direct and transfer attacks, resulting in 300 data points consisting of the respective accuracy and robustness values and the computational costs for both actors. We then instantiate our model with this data and explore the structure of the game for a wide range of each game parameter, overcoming the complexity by applying algorithmic game theory. We discover surprising properties in the actors' strategies, such as the feasibility of cheap attacks that have been dismissed as practically irrelevant so far; examples include universal adversarial perturbations or (transfer) attacks utilizing only a few optimization steps. For the defender, we find that, given recent attacks and countermeasures, a rational defender would try to hide as much as possible about their infrastructure.

Sum-Product-Set Networks: Deep Tractable Models for Tree-Structured Graphs

Authors: Ing. Milan Papež, Ph.D., Ing. Martin Rektoris, prof. Ing. Václav Šmídl, Ph.D., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024). International Conference on Learning Representations, 2024. ISBN 9781713898658.
Year: 2024

Department: Artificial Intelligence Center
Abstract:
Daily internet communication relies heavily on tree-structured graphs, embodied by popular data formats such as XML and JSON. However, many recent generative (probabilistic) models utilize neural networks to learn a probability distribution over undirected cyclic graphs. This assumption of a generic graph structure brings various computational challenges, and, more importantly, the presence of non-linearities in neural networks does not permit tractable probabilistic inference. We address these problems by proposing sum-product-set networks, an extension of probabilistic circuits from unstructured tensor data to tree-structured graph data. To this end, we use random finite sets to reflect a variable number of nodes and edges in the graph and to allow for exact and efficient inference. We demonstrate that our tractable model performs comparably to various intractable models based on neural networks.

Heuristic Search Optimisation Using Planning and Curriculum Learning Techniques

Authors: Chrestien, L., doc. Ing. Tomáš Pevný, Ph.D., Ing. Antonín Lištiak Komenda, Ph.D., Dr. Stefan Edelkamp
Publication: Progress in Artificial Intelligence, 22nd EPIA Conference on Artificial Intelligence, EPIA 2023, Faial Island, Azores, September 5–8, 2023, Proceedings, Part I. Springer, Cham, 2023. p. 495-507. vol. 1. ISSN 0302-9743. ISBN 978-3-031-49007-1.
Year: 2023

DOI: 10.1007/978-3-031-49008-8_39
Link: https://doi.org/10.1007/978-3-031-49008-8_39
Department: Artificial Intelligence Center
Abstract:
Learning a well-informed heuristic function for hard planning domains is an elusive problem. Although there are known neural network architectures to represent such heuristic knowledge, it is not obvious what concrete information is learned and whether techniques aimed at understanding the structure help in improving the quality of the heuristics. This paper presents a network model that learns a heuristic function capable of relating distant parts of the state space via optimal plan imitation using the attention mechanism which drastically improves the learning of a good heuristic function. The learning of this heuristic function is further improved by the use of curriculum learning, where newly solved problem instances are added to the training set, which, in turn, helps to solve problems of higher complexities and train from harder problem instances. The methodologies used in this paper far exceed the performances of all existing baselines including known deep learning approaches and classical planning heuristics. We demonstrate its effectiveness and success on grid-type PDDL domains, namely Sokoban, maze-with-teleports and sliding tile puzzles.

Leveraging Data Geometry to Mitigate CSM in Steganalysis

Authors: Abecidan, R., Itier, V., Boulanger, J., Bas, P., doc. Ing. Tomáš Pevný, Ph.D.
Publication: IEEE International Workshop on Information Forensics and Security. New Jersey: IEEE Signal Processing Society, 2023. ISSN 2157-4774. ISBN 979-8-3503-2491-4.
Year: 2023

DOI: 10.1109/WIFS58808.2023.10374944
Link: https://doi.org/10.1109/WIFS58808.2023.10374944
Department: Artificial Intelligence Center
Abstract:
In operational scenarios, steganographers use sets of covers from various sensors and processing pipelines that differ significantly from those used by researchers to train steganalysis models. This leads to an inevitable performance gap when dealing with out-of-distribution covers, commonly referred to as Cover Source Mismatch (CSM). In this study, we consider the scenario where test images are processed using the same pipeline. However, knowledge regarding both the labels and the balance between cover and stego is missing. Our objective is to identify a training dataset that allows for maximum generalization to our target. By exploring a grid of processing pipelines fostering CSM, we discovered a geometrical metric, based on the chordal distance between subspaces spanned by DCTr features, which exhibits high correlation with operational regret while being unaffected by the cover-stego balance. Our contribution lies in the development of a strategy that enables the selection or derivation of customized training datasets, enhancing the overall generalization performance for a given target. Experimental validation highlights that our geometry-based optimization strategy outperforms traditional atomistic methods given reasonable assumptions. Additional resources are available at github.com/RonyAbecidan/LeveragingGeometrytoMitigateCSM.

Optimize Planning Heuristics to Rank, not to Estimate Cost-to-Goal.

Authors: Chrestien, L., doc. Ing. Tomáš Pevný, Ph.D., Dr. Stefan Edelkamp, Ing. Antonín Lištiak Komenda, Ph.D.
Publication: Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Montreal: Neural Information Processing Society, 2023. vol. 36. ISSN 1049-5258.
Year: 2023

Department: Artificial Intelligence Center
Abstract:
In imitation learning for planning, parameters of heuristic functions are optimized against a set of solved problem instances. This work revisits the necessary and sufficient conditions of strictly optimally efficient heuristics for forward search algorithms, mainly A* and greedy best-first search, which expand only states on the returned optimal path. It then proposes a family of loss functions based on ranking tailored for a given variant of the forward search algorithm. Furthermore, from a learning theory point of view, it discusses why optimizing cost-to-goal h* is unnecessarily difficult. The experimental comparison on a diverse set of problems unequivocally supports the derived theory.
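
To make the ranking idea concrete, the following minimal sketch uses a generic pairwise logistic loss; it is an assumption chosen for illustration and not the specific loss family proposed in the paper. The hypothetical `ranking_loss` takes scores of states on the optimal path and of states off the path (for example heuristic values for greedy best-first search) and penalises every pair in which the on-path state is not scored clearly lower.

```python
import math


def ranking_loss(on_path_scores, off_path_scores):
    """Mean pairwise logistic loss; low when on-path states score lower."""
    total = 0.0
    for s_on in on_path_scores:
        for s_off in off_path_scores:
            # log(1 + exp(s_on - s_off)) shrinks as s_on drops below s_off.
            total += math.log1p(math.exp(s_on - s_off))
    return total / (len(on_path_scores) * len(off_path_scores))


well_ranked = ranking_loss([1.0, 2.0], [5.0, 6.0])
mis_ranked = ranking_loss([5.0, 6.0], [1.0, 2.0])
shifted = ranking_loss([101.0, 102.0], [105.0, 106.0])
print(well_ranked < mis_ranked, abs(well_ranked - shifted) < 1e-9)
```

Because the loss depends only on score differences, shifting all scores by a constant leaves it unchanged, which illustrates why a heuristic can rank states correctly without matching the true cost-to-goal values.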

Sum-Product-Set Networks

Authors: Ing. Milan Papež, Ph.D., Ing. Martin Rektoris, doc. Ing. Tomáš Pevný, Ph.D., prof. Ing. Václav Šmídl, Ph.D.
Publication: Proceedings of the 6th Workshop on Tractable Probabilistic Modeling. Massachusetts: OpenReview.net / University of Massachusetts, 2023.
Year: 2023

Department: Artificial Intelligence Center
Abstract:
Daily internet communication relies heavily on tree-structured graphs, embodied by popular data formats such as XML and JSON. However, many recent generative (probabilistic) models utilize neural networks to learn a probability distribution over undirected cyclic graphs. This assumption of a generic graph structure brings various computational challenges, and, more importantly, the presence of non-linearities in neural networks does not permit tractable probabilistic inference. We address these problems by proposing sum-product-set networks, an extension of probabilistic circuits from unstructured tensor data to tree-structured graph data. To this end, we use random finite sets to reflect a variable number of nodes and edges in the graph and to allow for exact and efficient inference. We demonstrate that our tractable model performs comparably to various intractable models based on neural networks.

The Non-Zero-Sum Game of Steganography in Heterogeneous Environments

Authors: Giboulot, Q., doc. Ing. Tomáš Pevný, Ph.D., Ker, A.D.
Publication: IEEE Transactions on Information Forensics and Security. 2023, 18, 4436-4448. ISSN 1556-6013.
Year: 2023

DOI: 10.1109/TIFS.2023.3295945
Link: https://doi.org/10.1109/TIFS.2023.3295945
Department: Artificial Intelligence Center
Abstract:
The highly heterogeneous nature of images found in real-world environments, such as online sharing platforms, has been one of the long-standing obstacles to the transition of steganalysis techniques outside the laboratory. Recent advances in identifying the properties of images relevant to steganalysis as well as the effectiveness of deep neural networks on highly heterogeneous datasets have laid some groundwork for resolving this problem. Despite this progress, we argue that the way the game played between the steganographer and the steganalyst is currently modeled lacks some important features expected in a real-world environment: 1) the steganographer can adapt her cover source choice to the environment and/or to the steganalyst's classifier, 2) the distribution of cover sources in the environment impacts the optimal threshold for a given classifier, and 3) the steganalyst and steganographer have different goals, hence different utilities. We propose to take these facts into account using a two-player non-zero-sum game constrained by an environment composed of multiple cover sources. We then show how to convert this non-zero-sum game into an equivalent zero-sum game, allowing us to propose two methods to find Nash equilibria for this game: a standard method using the double oracle algorithm and a minimum regret method based on approximating a set of atomistic classifiers. Applying these methods to contemporary steganography and steganalysis in a realistic environment, we show that classifiers which do not adapt to the environment severely underperform when the steganographer is allowed to select into which cover source to embed.

Backpack: A Backpropagable Adversarial Embedding Scheme

Authors: Bernard, S., Bas, P., Klein, J., doc. Ing. Tomáš Pevný, Ph.D.
Publication: IEEE Transactions on Information Forensics and Security. 2022, 17, 3539-3554. ISSN 1556-6013.
Year: 2022

DOI: 10.1109/TIFS.2022.3204218
Link: https://doi.org/10.1109/TIFS.2022.3204218
Department: Artificial Intelligence Center
Abstract:
A minmax protocol offers a general method to automatically optimize a steganographic algorithm against a wide class of steganalytic detectors. The quality of the resulting steganographic algorithm depends on the ability to find an 'adversarial' stego image undetectable by a set of detectors while communicating a given message. Although the minmax protocol instantiated with the ADV-EMB scheme leads to unexpectedly good results, we show that it suffers from a significant flaw, and we present a theoretically sound solution called Backpack. Extensive experimental verification of the minmax protocol with Backpack shows superior performance to ADV-EMB, demonstrates the generality of the tool by targeting a new JPEG QF100 compatibility attack, and further improves the security of steganographic algorithms.

Comparison of Anomaly Detectors: Context Matters

Authors: Škvára, V., Ing. Jan Franců, Ing. Matěj Zorek, prof. Ing. Václav Šmídl, Ph.D., doc. Ing. Tomáš Pevný, Ph.D.
Publication: IEEE Transactions on Neural Networks and Learning Systems. 2022, 33(6), 2494-2507. ISSN 2162-2388.
Year: 2022

DOI: 10.1109/TNNLS.2021.3116269
Link: https://doi.org/10.1109/TNNLS.2021.3116269
Department: Department of Computer Science, Artificial Intelligence Center
Abstract:
Deep generative models are challenging the classical methods in the field of anomaly detection nowadays. Every newly published method provides evidence of outperforming its predecessors, sometimes with contradictory results. The objective of this article is twofold: to compare anomaly detection methods of various paradigms, with a focus on deep generative models, and to identify sources of variability that can yield different results. The methods were compared on popular tabular and image datasets. We identified that the main sources of variability are the experimental conditions: 1) the type of dataset (tabular or image) and the nature of anomalies (statistical or semantic) and 2) the strategy for selecting hyperparameters, especially the number of available anomalies in the validation set. Methods perform differently in different contexts, i.e., under a different combination of experimental conditions together with computational time. This explains the variability of the previous results and highlights the importance of careful specification of the context in the publication of a new method. All our code and results are available for download.

Formalizing Cover-source Mismatch as a Robust Optimization

Authors: Šepák, D., Adam, L., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of European Signal Processing Conference. Belgrade: EUSIPCO, 2022. p. 1042-1046. ISBN 978-1-6654-6798-8.
Year: 2022

Department: Artificial Intelligence Center
Abstract:
Cover-source mismatch (CSM) refers to the use of a steganographic detector on images whose probability distribution differs substantially from the one it has been trained on. This can have a detrimental effect on its accuracy, preventing the use of modern steganalytic tools outside laboratories. Despite CSM being introduced almost fifteen years ago, there is no formal definition and no adopted measures for comparing different solutions. This work, therefore, formalizes the cover-source mismatch and proposes and discusses possible error measures. Equipped with these tools, we propose a principled approach to train holistic detectors while minimizing the effects of CSM and experimentally compare them to the prior art, discussing their strengths and weaknesses.

General framework for binary classification on top samples

Authors: Adam, L., Mácha, V., prof. Ing. Václav Šmídl, Ph.D., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Optimization Methods and Software. 2022, 37(5), 1636-1667. ISSN 1029-4937.
Year: 2022

DOI: 10.1080/10556788.2021.1965601
Link: https://doi.org/10.1080/10556788.2021.1965601
Department: Artificial Intelligence Center
Abstract:
Many binary classification problems minimize misclassification above (or below) a threshold. We show that instances of ranking problems, accuracy at the top, or hypothesis testing may be written in this form. We propose a general framework to handle these classes of problems and show which formulations (both known and newly proposed) fall into this framework. We provide a theoretical analysis of this framework and mention selected possible pitfalls the formulations may encounter. We show the convergence of the stochastic gradient descent for selected formulations even though the gradient estimate is inherently biased. We suggest several numerical improvements, including the implicit derivative and stochastic gradient descent. We provide an extensive numerical study.

JsonGrinder.jl: Automated Differentiable Neural Architecture for Embedding Arbitrary JSON Data

Authors: Mandlík, Š., Račinský, M., doc. Mgr. Viliam Lisý, MSc., Ph.D., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Journal of Machine Learning Research. 2022, 23, 1-5. ISSN 1532-4435.
Year: 2022

Department: Artificial Intelligence Center
Abstract:
Standard machine learning (ML) problems are formulated on data converted into a suitable tensor representation. However, there are data sources, for example in cybersecurity, that are naturally represented in a unifying hierarchical structure, such as XML, JSON, and Protocol Buffers. Converting this data to a tensor representation is usually done by manual feature engineering, which is laborious, lossy, and prone to bias originating from the human inability to correctly judge the importance of particular features. JsonGrinder.jl is a library automating various ML tasks on these difficult sources. Starting with an arbitrary set of JSON samples, it automatically creates a differentiable ML model (called hmilnet), which embeds raw JSON samples into a fixed-size tensor representation. This embedding network can be naturally extended by an arbitrary ML model expecting tensor inputs in order to perform classification, regression, or clustering.

Reducing the cost of fitting mixture models via stochastic sampling

Authors: Ing. Milan Papež, Ph.D., doc. Ing. Tomáš Pevný, Ph.D., prof. Ing. Václav Šmídl, Ph.D.
Publication: Proceedings of the 5th Workshop on Tractable Probabilistic Modeling. Eindhoven: Eindhoven University of Technology, 2022.
Year: 2022

Department: Artificial Intelligence Center
Abstract:
Traditional methods for unsupervised learning of finite mixture models require evaluating the likelihood of all components of the mixture. This quickly becomes prohibitive when the components are abundant or expensive to compute. Therefore, we propose to apply a combination of the expectation maximization and the Metropolis-Hastings algorithm to evaluate only a small number of stochastically sampled components, thus substantially reducing the computational cost. The Markov chain of component assignments is sequentially generated across the algorithm's iterations, having a non-stationary target distribution whose parameters vary via a gradient-descent scheme. We put emphasis on the generality of our method, equipping it with the ability to train mixture models which involve complex, and possibly nonlinear, transformations. The performance of our method is illustrated on mixtures of normalizing flows.

Semi-supervised deep networks for plasma state identification

Authors: Ing. Matěj Zorek, Škvára, V., prof. Ing. Václav Šmídl, Ph.D., doc. Ing. Tomáš Pevný, Ph.D., Seidl, J., Grover, O.
Publication: Plasma Physics and Controlled Fusion. 2022, 64(12), 1-16. ISSN 1361-6587.
Year: 2022

DOI: 10.1088/1361-6587/ac9926
Link: https://doi.org/10.1088/1361-6587/ac9926
Department: Department of Computer Science, Artificial Intelligence Center
Abstract:
Correct and timely detection of plasma confinement regimes and edge localized modes (ELMs) is important for improving the operation of tokamaks. Existing machine learning approaches detect these regimes as a form of post-processing of experimental data. Moreover, they are typically trained on a large dataset of tens of labeled discharges, which may be costly to build. We investigate the ability of current machine learning approaches to detect the confinement regime and ELMs with the smallest possible delay after the latest measurement. We also demonstrate that including unlabeled data into the training process can improve the results in a situation where only a limited set of reliable labels is available. All training and validation is performed on data from the COMPASS tokamak. The InceptionTime architecture trained using a semi-supervised approach was found to be the most accurate method based on the set of tested variants. It is able to achieve good overall accuracy of the regime classification at the time instant of 100 μs delayed behind the latest data record. We also evaluate the capability of the model to correctly predict class transitions. While ELM occurrence can be detected with a tolerance smaller than 50 μs, detection of the confinement regime transition is more demanding and it was successful with 2 ms tolerance. Sensitivity studies to different values of model parameters are provided. We believe that the achieved accuracy is acceptable in practice and the method could be used in real-time operation.

Using Set Covering to Generate Databases for Holistic Steganalysis

Authors: Abecidan, R., Itier, V., Boulanger, J., Bas, P., doc. Ing. Tomáš Pevný, Ph.D.
Publication: IEEE International Workshop on Information Forensics and Security. Institute of Electrical and Electronics Engineers, Inc., 2022. ISSN 2157-4766. ISBN 979-8-3503-0967-6.
Year: 2022

DOI: 10.1109/WIFS55849.2022.9975430
Link: https://doi.org/10.1109/WIFS55849.2022.9975430
Department: Artificial Intelligence Center
Abstract:
Within an operational framework, covers used by a steganographer are likely to come from different sensors and different processing pipelines than the ones used by researchers for training their steganalysis models. Thus, a performance gap is unavoidable when it comes to out-of-distribution covers, an extremely frequent scenario called Cover Source Mismatch (CSM). Here, we explore a grid of processing pipelines to study the origins of CSM, to better understand it, and to better tackle it. A set-covering greedy algorithm is used to select representative pipelines minimizing the maximum regret between the representative and the pipelines within the set. Our main contribution is a methodology for generating relevant bases able to tackle operational CSM. Experimental validation highlights that, for a given number of training samples, our set covering selection is a better strategy than selecting random pipelines or using all the available pipelines. Our analysis also shows that parameters such as denoising, sharpening, and downsampling are very important to foster diversity. Finally, different benchmarks for classical and wild databases show the good generalization property of the extracted databases.
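
The greedy set-covering selection mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the regret matrix, the threshold, and the rule that pipeline i "covers" pipeline j when regret[i][j] is at most the threshold are all assumptions made here for illustration.

```python
def greedy_cover(regret, threshold):
    """Greedily pick representative pipelines (hypothetical helper).

    regret[i][j] is assumed to be the regret of a detector trained on
    pipeline i when tested on pipeline j. Pipeline i covers pipeline j
    when regret[i][j] <= threshold.
    """
    n = len(regret)
    covers = [{j for j in range(n) if regret[i][j] <= threshold} for i in range(n)]
    uncovered = set(range(n))
    chosen = []
    while uncovered:
        # Pick the pipeline that covers the most still-uncovered pipelines.
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            break  # the rest cannot be covered at this threshold
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Made-up regret values for four pipelines.
regret = [
    [0.00, 0.02, 0.03, 0.30],
    [0.02, 0.00, 0.25, 0.28],
    [0.04, 0.20, 0.00, 0.27],
    [0.31, 0.29, 0.26, 0.00],
]
chosen, uncovered = greedy_cover(regret, threshold=0.05)
print(chosen, uncovered)
```

With these made-up values, pipeline 0 represents pipelines 0 to 2 and pipeline 3 has to represent itself.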

Explicit Optimization of min max Steganographic Game

Authors: Bernard, S., Bas, P., Klein, J., doc. Ing. Tomáš Pevný, Ph.D.
Publication: IEEE Transactions on Information Forensics and Security. 2021, 2020(16), 812-823. ISSN 1556-6013.
Year: 2021

DOI: 10.1109/TIFS.2020.3021913
Link: https://doi.org/10.1109/TIFS.2020.3021913
Department: Artificial Intelligence Center
Abstract:
This article proposes an algorithm which allows Alice to simulate the game played between her and Eve. Under the condition that the set of detectors that Alice assumes Eve to have is sufficiently rich (e.g., CNNs), and that she has an algorithm that enables her to avoid detection by a single classifier (e.g., adversarial embedding, Gibbs sampler, dynamic STCs), the proposed algorithm converges to an efficient steganographic algorithm. This is possible by using a min max strategy which consists at each iteration in selecting the least detectable stego image for the best classifier among the set of Eve's learned classifiers. The algorithm is extensively evaluated and compared to prior art, and results show the potential to increase the practical security of classical steganographic methods. For example, the error probability P-err of XU-Net on detecting stego images with payload of 0.4 bpnzAC embedded by J-Uniward and QF 75 starts at 7.1% and is increased by +13.6% to reach 20.7% after eight iterations. For the same embedding rate and for QF 95, undetectability by XU-Net with J-Uniward embedding is 23.4%, and it jumps by +25.8% to reach 49.2% at iteration 3.

When Should You Defend Your Classifier? A Game-Theoretical Analysis of Countermeasures Against Adversarial Examples

Authors: Samsinger, M., Merkle, F., Schöttle, P., doc. Ing. Tomáš Pevný, Ph.D.
Publication: International Conference on Decision and Game Theory for Security. Basel: Springer Nature Switzerland AG, 2021. p. 158-177. ISSN 0302-9743. ISBN 978-3-030-90369-5.
Year: 2021

DOI: 10.1007/978-3-030-90370-1_9
Link: https://doi.org/10.1007/978-3-030-90370-1_9
Department: Artificial Intelligence Center
Abstract:
Adversarial machine learning, i.e., increasing the robustness of machine learning algorithms against so-called adversarial examples, is now an established field. Yet, newly proposed methods are evaluated and compared under unrealistic scenarios where costs for adversary and defender are not considered and either all samples or no samples are adversarially perturbed. We scrutinize these assumptions and propose the advanced adversarial classification game, which incorporates all relevant parameters of an adversary and a defender. In particular, we take into account economic factors on both sides and the fact that all countermeasures against adversarial examples proposed so far reduce accuracy on benign samples. Analyzing the scenario in detail, where both players have two pure strategies, we identify all best responses and conclude that in practical settings, the most influential factor might be the maximum number of adversarial examples.

Anomaly explanation with random forests

Authors: Kopp, M., doc. Ing. Tomáš Pevný, Ph.D., Holeňa, M.
Publication: Expert Systems with Applications. 2020, 2020(149), ISSN 0957-4174.
Year: 2020

DOI: 10.1016/j.eswa.2020.113187
Link: https://doi.org/10.1016/j.eswa.2020.113187
Department: Artificial Intelligence Center
Abstract:
Anomaly detection has become an important topic in many domains with many different solutions proposed until now. Despite that, there are only a few anomaly detection methods trying to explain how the sample differs from the rest. This work contributes to filling this gap because knowing why a sample is considered anomalous is critical in many application domains. The proposed solution uses a specific type of random forests to extract rules explaining the difference, which are then filtered and presented to the user as a set of classification rules sharing the same consequent, or as the equivalent rule with an antecedent in a disjunctive normal form. The quality of that solution is documented by comparison with the state of the art algorithms on 34 real-world datasets.

Classification with Costly Features as a Sequential Decision-making Problem

Authors: Janisch, J., doc. Ing. Tomáš Pevný, Ph.D., doc. Mgr. Viliam Lisý, MSc., Ph.D.
Publication: Machine Learning. 2020, 109(8), 1587-1615. ISSN 0885-6125.
Year: 2020

DOI: 10.1007/s10994-020-05874-8
Link: https://doi.org/10.1007/s10994-020-05874-8
Department: Artificial Intelligence Center
Abstract:
This work focuses on a specific classification problem, where the information about a sample is not readily available, but has to be acquired for a cost, and there is a per-sample budget. Inspired by real-world use-cases, we analyze average and hard variations of a directly specified budget. We postulate the problem in its explicit formulation and then convert it into an equivalent MDP that can be solved with deep reinforcement learning. Also, we evaluate a real-world inspired setting with sparse training datasets with missing features. The presented method performs robustly well in all settings across several distinct datasets, outperforming other prior-art algorithms. The method is flexible, as showcased with all mentioned modifications, and can be improved with any domain-independent advancement in RL.

Detection of Alfven Eigenmodes on COMPASS with Generative Neural Networks

Authors: Škvára, V., prof. Ing. Václav Šmídl, Ph.D., doc. Ing. Tomáš Pevný, Ph.D., Seidl, J., Havránek, A., Tskhakaya, D.
Publication: Fusion Science & Technology. 2020, 76(8), 962-971. ISSN 1536-1055.
Year: 2020

DOI: 10.1080/15361055.2020.1820805
Link: https://doi.org/10.1080/15361055.2020.1820805
Department: Artificial Intelligence Center
Abstract:
Chirping Alfvén eigenmodes (AE) were observed at the COMPASS tokamak. They are believed to be driven by runaway electrons (RE) and as such, they provide a unique opportunity to study physics of non-linear interaction between RE and electromagnetic instabilities, including important topics of RE mitigation and losses. On COMPASS, they can be detected from spectrograms of certain magnetic probes. So far, their detection required a lot of manual effort since they occur rarely. We strive to automate this process using machine learning techniques based on generative neural networks. We present two different models that are trained using a smaller, manually labeled database and a larger unlabeled database from COMPASS experiments. On a number of experiments, we demonstrate that our approach is a viable option for automated detection of rare instabilities in tokamak plasma.

Loss Functions for Clustering in Multi-instance Learning

Authors: Dědič, M., doc. Ing. Tomáš Pevný, Ph.D., Bajer, L., Holeňa, M.
Publication: Proceedings of the 20th Conference Information Technologies - Applications and Theory (ITAT 2020). Aachen: CEUR Workshop Proceedings, 2020. p. 137-146. ISSN 1613-0073.
Year: 2020

Department: Artificial Intelligence Center
Abstract:
Multi-instance learning is one of the rapidly developing areas of machine learning. It is a supervised learning method, and this paper reports research into its unsupervised counterpart, multi-instance clustering. Whereas traditional clustering clusters points, multi-instance clustering clusters bags, i.e. multisets of points or of other kinds of objects. The paper focuses on the problem of loss functions for clustering. Three sophisticated loss functions used for clustering of points, contrastive predictive coding, triplet loss and magnet loss, are elaborated for multi-instance clustering. Finally, they are compared on 18 benchmark datasets, as well as on a real-world dataset.

Neural Power Units

Authors: Heim, N., doc. Ing. Tomáš Pevný, Ph.D., prof. Ing. Václav Šmídl, Ph.D.
Publication: Advances in Neural Information Processing Systems 33 (NeurIPS 2020). Montreal: Neural Information Processing Society, 2020. ISSN 1049-5258.
Year: 2020

Department: Artificial Intelligence Center
Abstract:
Conventional Neural Networks can approximate simple arithmetic operations, but fail to generalize beyond the range of numbers that were seen during training. Neural Arithmetic Units aim to overcome this difficulty, but current arithmetic units are either limited to operate on positive numbers or can only represent a subset of arithmetic operations. We introduce the Neural Power Unit (NPU) that operates on the full domain of real numbers and is capable of learning arbitrary power functions in a single layer. The NPU thus fixes the shortcomings of existing arithmetic units and extends their expressivity. We achieve this by using complex arithmetic without requiring a conversion of the network to complex numbers. A simplification of the unit to the RealNPU yields a highly transparent model. We show that the NPUs outperform their competitors in terms of accuracy and sparsity on artificial arithmetic datasets, and that the RealNPU can discover the governing equations of a dynamical system only from data.

Sum-Product-Transform Networks: Exploiting Symmetries using Invertible Transformations

Authors: doc. Ing. Tomáš Pevný, Ph.D., prof. Ing. Václav Šmídl, Ph.D., Trapp, M., Poláček, O., Oberhuber, T.
Publication: Proceedings of the 10th International Conference on Probabilistic Graphical Models. Proceedings of Machine Learning Research, 2020. p. 341-352. vol. 138. ISSN 2640-3498.
Year: 2020

Department: Artificial Intelligence Center
Abstract:
We propose Sum-Product-Transform Networks (SPTN), an extension of sum-product networks that uses invertible transformations as additional internal nodes. The type and placement of transformations determine properties of the resulting SPTN with many interesting special cases. Importantly, in SPTNs with Gaussian leaves and affine transformations, the inference tasks that can be computed efficiently in SPNs remain tractable. We propose to store and optimize affine transformations in their SVD decompositions using an efficient parametrization of unitary matrices by a set of Givens rotations. Last but not least, we demonstrate that G-SPTNs push the state of the art on the density estimation task on the datasets used.

Classification with Costly Features Using Deep Reinforcement Learning

Authors: Janisch, J., doc. Ing. Tomáš Pevný, Ph.D., doc. Mgr. Viliam Lisý, MSc., Ph.D.
Publication: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence. Menlo Park, California: AAAI Press, 2019. p. 3959-3966. ISSN 2159-5399. ISBN 978-1-57735-809-1.
Year: 2019

DOI: 10.1609/aaai.v33i01.33013959
Link: https://doi.org/10.1609/aaai.v33i01.33013959
Department: Artificial Intelligence Center
Abstract:
We study a classification problem where each feature can be acquired for a cost and the goal is to optimize a trade-off between the expected classification error and the feature cost. We revisit a former approach that has framed the problem as a sequential decision-making problem and solved it by Q-learning with a linear approximation, where individual actions are either requests for feature values or terminate the episode by providing a classification decision. On a set of eight problems, we demonstrate that by replacing the linear approximation with neural networks the approach becomes comparable to the state-of-the-art algorithms developed specifically for this problem. The approach is flexible, as it can be improved with any new reinforcement learning enhancement, it allows inclusion of a pre-trained high-performance classifier, and unlike prior art, its performance is robust across all evaluated datasets.

Exploiting Adversarial Embeddings for Better Steganography

Authors: Bernard, S., doc. Ing. Tomáš Pevný, Ph.D., Bas, P., Klein, J.
Publication: Proceedings of the ACM Workshop on Information Hiding and Multimedia Security. New York: ACM, 2019. p. 216-221. ISBN 978-1-4503-6821-6.
Year: 2019

DOI: 10.1145/3335203.3335737
Link: https://doi.org/10.1145/3335203.3335737
Department: Artificial Intelligence Center
Abstract:
This work proposes a protocol to iteratively build a distortion function for adaptive steganography while increasing its practical security after each iteration. It relies on prior art on targeted attacks and iterative design of steganalysis schemes. It combines targeted attacks on a given detector with a min max strategy, which dynamically selects the most difficult stego content associated with the best classifier at each iteration. We theoretically prove the convergence, which is confirmed by the practical results. Applied to J-Uniward, this new protocol increases the error probability P-err from 7% to 20% estimated by Xu-Net, and from 10% to 23% for a non-targeted steganalysis by a linear classifier with GFR features.

Joint Detection of Malicious Domains and Infected Clients

Authors: Prasse, P., Knaebel, R., Machlica, L., doc. Ing. Tomáš Pevný, Ph.D., Scheffer, T.
Publication: Machine Learning. 2019, 108(8-9), 1353-1368. ISSN 0885-6125.
Year: 2019

DOI: 10.1007/s10994-019-05789-z
Link: https://doi.org/10.1007/s10994-019-05789-z
Department: Artificial Intelligence Center
Abstract:
Detection of malware-infected computers and detection of malicious web domains based on their encrypted HTTPS traffic are challenging problems, because only addresses, timestamps, and data volumes are observable. The detection problems are coupled, because infected clients tend to interact with malicious domains. Traffic data can be collected at a large scale, and antivirus tools can be used to identify infected clients in retrospect. Domains, by contrast, have to be labeled individually after forensic analysis. We explore transfer learning based on sluice networks; this allows the detection models to bootstrap each other. In a large-scale experimental study, we find that the model outperforms known reference models and detects previously unknown malware, previously unknown malware families, and previously unknown malicious domains.

Orthogonal Approximation of Marginal Likelihood of Generative Models

Authors: Šmídl, V., Bím, J., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of the Bayesian Deep Learning. Amsterdam: University of Amsterdam, 2019.
Year: 2019

Department: Artificial Intelligence Center
Abstract:
This paper presents a new approximation of the marginal likelihood of generative models which is used as a score for anomaly detection. The score is motivated by the shortcoming of the popular reconstruction error that it can behave arbitrarily outside the known samples. The proposed score corrects this by orthogonal combination of the reconstruction error and the likelihood in the latent space. As experimentally shown on benchmark problems from anomaly detection and illustrated on a toy problem, this combination lends the score robustness to outliers. Generative models evaluated with this score outperformed the competing methods especially in tasks of learning distribution from data corrupted by anomalies. Finally, the score is compatible with contemporary generative models, namely variational auto-encoders and generative adversarial networks.

Rodent: Relevance determination in ODE

Authors: Heim, N., prof. Ing. Václav Šmídl, Ph.D., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of the Bayesian Deep Learning. Amsterdam: University of Amsterdam, 2019.
Year: 2019

Department: Artificial Intelligence Center
Abstract:
From a set of observed trajectories of a partially observed system, we aim to learn its underlying (physical) process without having to make too many assumptions about the generating model. We start with a very general, over-parameterized ordinary differential equation (ODE) of order N and learn the minimal complexity of the model, by which we mean both the order of the ODE as well as the minimum number of non-zero parameters that are needed to solve the problem. The minimal complexity is found by applying a combination of the Variational Auto-Encoder (VAE) and Automatic Relevance Determination (ARD) to the problem of learning the parameters of an ODE, an approach which we call Rodent. We show that it is possible to learn not only one specific model for a single process, but a manifold of models representing harmonic signals in general.

Exploring Non-Additive Distortion in Steganography

Authors: doc. Ing. Tomáš Pevný, Ph.D., Ker, A.D.
Publication: Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security. New York: ACM, 2018. p. 109-114. ISBN 978-1-4503-5625-1.
Year: 2018

DOI: 10.1145/3206004.3206015
Link: https://doi.org/10.1145/3206004.3206015
Department: Artificial Intelligence Center
Abstract:
Leading steganography systems make use of the Syndrome-Trellis Code (STC) algorithm to minimize a distortion function while encoding the desired payload, but this constrains the distortion function to be additive. The Gibbs Embedding algorithm works for a certain class of non-additive distortion functions, but has its own limitations and is highly complex. In this short paper we show that it is possible to modify the STC algorithm in a simple way, to minimize a non-additive distortion function suboptimally. We use it for two examples. First, applying it to the S-UNIWARD distortion function, we show that it does indeed reduce distortion, compared with minimizing the additive approximation currently used in image steganography, but that it makes the payload more -- not less -- detectable. This parallels research attempting to use Gibbs Embedding for the same task. Second, we apply it to distortion defined by the output of a specific detector, as a counter-move in the steganography game. However, unless the Warden is forced to move first (by fixing the detector) this is highly detectable.

Multiple instance learning for malware classification

Authors: Stiborek, J., doc. Ing. Tomáš Pevný, Ph.D., Rehák, M.
Publication: Expert Systems with Applications. 2018, 2018(93), 346-357. ISSN 0957-4174.
Year: 2018

DOI: 10.1016/j.eswa.2017.10.036
Link: https://doi.org/10.1016/j.eswa.2017.10.036
Department: Artificial Intelligence Center
Abstract:
This work addresses classification of unknown binaries executed in a sandbox by modeling their interaction with system resources (files, mutexes, registry keys and communication with servers over the network) and error messages provided by the operating system, using a vocabulary-based method from the multiple instance learning paradigm. It introduces similarities suitable for individual resource types that, combined with an approximative clustering method, efficiently group the system resources and define features directly from data. This approach effectively removes randomization often employed by malware authors and projects samples into a low-dimensional feature space suitable for common classifiers. An extensive comparison to the state of the art on a large corpus of binaries demonstrates that the proposed solution achieves superior results using only a fraction of training samples. Moreover, it makes use of a source of information different than most of the prior art, which increases the diversity of tools detecting the malware, hence making detection evasion more difficult.

Network traffic fingerprinting based on approximated kernel two-sample test

Authors: Kohout, J., doc. Ing. Tomáš Pevný, Ph.D.
Publication: IEEE Transactions on Information Forensics and Security. 2018, 13(3), 788-801. ISSN 1556-6013.
Year: 2018

DOI: 10.1109/TIFS.2017.2768018
Link: https://doi.org/10.1109/TIFS.2017.2768018
Department: Artificial Intelligence Center
Abstract:
Many applications and communication protocols exhibit unique communication patterns that can be exploited to identify them in network traffic. This work proposes a method to represent these patterns compactly such that they can be used in different analytical tasks. The method treats each communication as a set of observations of a random variable with unknown probability distribution. This view allows the representation to be derived from a distance between two probability distributions used in Maximum Mean Discrepancy, a non-parametric kernel test. The representation (and distance) can then be easily used in various algorithms for identification of communicating application and data analysis, independently of the specific type of input data.
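
For orientation, the distance underlying Maximum Mean Discrepancy can be sketched as follows. This is the plain biased estimate of the squared MMD with a Gaussian kernel, not the approximated representation proposed in the paper; the kernel choice, the `gamma` parameter, and the toy observations are assumptions made here for illustration.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_kernel(xs, ys, gamma=1.0):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two sets of observations."""
    return (mean_kernel(xs, xs, gamma)
            + mean_kernel(ys, ys, gamma)
            - 2.0 * mean_kernel(xs, ys, gamma))


# Toy observations standing in for three communications (made-up values).
flows_a = [(0.0, 1.0), (0.1, 0.9), (0.2, 1.1)]
flows_b = [(0.1, 1.0), (0.0, 1.1), (0.2, 0.9)]
flows_c = [(3.0, 4.0), (3.1, 3.9), (2.9, 4.2)]
print(mmd2(flows_a, flows_b), mmd2(flows_a, flows_c))
```

Sets drawn from similar distributions give a value near zero, while clearly different sets give a larger one.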

Probabilistic analysis of dynamic malware traces

Authors: Stiborek, J., doc. Ing. Tomáš Pevný, Ph.D., Rehák, M.
Publication: Computers & Security. 2018, 74, 221-239. ISSN 1872-6208.
Year: 2018

DOI: 10.1016/j.cose.2018.01.012
Link: https://doi.org/10.1016/j.cose.2018.01.012
Department: Artificial Intelligence Center
Abstract:
We propose a method to automatically group unknown binaries executed in a sandbox according to their interaction with system resources (files on the filesystem, mutexes, registry keys, network communication with remote servers and error messages generated by the operating system) such that each group corresponds to a malware family. The method utilizes a probabilistic generative model (Bernoulli mixture model), which allows human-friendly prioritization of identified clusters and extraction of readable behavioral indicators to maximize interpretability. We compare it to relevant prior art on a large set of malware binaries, where the quality of cluster prioritization and automatic extraction of indicators of compromise is demonstrated. The proposed approach therefore implements a complete pipeline which has the potential to significantly speed up analysis of unknown samples.

Malware Detection by Analysing Encrypted Network Traffic with Neural Networks

Authors: Prasse, P., Machlica, L., doc. Ing. Tomáš Pevný, Ph.D., Havelka, J., Scheffer, T.
Publication: Machine Learning and Knowledge Discovery in Databases. Cham: Springer International Publishing AG, 2017. p. 73-88. vol. I, II, III. ISSN 0302-9743. ISBN 978-3-319-71245-1.
Year: 2017

DOI: 10.1007/978-3-319-71246-8_5
Link: https://doi.org/10.1007/978-3-319-71246-8_5
Department: Artificial Intelligence Center
Abstract:
We study the problem of detecting malware on client computers based on the analysis of HTTPS traffic. Here, malware has to be detected based on the host address, timestamps, and data volume information of the computer’s network traffic. We develop a scalable protocol that allows us to collect network flows of known malicious and benign applications as training data and derive a malware-detection method based on a neural embedding of domain names and a long short-term memory network that processes network flows. We study the method’s ability to detect new malware in a large-scale empirical study.

Optimal Strategies for Detecting Data Exfiltration by Internal and External Attackers

Authors: Durkota, K., doc. Mgr. Viliam Lisý, MSc., Ph.D., Kiekintveld, C., Ing. Karel Horák, Ph.D., doc. Mgr. Branislav Bošanský, Ph.D., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Düsseldorf: Springer VDI Verlag, 2017. p. 171-192. ISSN 0302-9743. ISBN 978-3-319-68710-0.
Year: 2017

DOI: 10.1007/978-3-319-68711-7_10
Link: https://doi.org/10.1007/978-3-319-68711-7_10
Department: Department of Computer Science, Artificial Intelligence Center
Abstract:
We study the problem of detecting data exfiltration in computer networks. We focus on the performance of optimal defense strategies with respect to an attacker’s knowledge about typical network behavior and his ability to influence the standard traffic. Internal attackers know the typical upload behavior of the compromised host and may be able to discontinue standard uploads in favor of the exfiltration. External attackers do not immediately know the behavior of the compromised host, but they can learn it from observations. We model the problem as a sequential game of imperfect information, where the network administrator selects the thresholds for the detector, while the attacker chooses how much data to exfiltrate in each time step. We present novel algorithms for approximating the optimal defense strategies in the form of Stackelberg equilibria. We analyze the scalability of the algorithms and efficiency of the produced strategies in a case study based on real-world uploads of almost six thousand users to Google Drive. We show that with the computed defense strategies, the attacker exfiltrates 2–3 times less data than with simple heuristics; randomized defense strategies are up to 30% more effective than deterministic ones, and substantially more effective defense strategies are possible if the defense is customized for groups of hosts with similar behavior.

Reducing False Positives of Network Anomaly Detection by Local Adaptive Multivariate Smoothing

Authors: Grill, M., doc. Ing. Tomáš Pevný, Ph.D., Rehák, M.
Publication: Journal of Computer and System Sciences. 2017, 83(1), 43-57. ISSN 0022-0000.
Year: 2017

DOI: 10.1016/j.jcss.2016.03.007
Link: https://doi.org/10.1016/j.jcss.2016.03.007
Department: Artificial Intelligence Center
Abstract:
Network intrusion detection systems based on the anomaly detection paradigm have a high false alarm rate, making them difficult to use. To address this weakness, we propose to smooth the outputs of anomaly detectors by online Local Adaptive Multivariate Smoothing (LAMS). LAMS can reduce a large portion of false positives introduced by the anomaly detection by replacing the anomaly detector's output on a network event with an aggregate of its output on all similar network events observed previously. The arguments are supported by extensive experimental evaluation involving several anomaly detectors in two domains: NetFlow and proxy logs. Finally, we show how the proposed solution can be efficiently implemented to process large streams of non-stationary data.
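
The smoothing idea described in the abstract, replacing a detector's output with an aggregate over similar past events, can be sketched as a kernel-weighted average. This is only an illustration under assumptions made here: the class name, the Gaussian similarity, the `bandwidth` parameter, and the unbounded history are hypothetical and do not reproduce the efficient streaming implementation reported in the paper.

```python
import math


class LocalSmoother:
    """Kernel-weighted average of detector scores over events seen so far."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth
        self.history = []  # list of (features, raw_score) pairs

    def _weight(self, x, y):
        # Gaussian similarity between two feature vectors (assumed choice).
        d2 = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-d2 / (2.0 * self.bandwidth ** 2))

    def update(self, features, raw_score):
        """Store the event and return its smoothed score."""
        self.history.append((features, raw_score))
        num = sum(self._weight(features, f) * s for f, s in self.history)
        den = sum(self._weight(features, f) for f, _ in self.history)
        return num / den  # den >= 1 because the event matches itself


smoother = LocalSmoother(bandwidth=1.0)
for _ in range(10):
    smoother.update((0.0, 0.0), 0.1)        # ten similar, low-scoring events
spike = smoother.update((0.0, 0.0), 0.9)    # isolated high score among them
far = smoother.update((100.0, 100.0), 0.9)  # high score with no similar history
print(spike, far)
```

A high score on an event that resembles many low-scoring events is pulled down, while a high score on an event unlike anything seen before is left essentially unchanged.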

Using Neural Network Formalism to Solve Multiple-Instance Problems

Authors: doc. Ing. Tomáš Pevný, Ph.D., Somol, P.
Publication: Advances in Neural Networks - ISNN 2017. Wien: Springer, 2017. p. 135-142. LNCS. vol. 10261. ISSN 0302-9743. ISBN 978-3-319-59071-4.
Year: 2017

DOI: 10.1007/978-3-319-59072-1_17
Link: https://doi.org/10.1007/978-3-319-59072-1_17
Department: Artificial Intelligence Center
Abstract:
Many objects in the real world are difficult to describe by means of a single numerical vector of a fixed length, whereas describing them by means of a set of vectors is more natural. Therefore, multiple instance learning (MIL) techniques have been constantly gaining in importance throughout the last years. MIL formalism assumes that each object (sample) is represented by a set (bag) of feature vectors (instances) of fixed length, where knowledge about objects (e.g., class label) is available on bag level but not necessarily on instance level. Many standard tools including supervised classifiers have been already adapted to MIL setting since the problem got formalized in the late nineties. In this work we propose a neural network (NN) based formalism that intuitively bridges the gap between MIL problem definition and the vast existing knowledge-base of standard models and classifiers. We show that the proposed NN formalism is effectively optimizable by a back-propagation algorithm and can reveal unknown patterns inside bags. Comparison to 14 types of classifiers from the prior art on a set of 20 publicly available benchmark datasets confirms the advantages and accuracy of the proposed solution.

Discriminative Models for Multi-instance Problems with Tree Structure

Authors: doc. Ing. Tomáš Pevný, Ph.D., Somol, Petr
Publication: Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security. New York: ACM, 2016. p. 83-91. ISBN 978-1-4503-4573-6.
Year: 2016

DOI: 10.1145/2996758.2996761
Link: https://doi.org/10.1145/2996758.2996761
Department: Artificial Intelligence Center
Abstract:
Modelling network traffic is gaining importance to counter modern security threats of ever increasing sophistication. It is, however, surprisingly difficult and costly to construct reliable classifiers on top of telemetry data due to the variety and complexity of signals that no human can manage to interpret in full. Obtaining training data with a sufficiently large and variable body of labels can thus be seen as a prohibitive problem. The goal of this work is to detect infected computers by observing their HTTP(S) traffic collected from network sensors, which are typically proxy servers or network firewalls, while relying on only minimal human input in the model training phase. We propose a discriminative model that makes decisions based on all of a computer's traffic observed during a predefined time window (5 minutes in our case). The model is trained on traffic samples collected over equally-sized time windows for a large number of computers, where the only labels needed are (human) verdicts about the computer as a whole (presumed infected vs. presumed clean). As part of training, the model itself learns discriminative patterns in traffic targeted to individual servers and constructs the final high-level classifier on top of them. We show the classifier to perform with very high precision, and demonstrate that the learned traffic patterns can be interpreted as Indicators of Compromise. We implement the discriminative model as a neural network with a special structure reflecting two stacked multi-instance problems. The main advantages of the proposed configuration include not only improved accuracy and ability to learn from gross labels, but also automatic learning of server types (together with their detectors) that are typically visited by infected computers.

Feature Extraction and Malware Detection on Large HTTPS Data Using MapReduce

Authors: Čech, Přemysl, Kohout, J., Lokoč, Jakub, Komárek, T., Maroušek, Jakub, doc. Ing. Tomáš Pevný, Ph.D.
Publication: Similarity Search and Applications. Basel: Springer, 2016. p. 311-324. vol. 9939. ISSN 0302-9743. ISBN 978-3-319-46758-0.
Year: 2016

DOI: 10.1007/978-3-319-46759-7_24
Link: https://doi.org/10.1007/978-3-319-46759-7_24
Department: Artificial Intelligence Center
Abstract:
Secure HTTP network traffic represents a challenging and immense data source for machine learning tasks. The tasks usually try to learn and identify infected network nodes, given only the limited traffic features available for secure HTTP data. In this paper, we investigate the performance of grid histograms that can be used to aggregate traffic features of network nodes considering just 5-min batches for snapshots. We compare the representation using linear and k-NN classifiers. We also demonstrate that all presented feature extraction and classification tasks can be implemented in a scalable way using the MapReduce approach.

k-NN Classification of Malware in HTTPS Traffic Using the Metric Space Approach

Autoři: Lokoč, J., Kohout, J., Čech, P., Skopal, T., doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: Intelligence and Security Informatics. Düsseldorf: Springer VDI Verlag, 2016. p. 131-145. ISSN 0302-9743. ISBN 978-3-319-31862-2.
Rok: 2016

DOI: 10.1007/978-3-319-31863-9_10
Odkaz: https://doi.org/10.1007/978-3-319-31863-9_10
Pracoviště: Centrum umělé inteligence
Anotace:
In this paper, we present detection of malware in HTTPS traffic using k-NN classification. We focus on the metric space approach for approximate k-NN searches over a dataset of sparse high-dimensional descriptors of network traffic. We show that classification based on approximate k-NN search using a metric index exhibits a false positive rate reduced by an order of magnitude when compared to the state-of-the-art method, while keeping the classification fast enough.

Learning Combination of Anomaly Detectors for Security Domain

Autoři: Grill, M., doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: Computer Networks. 2016, 107(1), 55-63. ISSN 1389-1286.
Rok: 2016

DOI: 10.1016/j.comnet.2016.05.021
Odkaz: https://doi.org/10.1016/j.comnet.2016.05.021
Pracoviště: Centrum umělé inteligence
Anotace:
This paper presents a novel technique of finding a convex combination of outputs of anomaly detectors maximizing the accuracy in τ-quantile of most anomalous samples. Such an approach better reflects the needs in the security domain in which subsequent analysis of alarms is costly and can be done only on a small number of alarms. An extensive experimental evaluation and comparison to prior art on real network data using sets of anomaly detectors of two existing intrusion detection systems shows that the proposed method not only outperforms prior art, it is also more robust to noise in training data labels, which is another important feature for deployment in practice.
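The objective described in this abstract can be illustrated with a minimal sketch. It assumes two hypothetical detectors with toy outputs, measures the fraction of true anomalies among the τ-fraction of highest-scored samples, and searches the convex weight by brute force. The grid search, the function names, and the data are illustrative assumptions, not the optimization method proposed in the paper.

```python
def top_quantile_precision(scores, labels, tau):
    """Fraction of true anomalies among the tau-fraction of highest-scored samples."""
    k = max(1, int(round(tau * len(scores))))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:k]) / k


def combine(detector_outputs, weights):
    # Convex combination: weights are non-negative and sum to one.
    n = len(detector_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, detector_outputs))
            for i in range(n)]


# Toy outputs of two hypothetical detectors on six samples; label 1 = anomaly.
det_a = [0.9, 0.1, 0.8, 0.2, 0.1, 0.3]
det_b = [0.2, 0.9, 0.7, 0.1, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0]

# Brute-force grid over the weight of detector A (illustration only).
best_weight, best_precision = None, -1.0
for step in range(11):
    w = step / 10
    combined = combine([det_a, det_b], [w, 1 - w])
    precision = top_quantile_precision(combined, labels, tau=1 / 3)
    if precision > best_precision:
        best_weight, best_precision = w, precision

print(best_weight, best_precision)
```

Evaluating only the top τ-quantile mirrors the setting in the abstract, where only a small number of alarms can be analysed afterwards.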

Loda: Lightweight on-line detector of anomalies

Autoři: doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: Machine Learning. 2016, 102(2), 275-304. ISSN 0885-6125.
Rok: 2016

DOI: 10.1007/s10994-015-5521-0
Odkaz: https://doi.org/10.1007/s10994-015-5521-0
Pracoviště: Katedra počítačů
Anotace:
In supervised learning it has been shown that a collection of weak classifiers can result in a strong classifier with error rates similar to those of more sophisticated methods. In unsupervised learning, namely in anomaly detection such a paradigm has not yet been demonstrated despite the fact that many methods have been devised as counterparts to supervised binary classifiers. This work partially fills the gap by showing that an ensemble of very weak detectors can lead to a strong anomaly detector with a performance equal to or better than state of the art methods. The simplicity of the proposed ensemble system (to be called Loda) is particularly useful in domains where a large number of samples need to be processed in real-time or in domains where the data stream is subject to concept drift and the detector needs to be updated on-line. Besides being fast and accurate, Loda is also able to operate and update itself on data with missing variables. Loda is thus practical in domains with sensor outages. Moreover, Loda can identify features in which the scrutinized sample deviates from the majority. This capability is useful when the goal is to find out what has caused the anomaly. It should be noted that none of these favorable properties increase Loda’s low time and space complexity. We compare Loda to several state of the art anomaly detectors in two settings: batch training and on-line training on data streams. The results on 36 datasets from UCI repository illustrate the strengths of the proposed system, but also provide more insight into the more general questions regarding batch-vs-on-line anomaly detection.

Malicons: Detecting Payload in Favicons

Autoři: doc. Ing. Tomáš Pevný, Ph.D., Kopp, M., Křoustek, J., Ker, A.D.
Publikace: Media Watermarking, Security, and Forensics 2016. Society for Imaging Science and Technology, 2016. ISSN 2470-1173.
Rok: 2016

DOI: 10.2352/ISSN.2470-1173.2016.8.MWSF-079
Odkaz: https://doi.org/10.2352/ISSN.2470-1173.2016.8.MWSF-079
Pracoviště: Katedra počítačů
Anotace:
A recent version of the "Vawtrak" malware used steganography to hide the addresses of the command and control channels in favicons: small images automatically downloaded by the web browser. Since almost all research in steganalysis focuses on natural images, we study how well these methods can detect secret messages in favicons. The study is performed on a large corpus of favicons downloaded from the internet and applies a number of state-of-art steganalysis techniques, as well as proposing very simple novel features that exploit flat areas in favicons. The ultimate question is whether we can detect Vawtrak's steganographic favicons with a sufficiently low false positive rate.

Passive NAT Detection Using HTTP Access Logs

Autoři: Komárek, T., Grill, M., doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: 2016 IEEE International Workshop on Information Forensics and Security. IEEE, 2016. ISSN 2157-4774. ISBN 978-1-5090-1138-4.
Rok: 2016

DOI: 10.1109/WIFS.2016.7823896
Odkaz: https://doi.org/10.1109/WIFS.2016.7823896
Pracoviště: Centrum umělé inteligence
Anotace:
Network devices performing Network Address Translation (NAT) alleviate the shortage of IPv4 addresses, but possibly insecure configurations can also introduce a vulnerability to the network. Detection of unauthorized NAT devices is therefore an important task in the network security domain. In this paper, a novel passive NAT detection algorithm is proposed that identifies NAT devices in the network using statistical behavior analysis. We model the behavior of network hosts using eight features extracted from HTTP access logs. These features are collected within consecutive non-overlapping time windows covering the last 24 hours. To classify whether a host is a NAT device or an end host (non-NAT device), a pre-trained linear classifier is used. Since labeled data for training purposes is hard to obtain, we also propose a way to generate the training data from unlabeled traffic logs.

Rethinking Optimal Embedding

Autoři: Ker, Andrew D., doc. Ing. Tomáš Pevný, Ph.D., Bas, Patrick
Publikace: Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security. New York: ACM, 2016. pp. 93-102. ISBN 978-1-4503-4290-2.
Rok: 2016

DOI: 10.1145/2909827.2930797
Odkaz: https://doi.org/10.1145/2909827.2930797
Pracoviště: Centrum umělé inteligence
Anotace:
At present, almost all leading steganographic techniques for still images use a distortion minimization paradigm, where each potential change is assigned a cost $c_i$ and the change probabilities $\pi_i$ chosen to minimize the average total cost $\sum_i \pi_i c_i$. However, some detectors have exploited knowledge of this adaptivity and the embedding cannot be considered optimal. In this work we prove a theoretical result suggesting that, against a knowing attacker, the embedder should simply minimize $\sum_i \pi_i^2 c_i$ instead, for the same costs $c_i$, which is the minimax and equilibrium strategy. This aligns with some special case results that have appeared in recent literature. We then test some simple steganographic methods in theoretical and real settings, showing that naive (average cost) adaptivity is exploitable, but the equilibrium probabilities cannot be exploited. However, it is essential to determine statistically well-founded costs $c_i$.

Using Behavioral Similarity for Botnet Command-and-Control Discovery

Autoři: Jusko, J., Rehák, M., Stiborek, J., Kohout, J., doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: IEEE Intelligent Systems. 2016, 31(5), 16-22. ISSN 1541-1672.
Rok: 2016

DOI: 10.1109/MIS.2016.88
Odkaz: https://doi.org/10.1109/MIS.2016.88
Pracoviště: Centrum umělé inteligence
Anotace:
Malware authors and operators typically collaborate to achieve the optimal profit. They also frequently change their behavior and resources to avoid detection. The authors propose a social similarity metric that exploits these relationships to improve the effectiveness and stability of the threat propagation algorithm typically used to discover malicious collaboration. Furthermore, they propose behavioral modeling as a way to group similarly behaving servers, enabling extension of the ground truth that is so expensive to obtain in the field of network security. The authors also show that seeding the threat propagation algorithm from a set of coherently behaving servers (instead of from a single known malicious server identified by threat intelligence) makes the algorithm far more effective and significantly more robust, without compromising the precision of findings.

Automatic Discovery of Web Servers Hosting Similar Applications

Autoři: Kohout, J., doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: Proceedings of the 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM). Piscataway: IEEE, 2015. p. 1310-1315. ISBN 978-3-901882-76-0.
Rok: 2015

DOI: 10.1109/INM.2015.7140487
Odkaz: https://doi.org/10.1109/INM.2015.7140487
Pracoviště: Katedra počítačů
Anotace:
Increasingly popular cloud services frequently have many functional parts, which makes their structure rather complex, yet understanding that structure improves network monitoring for security purposes, traffic routing, etc. Since the structure of third-party services is typically unknown, automated tools for its discovery are greatly needed. In this work, we propose such a tool relying only on high-level statistics of servers' usage, such as volumes and times of interactions with the servers. Because it does not look into the communication contents, the method works for encrypted channels as well, which is experimentally demonstrated on the Dropbox service and the Windows Live platform.

Finding New Malicious Domains Using Variational Bayes on Large-Scale Computer Network Data

Autoři: Létal, V., doc. Ing. Tomáš Pevný, Ph.D., Somol, Petr, Smidl, Vasek
Publikace: Proceedings of NIPS workshop on Advances in Approximate Bayesian Inference. Montreal: Neural Information Processing Society, 2015, Available from: http://www.approximateinference.org/accepted/LetalEtAl2015.pdf
Rok: 2015

Pracoviště: Katedra počítačů
Anotace:
The common limitation in computer network security is the reactive nature of defenses. A new type of infection typically needs to be first observed live, before defensive measures can be taken. To improve the pro-active measures, we have developed a method utilizing the WHOIS database (a database of entities that have registered a particular domain) to model relations between domains, even those not yet used. The model estimates the probability of a domain name being used for malicious purposes from observed connections to other related domains. The parameters of the model are inferred by a Variational Bayes method, and its effectiveness is demonstrated on large-scale network data with millions of domains and trillions of connections to them.

Is Ensemble Classifier Needed for Steganalysis in High-Dimensional Feature Spaces?

Autoři: Cogranne, Remi, Sedighi, Vahid, Fridrich, Jessica, doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: Proceedings of the 7th International Workshop on Forensics and Security. New Jersey: IEEE Signal Processing Society, 2015. pp. 1-6. ISBN 978-1-4673-6802-5.
Rok: 2015

DOI: 10.1109/WIFS.2015.7368597
Odkaz: https://doi.org/10.1109/WIFS.2015.7368597
Pracoviště: Katedra počítačů
Anotace:
The ensemble classifier, based on Fisher Linear Discriminant base learners, was introduced specifically for steganalysis of digital media, which currently uses high-dimensional feature spaces. Presently it is probably the most used method to design a supervised classifier for steganalysis of digital images because of its good detection accuracy and small computational cost. It has been assumed by the community that the classifier implements a non-linear boundary through pooling binary decisions of individual classifiers within the ensemble. This paper challenges this assumption by showing that a linear classifier obtained by various regularizations of the FLD can perform equally well as the ensemble. Moreover, it demonstrates that, using state-of-the-art solvers, linear classifiers can be trained more efficiently and offer certain potential advantages over the original ensemble, leading to much lower computational complexity than the ensemble classifier. All claims are supported experimentally on a wide spectrum of stego schemes operating in both the spatial and JPEG domains with a multitude of rich steganalysis feature sets.

Optimizing pooling function for pooled steganalysis

Autoři: doc. Ing. Tomáš Pevný, Ph.D., Nikolaev, I.
Publikace: Proceedings of the 7th International Workshop on Forensics and Security. New Jersey: IEEE Signal Processing Society, 2015. pp. 1-6. ISBN 978-1-4673-6802-5.
Rok: 2015

DOI: 10.1109/WIFS.2015.7368555
Odkaz: https://doi.org/10.1109/WIFS.2015.7368555
Pracoviště: Katedra počítačů
Anotace:
Pooled steganalysis combines evidence from multiple objects to achieve higher accuracy in detecting hidden messages at the expense of granularity, as the decision is provided on the set of objects instead of a single one. Although it was introduced almost a decade ago, very little work has been done since then. This work builds upon recent advances in machine learning to show how an optimal function combining outputs of a single object detector on a set of objects can be learned. Although experiments demonstrate that learned combining functions are superior to the prior art, more importantly they reveal many interesting phenomena and point to directions for further research.

Towards dependable steganalysis

Autoři: doc. Ing. Tomáš Pevný, Ph.D., Ker, Andrew D.
Publikace: Proceedings of SPIE Media Watermarking, Security, and Forensics 2015. Bellingham (Washington): SPIE, 2015. Proceedings of SPIE. ISSN 0277-786X. ISBN 978-1-62841-499-8.
Rok: 2015

DOI: 10.1117/12.2083216
Odkaz: https://doi.org/10.1117/12.2083216
Pracoviště: Katedra počítačů
Anotace:
This paper considers the research goal of dependable steganalysis: where false positives occur once in a million or less, and this rate is known with high precision. Despite its importance for real-world application, there has been almost no study of steganalysis which produces very low false positives. We test existing and novel classifiers for their low false-positive performance, using millions of images from Flickr. Experiments on such a scale require considerable engineering. Standard steganalysis classifiers do not perform well in a low false-positive regime, and we make new proposals to penalize false positives more than false negatives.

Towards Scalable Network Host Simulation

Autoři: Stiborek, J., Rehák, M., doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: Autonomous Agents and Multiagent Systems. County of Richland: IFAAMAS, 2015, 14. ISSN 1548-8403. ISBN 978-1-4503-3771-7. Available from: http://www.lancaster.ac.uk/staff/suchj/acyse2015-proceedings/ACySe2015_submission_Stiborek.pdf
Rok: 2015

Pracoviště: Katedra počítačů
Anotace:
Anomaly detection techniques in network security face significant challenges in configuration and evaluation, as collecting data for accurate analysis is difficult or nearly impossible. One viable approach is to avoid live data collection and replace it by agent-based simulation of the network traffic with models of users' behavior. In this paper we propose three approaches differing by the level of detail with which user behavior is modeled.

Unsupervised Detection of Malware in Persistent Web Traffic

Autoři: Kohout, J., doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2015. pp. 1757-1761. ISSN 1520-6149. ISBN 978-1-4673-6997-8.
Rok: 2015

DOI: 10.1109/ICASSP.2015.7178272
Odkaz: https://doi.org/10.1109/ICASSP.2015.7178272
Pracoviště: Katedra počítačů
Anotace:
Persistent network communication can be found in many instances of malware. In this paper, we analyse the possibility of leveraging low variability of persistent malware communication for its detection. We propose a new method for capturing statistical fingerprints of connections and employ outlier detection to identify the malicious ones. Emphasis is put on using minimal information possible to make our method very lightweight and easy to deploy. Anomaly detection is commonly used in network security, yet to our best knowledge, there are not many works focusing on the persistent communication itself, without making further assumptions about its purpose.

A Memory Efficient Privacy Preserving Representation of Connection Graphs

Autoři: Rehák, M., Jusko, J., doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: ACySE '14 Proceedings of the 1st International Workshop on Agents and CyberSecurity. New York: ACM, 2014. ISBN 978-1-4503-2728-2.
Rok: 2014

DOI: 10.1145/2602945.2602947
Odkaz: https://doi.org/10.1145/2602945.2602947
Pracoviště: Katedra počítačů
Anotace:
Connection graphs are often used for network traffic classification and P2P network analysis. With the appearance of Software Defined Networks (SDN), a novel approach to proactive distributed network management based on the multi-agent paradigm, there is a need to develop specialized graph representations. Once transmitted between elements of an SDN network, they provide answers to specific queries while protecting other information about the graph. In this paper we propose one such graph representation based on Bloom Filters and show that it provides a considerable reduction of required memory and strong privacy while keeping a low false positive rate that does not have a negative impact on its intended use.
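A minimal sketch of the general idea of storing graph edges in a Bloom filter follows. It only shows how edge-membership queries can be answered from a fixed-size bit array without storing the edge list itself. The class name, the salted SHA-256 hashing scheme, and the parameters are assumptions made for illustration and are not taken from the paper.

```python
import hashlib


class EdgeBloomFilter:
    """Bloom filter over directed edges (src, dst) of a connection graph."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, src, dst):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        key = f"{src}->{dst}".encode("utf-8")
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(bytes([salt]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_edge(self, src, dst):
        for pos in self._positions(src, dst):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, src, dst):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(src, dst))


graph = EdgeBloomFilter()
graph.add_edge("10.0.0.1", "10.0.0.7")
print(graph.may_contain("10.0.0.1", "10.0.0.7"))  # inserted edges are always found
```

The memory footprint is fixed by `num_bits` regardless of how many edges are inserted; the price is a false positive rate that grows as the filter fills up.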

A mishmash of methods for mitigating the model mismatch mess

Autoři: Ker, Andrew D., doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: Proceedings of SPIE Media Watermarking, Security, and Forensics 2014. Washington: SPIE, 2014, ISSN 0277-786X. ISBN 978-0-8194-9945-5. Available from: http://dx.doi.org/10.1117/12.2038908
Rok: 2014

DOI: 10.1117/12.2038908
Odkaz: https://doi.org/10.1117/12.2038908
Pracoviště: Katedra počítačů
Anotace:
The model mismatch problem occurs in steganalysis when a binary classifier is trained on objects from one cover source and tested on another: an example of domain adaptation. It is highly realistic because a steganalyst would rarely have access to much or any training data from their opponent, and its consequences can be devastating to classifier accuracy. This paper presents an in-depth study of one particular instance of model mismatch, in a set of images from Flickr using one fixed steganography and steganalysis method, attempting to separate different effects of mismatch in feature space and find methods of mitigation where possible. We also propose new benchmarks for accuracy, which are more appropriate than mean error rates when there are multiple actors and multiple images, and consider the case of 3-valued detectors which also output `don't know'. This pilot study demonstrates that some simple feature-centering and ensemble methods can reduce the mismatch penalty considerably, but not completely remove it.

Explaining Anomalies with Sapling Random Forests

Autoři: doc. Ing. Tomáš Pevný, Ph.D., Kopp, M.
Publikace: Proceedings of the 14th conference ITAT 2014 – Workshops and Posters. Praha: Institute of Computer Science AS CR, 2014. p. 71-78. ISBN 978-80-87136-19-5.
Rok: 2014

Pracoviště: Centrum umělé inteligence
Anotace:
The main objective of anomaly detection algorithms is finding samples deviating from the majority. Although a vast number of algorithms designed for this already exist, almost none of them explain, why a particular sample was labelled as an anomaly. To address this issue, we propose an algorithm called Explainer, which returns the explanation of sample’s differentness in disjunctive normal form (DNF), which is easy to understand by humans. Since Explainer treats anomaly detection algorithms as black-boxes, it can be applied in many domains to simplify investigation of anomalies. The core of Explainer is a set of specifically trained trees, which we call sapling random forests. Since their training is fast and memory efficient, the whole algorithm is lightweight and applicable to large databases, data-streams, and real-time problems. The correctness of Explainer is demonstrated on a wide range of synthetic and real world datasets.

Explaining Anomalies with Sapling Random Forests

Autoři: doc. Ing. Tomáš Pevný, Ph.D., Kopp, M.
Publikace: Proceedings of the 14th conference ITAT 2014 – Workshops and Posters. Praha: Institute of Computer Science AS CR, 2014. pp. 71-78. ISBN 978-80-87136-19-5.
Rok: 2014

Pracoviště: Katedra počítačů
Anotace:
The main objective of anomaly detection algorithms is finding samples deviating from the majority. Although a vast number of algorithms designed for this already exist, almost none of them explain, why a particular sample was labelled as an anomaly. To address this issue, we propose an algorithm called Explainer, which returns the explanation of sample’s differentness in disjunctive normal form (DNF), which is easy to understand by humans. Since Explainer treats anomaly detection algorithms as black-boxes, it can be applied in many domains to simplify investigation of anomalies. The core of Explainer is a set of specifically trained trees, which we call sapling random forests. Since their training is fast and memory efficient, the whole algorithm is lightweight and applicable to large databases, data-streams, and real-time problems. The correctness of Explainer is demonstrated on a wide range of synthetic and real world datasets.

Interpreting and clustering outliers with sapling random forests

Autoři: Kopp, M., doc. Ing. Tomáš Pevný, Ph.D., Holeňa, M.
Publikace: Proceedings of the 14th conference ITAT 2014 – Workshops and Posters. Praha: Institute of Computer Science AS CR, 2014. pp. 61-67. ISBN 978-80-87136-19-5.
Rok: 2014

Pracoviště: Katedra počítačů
Anotace:
The main objective of outlier detection is finding samples considerably deviating from the majority. Such outliers, often referred to as anomalies, are nowadays more and more important, because they help to uncover interesting events within data. Consequently, a considerable amount of statistical and data mining techniques to identify anomalies was proposed in the last few years, but only a few works at least mentioned why some sample was labelled as an anomaly. Therefore, we propose a method based on specifically trained decision trees, called sapling random forest. Our method is able to interpret the output of arbitrary anomaly detector. The explanation is given as a subset of features, in which the sample is most deviating, or as conjunctions of atomic conditions, which can be viewed as antecedents of logical rules easily understandable by humans. To simplify the investigation of suspicious samples even more, we propose two methods of clustering anomalies into groups. Such clusters can be investigated at once saving time and human efforts. The feasibility of our approach is demonstrated on several synthetic and one real world datasets.

Interpreting and clustering outliers with sapling random forests

Autoři: Kopp, M., doc. Ing. Tomáš Pevný, Ph.D., Holeňa, M.
Publikace: Proceedings of the 14th conference ITAT 2014 – Workshops and Posters. Praha: Institute of Computer Science AS CR, 2014. p. 61-67. ISBN 978-80-87136-19-5.
Rok: 2014

Pracoviště: Centrum umělé inteligence
Anotace:
The main objective of outlier detection is finding samples considerably deviating from the majority. Such outliers, often referred to as anomalies, are nowadays more and more important, because they help to uncover interesting events within data. Consequently, a considerable amount of statistical and data mining techniques to identify anomalies was proposed in the last few years, but only a few works at least mentioned why some sample was labelled as an anomaly. Therefore, we propose a method based on specifically trained decision trees, called sapling random forest. Our method is able to interpret the output of arbitrary anomaly detector. The explanation is given as a subset of features, in which the sample is most deviating, or as conjunctions of atomic conditions, which can be viewed as antecedents of logical rules easily understandable by humans. To simplify the investigation of suspicious samples even more, we propose two methods of clustering anomalies into groups. Such clusters can be investigated at once saving time and human efforts. The feasibility of our approach is demonstrated on several synthetic and one real world datasets.

Randomized Operating Point Selection in Adversarial Classification

Autoři: doc. Mgr. Viliam Lisý, MSc., Ph.D., Kessl, R., doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: Machine Learning and Knowledge Discovery in Databases - ECML PKDD 2013, part II. Heidelberg: Springer, 2014. pp. 240-255. Lecture Notes in Computer Science. ISSN 0302-9743. ISBN 978-3-662-44850-2.
Rok: 2014

DOI: 10.1007/978-3-662-44851-9_16
Odkaz: https://doi.org/10.1007/978-3-662-44851-9_16
Pracoviště: Katedra počítačů
Anotace:
Security systems for email spam filtering, network intrusion detection, steganalysis, and watermarking, frequently use classifiers to separate malicious behavior from legitimate. Typically, they use a fixed operating point minimizing the expected cost / error. This allows a rational attacker to deliver invisible attacks just below the detection threshold. We model this situation as a non-zero sum normal form game capturing attacker’s expected payoffs for detected and undetected attacks, and detector’s costs for false positives and false negatives computed based on the Receiver Operating Characteristic (ROC) curve of the classifier. The analysis of Nash and Stackelberg equilibria reveals that using a randomized strategy over multiple operating points forces the rational attacker to design less efficient attacks and substantially lowers the expected cost of the detector. We present the equilibrium strategies for sample ROC curves from network intrusion detection system and evaluate the corresponding benefits.

Steganographic key leakage through payload metadata

Autoři: doc. Ing. Tomáš Pevný, Ph.D., Ker, Andrew D.
Publikace: Proceedings of the 2nd ACM workshop on Information hiding and multimedia security. New York: ACM, 2014. pp. 109-114. ISBN 978-1-4503-2647-6.
Rok: 2014

DOI: 10.1145/2600918.2600921
Odkaz: https://doi.org/10.1145/2600918.2600921
Pracoviště: Katedra počítačů
Anotace:
The only steganalysis attack which can provide absolute certainty about the presence of payload is one which finds the embedding key. In this paper we consider refined versions of the key exhaustion attack exploiting metadata such as message length or decoding matrix size, which must be stored along with the payload. We show simple errors of implementation lead to leakage of key information and powerful inference attacks; furthermore, complete absence of information leakage seems difficult to avoid. This topic has been somewhat neglected in the literature for the last ten years, but must be considered in real-world implementations.

The Steganographer is the Outlier: Realistic Large-Scale Steganalysis

Autoři: Ker, Andrew D., doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: IEEE Transactions on Information Forensics and Security. 2014, 9(9), 1424-1435. ISSN 1556-6013.
Rok: 2014

DOI: 10.1109/TIFS.2014.2336380
Odkaz: https://doi.org/10.1109/TIFS.2014.2336380
Pracoviště: Katedra počítačů
Anotace:
We present a method for a completely new kind of steganalysis to determine who, out of a large number of actors each transmitting a large number of objects, is hiding payload inside some of them. It has significant challenges, including unknown embedding parameters and natural deviation between innocent cover sources, which are usually avoided in steganalysis tested under laboratory conditions. Our method uses standard steganalysis features, the maximum mean discrepancy measure of distance, and ranks the actors by their degree of deviation from the rest: we show that it works reliably, completely unsupervised, when tested against some of the standard steganography methods available to nonexperts. We also determine good parameters for the detector and show that it creates a two-player game between the guilty actor and the steganalyst.
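The ranking idea in this abstract can be sketched as follows, under loud assumptions: each actor is represented by a small set of toy two-dimensional feature vectors, distances between actors use a biased sample estimate of squared maximum mean discrepancy (MMD) with a Gaussian kernel, and actors are ranked by their average distance to all others. The averaging step, the kernel width, and the actor names are stand-ins chosen for illustration; the abstract does not specify how the deviation ranking is computed.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased sample estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, gamma) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


def rank_actors(actors, gamma=1.0):
    # Score each actor by its average MMD^2 to all other actors; most deviating first.
    scores = {}
    for name, feats in actors.items():
        others = [f for other, f in actors.items() if other != name]
        scores[name] = sum(mmd_squared(feats, o, gamma) for o in others) / len(others)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical actors, each emitting three objects described by two features.
actors = {
    "alice": [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0)],
    "bob": [(0.1, 0.1), (0.0, 0.1), (0.1, 0.0)],
    "carol": [(0.0, 0.0), (0.1, 0.1), (0.0, 0.1)],
    "dave": [(1.0, 1.1), (1.1, 1.0), (1.0, 1.0)],  # deviates from the rest
}
print(rank_actors(actors)[0])
```

No labels are used anywhere in the sketch, which matches the unsupervised setting the abstract describes.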

Anomaly detection by bagging

Autoři: doc. Ing. Tomáš Pevný, Ph.D.,
Publikace: Solving Complex Machine Learning Problems with Ensemble Methods. 2013, pp. 25-40. Available from: http://ama.imag.fr/COPEM/copem2013_proceedings.pdf
Rok: 2013

Pracoviště: Katedra počítačů
Anotace:
Many contemporary domains, e.g. network intrusion detection, fraud detection, etc., call for an anomaly detector processing a continuous stream of data. This need is driven by the high rate of their acquisition, by limited resources for storing them, or by privacy issues. The data can be also non-stationary requiring the detector to continuously adapt to their change. A good detector for these domains should therefore have a low training and classification complexity, on-line training algorithm, and, of course, a good detection accuracy. This paper proposes a detector trying to meet all these criteria. The detector consists of multiple weak detectors, each implemented as a one dimensional histogram. The one-dimensional histogram was chosen because it can be efficiently created on-line, and probability estimates can be efficiently retrieved from it. This construction gives the detector linear complexity of training and classification with respect to the input dimension, number of samples, and number of weak detectors. The accuracy of the detector is compared to seven anomaly detectors from the prior art on the range of 36 classification problems from UCI database. Results show that despite detector's simplicity, its accuracy is competitive to that of more complex detectors with a substantially higher computational complexity.
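The construction described above, multiple weak detectors each implemented as a one-dimensional histogram that can be updated on-line, can be sketched as below. The sketch assumes one equal-width histogram per input feature, a fixed value range, Laplace smoothing, and an average negative log-probability as the anomaly score; these choices are illustrative assumptions, since the abstract does not say how each histogram's one-dimensional input is obtained.

```python
import math


class Histogram1D:
    """Equal-width histogram with Laplace smoothing over a fixed range."""

    def __init__(self, low, high, bins=10):
        self.low, self.high, self.bins = low, high, bins
        self.counts = [0] * bins
        self.total = 0

    def _bin(self, x):
        # Clamp values outside [low, high] into the first or last bin.
        idx = int((x - self.low) / (self.high - self.low) * self.bins)
        return min(max(idx, 0), self.bins - 1)

    def update(self, x):
        # On-line update: one counter increment per sample.
        self.counts[self._bin(x)] += 1
        self.total += 1

    def prob(self, x):
        return (self.counts[self._bin(x)] + 1) / (self.total + self.bins)


def anomaly_score(histograms, sample):
    # Average negative log-probability across the weak detectors.
    logs = [math.log(h.prob(v)) for h, v in zip(histograms, sample)]
    return -sum(logs) / len(histograms)


histograms = [Histogram1D(0.0, 1.0), Histogram1D(0.0, 1.0)]
for i in range(100):
    value = 0.4 + 0.002 * i  # training data concentrated in [0.4, 0.6)
    for h in histograms:
        h.update(value)

normal = anomaly_score(histograms, (0.5, 0.5))
outlier = anomaly_score(histograms, (0.95, 0.05))
print(normal < outlier)  # the sample far from the training data scores higher
```

Both the update and the scoring touch one counter per histogram, so the cost per sample grows linearly with the number of weak detectors, in line with the complexity claim in the abstract.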

Attacking the IDS learning processes

Autoři: doc. Ing. Tomáš Pevný, Ph.D., Rehák, M., Komon, Martin
Publikace: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. Piscataway: IEEE, 2013. pp. 8687-8691. ISSN 1520-6149. ISBN 9781479903566.
Rok: 2013

DOI: 10.1109/ICASSP.2013.6639362
Odkaz: https://doi.org/10.1109/ICASSP.2013.6639362
Pracoviště: Katedra počítačů
Anotace:
We study the problem of directed attacks on the learning process of an anomaly-based Intrusion Detection System (IDS). We assume that the attack is performed by a knowledgeable attacker with access to the system's inputs, outputs, and all internal states. The attacker uses his knowledge of the IDS (implemented as an ensemble of anomaly detection algorithms) and its internal states to design the strongest undetectable attack of a particular type. We have experimented with different attacks against several anomaly detection algorithms individually, and against their combination. We show that while the individual anomaly detection algorithms can be easily avoided by the worst-case attacker that we assume, it is nearly impossible to avoid them simultaneously. These results were achieved during experiments performed on university network traffic and are consistent with theoretical hypotheses grounded in steganalysis and watermarking.

Moving Steganography and Steganalysis from the Laboratory into the Real World

Authors: Ker, Andrew, Bas, Patrick, Bohme, Rainer, Cogranne, Remi, Craver, Scott, Filler, Tomas, Fridrich, Jessica, doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of the first ACM workshop on Information hiding and multimedia security. New York: ACM Press, 2013. pp. 45-58. ISBN 978-1-4503-2081-8.
Year: 2013

DOI: 10.1145/2482513.2482965
Link: https://doi.org/10.1145/2482513.2482965
Affiliation: Department of Computer Science
Abstract:
There has been an explosion of academic literature on steganography and steganalysis in the past two decades. With a few exceptions, such papers address abstractions of the hiding and detection problems, which arguably have become disconnected from the real world. Most published results, including by the authors of this paper, apply "in laboratory conditions" and some are heavily hedged by assumptions and caveats; significant challenges remain unsolved in order to implement good steganography and steganalysis in practice. This position paper sets out some of the important questions which have been left unanswered, as well as highlighting some that have already been addressed successfully, for steganography and steganalysis to be used in the real world.

The Challenges of Rich Features in Universal Steganalysis

Authors: doc. Ing. Tomáš Pevný, Ph.D., Ker, Andrew D.
Publication: Media Watermarking, Security, and Forensics 2013. Washington: SPIE, 2013. ISSN 0277-786X. ISBN 9780819494382.
Year: 2013

DOI: 10.1117/12.2006790
Link: https://doi.org/10.1117/12.2006790
Affiliation: Department of Computer Science
Abstract:
Contemporary steganalysis is driven by new steganographic rich feature sets, which consist of large numbers of weak features. Although extremely powerful when applied to supervised classification problems, they are not compatible with unsupervised universal steganalysis, because the unsupervised method cannot separate the signal (evidence of steganographic embedding) from the noise (cover content). This work tries to alleviate the problem, by means of feature extraction algorithms. We focus on linear projections informed by embedding methods, and propose a new method which we call calibrated least squares with the specific aim of making the projections sensitive to stego content yet insensitive to cover variation. Different projections are evaluated by their application to the anomaly detector from Ref. 1, and we are able to retain both the universality and the robustness of the method, while increasing its performance substantially.

Batch steganography in the real world

Authors: Ker, Andrew, doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of ACM Workshop on Multimedia and Security. New York: ACM Press, 2012, pp. 1-10. ISBN 978-1-4503-1417-6.
Year: 2012

DOI: 10.1145/2361407.2361409
Link: https://doi.org/10.1145/2361407.2361409
Affiliation: Department of Computer Science
Abstract:
We examine the universal pooled steganalyzer in two respects. First, we confirm that the method is applicable to a number of different steganographic embedding methods. Second, we consider the converse problem of how to spread payload between multiple covers, by testing different payload allocation strategies against the universal steganalyzer. We focus on practical options which can be implemented without new software or expert knowledge, and we test on real-world data. Concentration of payload into the minimal number of covers is consistently the least detectable option. We present additional investigations which explain this phenomenon, uncovering a nonlinear relationship between embedding distortion and payload. We conjecture that this is an unavoidable consequence of blind steganalysis. This is significant for both batch steganography and pooled steganalysis.

Co-occurrence steganalysis in high dimensions

Authors: doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of SPIE, Volume 8303. Pennsylvania State University. University Park: SPIE - International Society for Optical Engineering, 2012. ISSN 0277-786X. ISBN 9780819489500.
Year: 2012

DOI: 10.1117/12.908914
Link: https://doi.org/10.1117/12.908914
Affiliation: Department of Computer Science
Abstract:
The state-of-the-art steganalytic features for the spatial domain, and to some extent for transform domains (DCT) as well, are based on histograms of co-occurrences of neighboring elements. The rationale is that neighboring pixels in digital images are correlated, which is caused by the smoothness of our world and by the usual image processing. The limitation of the histogram-based features is that they do not scale well with respect to the number of modeled neighboring elements, since the number of histogram bins (and hence the number of features) depends exponentially on this quantity. The remedy adopted by the prior art is to sum the values of neighboring bins together, which can be seen as a vector quantization controlled by the position of the quantization centers. So far, the quantization centers have been determined manually by the steganalyst. Here we propose to use the Linde-Buzo-Gray algorithm to automatically find quantization centers maximizing the detection accuracy of the resulting features. The quantization centers found by the proposed algorithm are experimentally compared to the ones used by the prior art in the steganalysis of the HUGO algorithm. The results show a non-negligible improvement in accuracy, especially when more complicated filters and higher-order histograms are used.

Detecting anomalous network hosts by means of PCA

Authors: doc. Ing. Tomáš Pevný, Ph.D., Grill, M., Rehák, M.
Publication: Proceedings of IEEE Workshop on Information Forensics and Security. Piscataway: IEEE, 2012, pp. 103-106. ISBN 978-1-4673-2285-0.
Year: 2012

Affiliation: Department of Computer Science
Abstract:
This paper focuses on the identification of anomalous hosts within a computer network, with the motivation to detect attacks and/or other unwanted and suspicious traffic. The proposed detection method does not use the content of packets, which enables it to be used on encrypted networks. Moreover, the method has very low computational complexity, allowing fast detection and response, which is important for limiting potential damage. The proposed method uses entropies of IP addresses and ports to build two complementary models of a host's traffic based on principal component analysis. These two models are coupled with two orthogonal anomaly definitions, which gives four different detectors. The methods are evaluated and compared to prior art on a one-week-long capture of traffic on a university network. The experiments reveal that no single detector can detect all types of anomalies, which is expected and stresses the importance of an ensemble approach to intrusion detection.

From Blind to Quantitative Steganalysis

Authors: doc. Ing. Tomáš Pevný, Ph.D., Fridrich, Jessica, Ker, Andrew
Publication: IEEE Transactions on Information Forensics and Security. 2012, 7(2), 445-454. ISSN 1556-6013.
Year: 2012

DOI: 10.1109/TIFS.2011.2175918
Link: https://doi.org/10.1109/TIFS.2011.2175918
Affiliation: Department of Computer Science
Abstract:
A quantitative steganalyzer is an estimator of the number of embedding changes introduced by a specific embedding operation. Since for most algorithms the number of embedding changes correlates with the message length, quantitative steganalyzers are important forensic tools. In this paper, a general method for constructing quantitative steganalyzers from features used in blind detectors is proposed. The core of the method is a support vector regression, which is used to learn the mapping between a feature vector extracted from the investigated object and the embedding change rate. To demonstrate the generality of the proposed approach, quantitative steganalyzers are constructed for a variety of steganographic algorithms in both JPEG transform and spatial domains. The estimation accuracy is investigated in detail and compares favorably with state-of-the-art quantitative steganalyzers.

Identifying a steganographer in realistic and heterogeneous data sets

Authors: Ker, Andrew, doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of SPIE, Volume 8303. Pennsylvania State University. University Park: SPIE - International Society for Optical Engineering, 2012. ISSN 0277-786X. ISBN 9780819489500.
Year: 2012

DOI: 10.1117/12.910565
Link: https://doi.org/10.1117/12.910565
Affiliation: Department of Computer Science
Abstract:
We consider the problem of universal pooled steganalysis, in which we aim to identify a steganographer who sends many images (some of them innocent) in a network of many other innocent users. The detector must deal with multiple users and multiple images per user, and particularly with the differences between cover sources used by different users. Although it was posed five years ago, this problem has previously been addressed only by our 2011 paper. We extend our prior work in two ways. First, we present experiments in a new, highly realistic domain: up to 4000 actors each transmitting up to 200 images, using real-world data downloaded from a social networking site. Second, we replace hierarchical clustering with the method called local outlier factor (LOF), giving greater accuracy of detection and allowing a guilty actor sending moderate payloads to be detected, even amongst thousands of other actors sending hundreds of thousands of images.

"Break Our Steganographic System" --- the ins and outs of organizing BOSS

Authors: Bas, P., Filler, T., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of the 13th international conference on Information hiding. Heidelberg: Springer, 2011, pp. 59-70. Lecture notes in computer science. ISBN 978-3-642-24177-2.
Year: 2011

Affiliation: Department of Computer Science
Abstract:
This paper summarizes the first international challenge on steganalysis, called BOSS. We explain the motivations behind the organization of the contest, its rules together with the reasons for them, and the steganographic algorithm developed for the contest. Since the image databases created for the contest significantly influenced its development, they are described in great detail. The paper also presents a detailed analysis of the results submitted to the challenge. One of the main difficulties of the contest was the discrepancy between the training and testing sources of images -- the so-called cover-source mismatch, which forced the participants to design steganalyzers robust w.r.t. a specific source of images. We also point to other practical issues related to designing steganographic systems and give several suggestions for future contests in steganalysis.

A New Paradigm for Steganalysis via Clustering

Authors: Ker, A.D., doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of SPIE Volume: 7880. Bellingham: SPIE, 2011. p. 78800U-78813U. ISSN 0277-786X. ISBN 978-0-8194-8417-8.
Year: 2011

Affiliation: Department of Computer Science
Abstract:
We propose a new paradigm for blind, universal, steganalysis in the case when multiple actors transmit multiple objects, with guilty actors including some stego objects in their transmissions. The method is based on clustering rather than classification, and it is the actors which are clustered rather than their individual transmitted objects. This removes the need for training a classifier, and the danger of training model mismatch. It effectively judges the behaviour of actors by assuming that most of them are innocent: after performing agglomerative hierarchical clustering, the guilty actor(s) are clustered separately from the innocent majority. A case study shows that this works in the case of JPEG images. Although it is less sensitive than steganalysis based on specifically-trained classifiers, it requires no training, no knowledge of the embedding algorithm, and attacks the pooled steganalysis problem.

Detecting messages of unknown length

Authors: doc. Ing. Tomáš Pevný, Ph.D.
Publication: Proceedings of SPIE Volume: 7880. Bellingham: SPIE, 2011. pp. 78800T-78812T. ISSN 0277-786X. ISBN 978-0-8194-8417-8.
Year: 2011

Affiliation: Department of Computer Science
Abstract:
This work focuses on the problem of developing a blind steganalyzer (a steganalyzer relying on a machine learning algorithm and steganalytic features) for detecting stego images with different payloads. This problem is highly relevant for practical forensic analysis, since in practice the knowledge about the steganographic channel is very limited, and the length of the hidden message is generally unknown. This paper demonstrates that the discrepancy between the payload in training and testing / application images can significantly decrease the accuracy of the steganalysis. Two fundamentally different approaches to mitigate this problem are then proposed. The first solution relies on a quantitative steganalyzer. The second solution transforms the one-sided hypothesis test (unknown message length) into a simple hypothesis test by assuming a probability distribution on the length of messages, which can be efficiently solved by many machine-learning tools, e.g. by Support Vector Machines. The experimental section of the paper (a) compares both solutions on steganalysis of the F5 algorithm with shrinkage removed by wet paper codes for JPEG images and LSB matching for raw (uncompressed) images, (b) investigates the effect of the assumed distribution of the message length on the accuracy of the steganalyzer, and (c) shows how the accuracy of steganalysis depends on Eve's knowledge about details of the steganographic channel.

Modern Steganalysis Can Detect YASS

Authors: Kodovský, J., doc. Ing. Tomáš Pevný, Ph.D., Fridrich, J.
Publication: Media Forensics and Security II. Washington: SPIE, 2010. p. 1-11. ISSN 0277-786X. ISBN 978-0-8194-7934-1.
Year: 2010

DOI: 10.1117/12.838768
Link: https://doi.org/10.1117/12.838768
Affiliation: Department of Cybernetics
Abstract:
YASS is a steganographic algorithm for digital images that hides messages robustly in a key-dependent transform domain so that the stego image can be subsequently compressed and distributed as JPEG. Given that the state-of-the-art blind steganalysis methods of 2007, when YASS was proposed, were unable to reliably detect YASS, in this paper we steganalyze YASS using several recently proposed general-purpose steganalysis feature sets. The focus is on blind attacks that do not capitalize on any weakness of a specific implementation of the embedding algorithm. We demonstrate experimentally that twelve different settings of YASS can be reliably detected even for small embedding rates and in small images. Since none of the steganalysis feature sets is in any way targeted to the embedding of YASS, future modifications of YASS will likely be detectable by them as well.

Steganalysis by Subtractive Pixel Adjacency Matrix

Authors: doc. Ing. Tomáš Pevný, Ph.D., Bas, P., Fridrich, J.
Publication: IEEE Transactions on Information Forensics and Security. 2010, 5(2), 215-224. ISSN 1556-6013.
Year: 2010

DOI: 10.1109/TIFS.2010.2045842
Link: https://doi.org/10.1109/TIFS.2010.2045842
Affiliation: Department of Cybernetics
Abstract:
This paper presents a method for detection of steganographic methods that embed in the spatial domain by adding a low-amplitude independent stego signal, an example of which is least significant bit (LSB) matching. First, arguments are provided for modeling the differences between adjacent pixels using first-order and second-order Markov chains. Subsets of sample transition probability matrices are then used as features for a steganalyzer implemented by support vector machines. The major part of experiments, performed on four diverse image databases, focuses on evaluation of detection of LSB matching. The comparison to prior art reveals that the presented feature set offers superior accuracy in detecting LSB matching. Even though the feature set was developed specifically for spatial domain steganalysis, by constructing steganalyzers for ten algorithms for JPEG images, it is demonstrated that the features detect steganography in the transform domain as well.

Using High-Dimensional Image Models to Perform Highly Undetectable Steganography

Authors: doc. Ing. Tomáš Pevný, Ph.D., Filler, T., Bas, P.
Publication: Information Hiding, Lecture Notes in Computer Science. Berlin: Springer, 2010. p. 161-177. ISSN 0302-9743. ISBN 978-3-642-16434-7.
Year: 2010

DOI: 10.1007/978-3-642-16435-4_13
Link: https://doi.org/10.1007/978-3-642-16435-4_13
Affiliation: Department of Cybernetics
Abstract:
This paper presents a complete methodology for designing practical and highly undetectable stegosystems for real digital media. The main design principle is to minimize a suitably defined distortion by means of an efficient coding algorithm. The distortion is defined as a weighted difference of extended state-of-the-art feature vectors already used in steganalysis. This allows us to "preserve" the model used by the steganalyst and thus be undetectable even for large payloads. This framework can be efficiently implemented even when the dimensionality of the feature set used by the embedder is larger than 10^{7}. The high-dimensional model is necessary to avoid known security weaknesses. Although high-dimensional models might be a problem in steganalysis, we explain why they are acceptable in steganography. As an example, we introduce HUGO, a new embedding algorithm for spatial-domain digital images, and we contrast its performance with LSB matching.

All publications

Bias Detection via Maximum Subgroup Discrepancy

Generating Likely Counterfactuals Using Sum-Product Networks

State Encodings for GNN-Based Lifted Planners

Classification with Costly Features in Hierarchical Deep Sets

Deep anomaly detection on set data: Survey and comparison

GraphSPNs: Sum-Product Networks Benefit From Canonical Orderings

Malicious Internet Entity Detection Using Local Graph Inference

NASimEmu: Network Attack Simulator & Emulator for Training Agents Generalizing to Novel Scenarios

On the Economics of Adversarial Machine Learning

Sum-Product-Set Networks: Deep Tractable Models for Tree-Structured Graphs

Heuristic Search Optimisation Using Planning and Curriculum Learning Techniques

Leveraging Data Geometry to Mitigate CSM in Steganalysis

Optimize Planning Heuristics to Rank, not to Estimate Cost-to-Goal.

Sum-Product-Set Networks

The Non-Zero-Sum Game of Steganography in Heterogeneous Environments

Backpack: A Backpropagable Adversarial Embedding Scheme

Comparison of Anomaly Detectors: Context Matters

Formalizing Cover-source Mismatch as a Robust Optimization

General framework for binary classification on top samples

JsonGrinder.jl: Automated Differentiable Neural Architecture for Embedding Arbitrary JSON Data

Reducing the cost of fitting mixture models via stochastic sampling

Semi-supervised deep networks for plasma state identification

Using Set Covering to Generate Databases for Holistic Steganalysis

Explicit Optimization of min max Steganographic Game

When Should You Defend Your Classifier? A Game-Theoretical Analysis of Countermeasures Against Adversarial Examples

Anomaly explanation with random forests

Classification with Costly Features as a Sequential Decision-making Problem

Detection of Alfven Eigenmodes on COMPASS with Generative Neural Networks

Loss Functions for Clustering in Multi-instance Learning

Neural Power Units

Sum-Product-Transform Networks: Exploiting Symmetries using Invertible Transformations

Classification with Costly Features Using Deep Reinforcement Learning

Exploiting Adversarial Embeddings for Better Steganography

Joint Detection of Malicious Domains and Infected Clients

Orthogonal Approximation of Marginal Likelihood of Generative Models

Rodent: Relevance determination in ODE

Exploring Non-Additive Distortion in Steganography

Multiple instance learning for malware classification

Network traffic fingerprinting based on approximated kernel two-sample test

Probabilistic analysis of dynamic malware traces

Malware Detection by Analysing Encrypted Network Traffic with Neural Networks

Optimal Strategies for Detecting Data Exfiltration by Internal and External Attackers

Reducing False Positives of Network Anomaly Detection by Local Adaptive Multivariate Smoothing

Using Neural Network Formalism to Solve Multiple-Instance Problems

Discriminative Models for Multi-instance Problems with Tree Structure

Feature Extraction and Malware Detection on Large HTTPS Data Using MapReduce

k-NN Classification of Malware in HTTPS Traffic Using the Metric Space Approach

Learning Combination of Anomaly Detectors for Security Domain

Loda: Lightweight on-line detector of anomalies

Malicons: Detecting Payload in Favicons

Passive NAT Detection Using HTTP Access Logs

Rethinking Optimal Embedding

Using Behavioral Similarity for Botnet Command-and-Control Discovery

Automatic Discovery of Web Servers Hosting Similar Applications

Finding New Malicious Domains Using Variational Bayes on Large-Scale Computer Network Data

Is Ensemble Classifier Needed for Steganalysis in High-Dimensional Feature Spaces?

Optimizing pooling function for pooled steganalysis

Towards dependable steganalysis

Towards Scalable Network Host Simulation

Unsupervised Detection of Malware in Persistent Web Traffic

A Memory Efficient Privacy Preserving Representation of Connection Graphs

A mishmash of methods for mitigating the model mismatch mess

Explaining Anomalies with Sampling Random Forests

Explaining Anomalies with Sapling Random Forests

Interpreting and clustering outliers with sapling random forests

Interpreting and clustering outliers with sapling random forests

Randomized Operating Point Selection in Adversarial Classification

Steganographic key leakage through payload metadata

The Steganographer is the Outlier: Realistic Large-Scale Steganalysis

Anomaly detection by bagging

Attacking the IDS learning processes

Moving Steganography and Steganalysis from the Laboratory into the Real World

The Challenges of Rich Features in Universal Steganalysis

Batch steganography in the real world

Co-occurrence steganalysis in high dimensions

Detecting anomalous network hosts by means of PCA

From Blind to Quantitative Steganalysis

Identifying a steganographer in realistic and heterogeneous data sets