People

doc. Ing. Tomáš Pevný, Ph.D.

All publications

NASimEmu: Network Attack Simulator & Emulator for Training Agents Generalizing to Novel Scenarios

  • Department: Artificial Intelligence Center
  • Abstract:
    Current frameworks for training offensive penetration testing agents with deep reinforcement learning struggle to produce agents that perform well in real-world scenarios, due to the reality gap in simulation-based frameworks and the lack of scalability in emulation-based frameworks. Additionally, existing frameworks often use an unrealistic metric that measures the agents' performance on the training data. NASimEmu, a new framework introduced in this paper, addresses these issues by providing both a simulator and an emulator with a shared interface. This approach allows agents to be trained in simulation and deployed in the emulator, thus verifying the realism of the used abstraction. Our framework promotes the development of general agents that can transfer to novel scenarios unseen during their training. For the simulation part, we adopt an existing simulator NASim and enhance its realism. The emulator is implemented with industry-level tools, such as Vagrant, VirtualBox, and Metasploit. Experiments demonstrate that a simulation-trained agent can be deployed in emulation, and we show how to use the framework to train a general agent that transfers into novel, structurally different scenarios. NASimEmu is available as open-source.

Sum-Product-Set Networks: Deep Tractable Models for Tree-Structured Graphs

  • Department: Artificial Intelligence Center
  • Abstract:
    Daily internet communication relies heavily on tree-structured graphs, embodied by popular data formats such as XML and JSON. However, many recent generative (probabilistic) models utilize neural networks to learn a probability distribution over undirected cyclic graphs. This assumption of a generic graph structure brings various computational challenges, and, more importantly, the presence of non-linearities in neural networks does not permit tractable probabilistic inference. We address these problems by proposing sum-product-set networks, an extension of probabilistic circuits from unstructured tensor data to tree-structured graph data. To this end, we use random finite sets to reflect a variable number of nodes and edges in the graph and to allow for exact and efficient inference. We demonstrate that our tractable model performs comparably to various intractable models based on neural networks.

Backpack: A Backpropagable Adversarial Embedding Scheme

  • Authors: Bernard, S., Bas, P., Klein, J., doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: IEEE Transactions on Information Forensics and Security. 2022, 17, 3539-3554. ISSN 1556-6013.
  • Year: 2022
  • DOI: 10.1109/TIFS.2022.3204218
  • Link: https://doi.org/10.1109/TIFS.2022.3204218
  • Department: Artificial Intelligence Center
  • Abstract:
    A minmax protocol offers a general method to automatically optimize a steganographic algorithm against a wide class of steganalytic detectors. The quality of the resulting steganographic algorithm depends on the ability to find an 'adversarial' stego image undetectable by a set of detectors while communicating a given message. Although the minmax protocol instantiated with the ADV-EMB scheme leads to unexpectedly good results, we show that it suffers from a significant flaw, and we present a theoretically sound solution called Backpack. Extensive experimental verification of the minmax protocol with Backpack shows superior performance to ADV-EMB, demonstrates the generality of the tool by targeting a new JPEG QF100 compatibility attack, and further improves the security of steganographic algorithms.

Comparison of Anomaly Detectors: Context Matters

  • DOI: 10.1109/TNNLS.2021.3116269
  • Link: https://doi.org/10.1109/TNNLS.2021.3116269
  • Department: Department of Computer Science, Artificial Intelligence Center
  • Abstract:
    Deep generative models are challenging the classical methods in the field of anomaly detection nowadays. Every newly published method provides evidence of outperforming its predecessors, sometimes with contradictory results. The objective of this article is twofold: to compare anomaly detection methods of various paradigms with a focus on deep generative models and identification of sources of variability that can yield different results. The methods were compared on popular tabular and image datasets. We identified that the main sources of variability are the experimental conditions: 1) the type of dataset (tabular or image) and the nature of anomalies (statistical or semantic) and 2) strategy of selection of hyperparameters, especially the number of available anomalies in the validation set. Methods perform differently in different contexts, i.e., under a different combination of experimental conditions together with computational time. This explains the variability of the previous results and highlights the importance of careful specification of the context in the publication of a new method. All our code and results are available for download.

Formalizing Cover-source Mismatch as a Robust Optimization

  • Authors: Šepák, D., Adam, L., doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: Proceedings of European Signal Processing Conference. Belgrade: EUSIPCO, 2022. p. 1042-1046. ISBN 978-1-6654-6798-8.
  • Year: 2022
  • Department: Artificial Intelligence Center
  • Abstract:
    Cover-source mismatch (CSM) refers to the use of a steganographic detector on images whose probability distribution differs substantially from the one it has been trained on. This can have a detrimental effect on its accuracy, preventing the use of modern steganalytic tools outside laboratories. Despite CSM being introduced almost fifteen years ago, there is no formal definition and no adopted measures for comparing different solutions. This work, therefore, formalizes the cover-source mismatch and proposes and discusses possible error measures. Equipped with these tools, we propose a principled approach to train holistic detectors while minimizing the effects of CSM and experimentally compare them to the prior art, discussing their strengths and weaknesses.

General framework for binary classification on top samples

  • DOI: 10.1080/10556788.2021.1965601
  • Link: https://doi.org/10.1080/10556788.2021.1965601
  • Department: Artificial Intelligence Center
  • Abstract:
    Many binary classification problems minimize misclassification above (or below) a threshold. We show that instances of ranking problems, accuracy at the top, or hypothesis testing may be written in this form. We propose a general framework to handle these classes of problems and show which formulations (both known and newly proposed) fall into this framework. We provide a theoretical analysis of this framework and mention selected possible pitfalls the formulations may encounter. We show the convergence of the stochastic gradient descent for selected formulations even though the gradient estimate is inherently biased. We suggest several numerical improvements, including the implicit derivative and stochastic gradient descent. We provide an extensive numerical study.

JsonGrinder.jl: Automated Differentiable Neural Architecture for Embedding Arbitrary JSON Data

  • Department: Artificial Intelligence Center
  • Abstract:
    Standard machine learning (ML) problems are formulated on data converted into a suitable tensor representation. However, there are data sources, for example in cybersecurity, that are naturally represented in a unifying hierarchical structure, such as XML, JSON, and Protocol Buffers. Converting this data to a tensor representation is usually done by manual feature engineering, which is laborious, lossy, and prone to bias originating from the human inability to correctly judge the importance of particular features. JsonGrinder.jl is a library automating various ML tasks on these difficult sources. Starting with an arbitrary set of JSON samples, it automatically creates a differentiable ML model (called hmilnet), which embeds raw JSON samples into a fixed-size tensor representation. This embedding network can be naturally extended by an arbitrary ML model expecting tensor inputs in order to perform classification, regression, or clustering.

Reducing the cost of fitting mixture models via stochastic sampling

  • Department: Artificial Intelligence Center
  • Abstract:
    Traditional methods for unsupervised learning of finite mixture models require evaluating the likelihood of all components of the mixture. This quickly becomes prohibitive when the components are abundant or expensive to compute. Therefore, we propose to apply a combination of the expectation maximization and the Metropolis-Hastings algorithm to evaluate only a small number of stochastically sampled components, thus substantially reducing the computational cost. The Markov chain of component assignments is sequentially generated across the algorithm's iterations, having a non-stationary target distribution whose parameters vary via a gradient-descent scheme. We put emphasis on the generality of our method, equipping it with the ability to train mixture models which involve complex, and possibly nonlinear, transformations. The performance of our method is illustrated on mixtures of normalizing flows.

Semi-supervised deep networks for plasma state identification

  • DOI: 10.1088/1361-6587/ac9926
  • Link: https://doi.org/10.1088/1361-6587/ac9926
  • Department: Department of Computer Science, Artificial Intelligence Center
  • Abstract:
    Correct and timely detection of plasma confinement regimes and edge localized modes (ELMs) is important for improving the operation of tokamaks. Existing machine learning approaches detect these regimes as a form of post-processing of experimental data. Moreover, they are typically trained on a large dataset of tens of labeled discharges, which may be costly to build. We investigate the ability of current machine learning approaches to detect the confinement regime and ELMs with the smallest possible delay after the latest measurement. We also demonstrate that including unlabeled data into the training process can improve the results in a situation where only a limited set of reliable labels is available. All training and validation is performed on data from the COMPASS tokamak. The InceptionTime architecture trained using a semi-supervised approach was found to be the most accurate method based on the set of tested variants. It is able to achieve good overall accuracy of the regime classification at the time instant of 100 μs delayed behind the latest data record. We also evaluate the capability of the model to correctly predict class transitions. While ELM occurrence can be detected with a tolerance smaller than 50 μs, detection of the confinement regime transition is more demanding and it was successful with 2 ms tolerance. Sensitivity studies to different values of model parameters are provided. We believe that the achieved accuracy is acceptable in practice and the method could be used in real-time operation.

Using Set Covering to Generate Databases for Holistic Steganalysis

  • Authors: Abecidan, R., Itier, V., Boulanger, J., Bas, P., doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: IEEE International Workshop on Information Forensics and Security. Institute of Electrical and Electronics Engineers, Inc., 2022. ISSN 2157-4766. ISBN 979-8-3503-0967-6.
  • Year: 2022
  • DOI: 10.1109/WIFS55849.2022.9975430
  • Link: https://doi.org/10.1109/WIFS55849.2022.9975430
  • Department: Artificial Intelligence Center
  • Abstract:
    Within an operational framework, covers used by a steganographer are likely to come from different sensors and different processing pipelines than the ones used by researchers for training their steganalysis models. Thus, a performance gap is unavoidable when it comes to out-of-distribution covers, an extremely frequent scenario called Cover Source Mismatch (CSM). Here, we explore a grid of processing pipelines to study the origins of CSM, to better understand it, and to better tackle it. A set-covering greedy algorithm is used to select representative pipelines minimizing the maximum regret between the representative and the pipelines within the set. Our main contribution is a methodology for generating relevant bases able to tackle operational CSM. Experimental validation highlights that, for a given number of training samples, our set covering selection is a better strategy than selecting random pipelines or using all the available pipelines. Our analysis also shows that parameters such as denoising, sharpening, and downsampling are very important to foster diversity. Finally, different benchmarks for classical and wild databases show the good generalization property of the extracted databases.
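The selection step mentioned in the abstract above can be illustrated with a generic greedy set-covering routine. This is only a minimal sketch under stated assumptions, not the authors' implementation: the regret matrix, the threshold `max_regret`, and the function name `greedy_cover` are hypothetical, and a pipeline i is assumed to "cover" a pipeline j when a detector trained on i incurs a regret of at most `max_regret` on j.

```python
def greedy_cover(regret, max_regret):
    """Greedily pick representative pipelines (illustrative only).

    regret[i][j] is a hypothetical regret of a detector trained on
    pipeline i and tested on pipeline j. Pipeline i covers pipeline j
    when regret[i][j] <= max_regret.
    """
    n = len(regret)
    covers = [
        {j for j in range(n) if regret[i][j] <= max_regret}
        for i in range(n)
    ]
    uncovered = set(range(n))
    chosen = []
    while uncovered:
        # Take the pipeline that covers the most still-uncovered pipelines.
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            raise ValueError("some pipelines cannot be covered at this threshold")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Toy, made-up regret values for four pipelines.
REGRET = [
    [0.00, 0.02, 0.30, 0.25],
    [0.03, 0.00, 0.28, 0.40],
    [0.35, 0.30, 0.00, 0.04],
    [0.20, 0.45, 0.05, 0.00],
]
representatives = greedy_cover(REGRET, max_regret=0.05)
print(representatives)
```

In this toy example, pipelines 0 and 1 cover each other, as do pipelines 2 and 3, so two representatives suffice. Lowering `max_regret` tightens the worst-case regret at the price of more representatives.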

Explicit Optimization of min max Steganographic Game

  • Authors: Bernard, S., Bas, P., Klein, J., doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: IEEE Transactions on Information Forensics and Security. 2021, 2020(16), 812-823. ISSN 1556-6013.
  • Year: 2021
  • DOI: 10.1109/TIFS.2020.3021913
  • Link: https://doi.org/10.1109/TIFS.2020.3021913
  • Department: Artificial Intelligence Center
  • Abstract:
    This article proposes an algorithm which allows Alice to simulate the game played between her and Eve. Under the condition that the set of detectors that Alice assumes Eve to have is sufficiently rich (e.g., CNNs), and that she has an algorithm that avoids detection by a single classifier (e.g., adversarial embedding, Gibbs sampler, dynamic STCs), the proposed algorithm converges to an efficient steganographic algorithm. This is possible by using a min max strategy which, at each iteration, selects the least detectable stego image for the best classifier among the set of Eve's learned classifiers. The algorithm is extensively evaluated and compared to prior art, and the results show the potential to increase the practical security of classical steganographic methods. For example, the error probability P-err of XU-Net on detecting stego images with payload of 0.4 bpnzAC embedded by J-Uniward and QF 75 starts at 7.1% and is increased by +13.6% to reach 20.7% after eight iterations. For the same embedding rate and for QF 95, undetectability by XU-Net with J-Uniward embedding is 23.4%, and it jumps by +25.8% to reach 49.2% at iteration 3.

When Should You Defend Your Classifier? A Game-Theoretical Analysis of Countermeasures Against Adversarial Examples

  • Authors: Samsinger, M., Merkle, F., Schöttle, P., doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: International Conference on Decision and Game Theory for Security. Basel: Springer Nature Switzerland AG, 2021. p. 158-177. ISSN 0302-9743. ISBN 978-3-030-90369-5.
  • Year: 2021
  • DOI: 10.1007/978-3-030-90370-1_9
  • Link: https://doi.org/10.1007/978-3-030-90370-1_9
  • Department: Artificial Intelligence Center
  • Abstract:
    Adversarial machine learning, i.e., increasing the robustness of machine learning algorithms against so-called adversarial examples, is now an established field. Yet, newly proposed methods are evaluated and compared under unrealistic scenarios where costs for adversary and defender are not considered and either all samples or no samples are adversarially perturbed. We scrutinize these assumptions and propose the advanced adversarial classification game, which incorporates all relevant parameters of an adversary and a defender. In particular, we take into account economic factors on both sides and the fact that all countermeasures against adversarial examples proposed so far reduce accuracy on benign samples. Analyzing in detail the scenario where both players have two pure strategies, we identify all best responses and conclude that in practical settings, the most influential factor might be the maximum amount of adversarial examples.

Anomaly explanation with random forests

  • DOI: 10.1016/j.eswa.2020.113187
  • Link: https://doi.org/10.1016/j.eswa.2020.113187
  • Department: Artificial Intelligence Center
  • Abstract:
    Anomaly detection has become an important topic in many domains with many different solutions proposed until now. Despite that, there are only a few anomaly detection methods trying to explain how the sample differs from the rest. This work contributes to filling this gap because knowing why a sample is considered anomalous is critical in many application domains. The proposed solution uses a specific type of random forests to extract rules explaining the difference, which are then filtered and presented to the user as a set of classification rules sharing the same consequent, or as the equivalent rule with an antecedent in a disjunctive normal form. The quality of that solution is documented by comparison with the state of the art algorithms on 34 real-world datasets.

Classification with Costly Features as a Sequential Decision-making Problem

  • DOI: 10.1007/s10994-020-05874-8
  • Link: https://doi.org/10.1007/s10994-020-05874-8
  • Department: Department of Computer Science, Artificial Intelligence Center
  • Abstract:
    This work focuses on a specific classification problem, where the information about a sample is not readily available, but has to be acquired for a cost, and there is a per-sample budget. Inspired by real-world use-cases, we analyze average and hard variations of a directly specified budget. We postulate the problem in its explicit formulation and then convert it into an equivalent MDP, that can be solved with deep reinforcement learning. Also, we evaluate a real-world inspired setting with sparse training datasets with missing features. The presented method performs robustly well in all settings across several distinct datasets, outperforming other prior-art algorithms. The method is flexible, as showcased with all mentioned modifications and can be improved with any domain independent advancement in RL.

Detection of Alfven Eigenmodes on COMPASS with Generative Neural Networks

  • DOI: 10.1080/15361055.2020.1820805
  • Link: https://doi.org/10.1080/15361055.2020.1820805
  • Department: Artificial Intelligence Center
  • Abstract:
    Chirping Alfvén eigenmodes (AE) were observed at the COMPASS tokamak. They are believed to be driven by runaway electrons (RE) and as such, they provide a unique opportunity to study physics of non-linear interaction between RE and electromagnetic instabilities, including important topics of RE mitigation and losses. On COMPASS, they can be detected from spectrograms of certain magnetic probes. So far, their detection required a lot of manual effort since they occur rarely. We strive to automate this process using machine learning techniques based on generative neural networks. We present two different models that are trained using a smaller, manually labeled database and a larger unlabeled database from COMPASS experiments. On a number of experiments, we demonstrate that our approach is a viable option for automated detection of rare instabilities in tokamak plasma.

Loss Functions for Clustering in Multi-instance Learning

  • Authors: Dědič, M., doc. Ing. Tomáš Pevný, Ph.D., Bajer, L., Holena, M.
  • Publication: Proceedings of the 20th Conference Information Technologies - Applications and Theory (ITAT 2020). Aachen: CEUR Workshop Proceedings, 2020. p. 137-146. ISSN 1613-0073.
  • Year: 2020
  • Department: Artificial Intelligence Center
  • Abstract:
    Multi-instance learning is one of the recently fast-developing areas of machine learning. It is a supervised learning method, and this paper reports research into its unsupervised counterpart, multi-instance clustering. Whereas traditional clustering clusters points, multi-instance clustering clusters bags, i.e., multisets of points or of other kinds of objects. The paper focuses on the problem of loss functions for clustering. Three sophisticated loss functions used for clustering of points, contrastive predictive coding, triplet loss and magnet loss, are elaborated for multi-instance clustering. Finally, they are compared on 18 benchmark datasets, as well as on a real-world dataset.

Neural Power Units

  • Department: Department of Computer Science, Artificial Intelligence Center
  • Abstract:
    Conventional Neural Networks can approximate simple arithmetic operations, but fail to generalize beyond the range of numbers that were seen during training. Neural Arithmetic Units aim to overcome this difficulty, but current arithmetic units are either limited to operate on positive numbers or can only represent a subset of arithmetic operations. We introduce the Neural Power Unit (NPU) that operates on the full domain of real numbers and is capable of learning arbitrary power functions in a single layer. The NPU thus fixes the shortcomings of existing arithmetic units and extends their expressivity. We achieve this by using complex arithmetic without requiring a conversion of the network to complex numbers. A simplification of the unit to the RealNPU yields a highly transparent model. We show that the NPUs outperform their competitors in terms of accuracy and sparsity on artificial arithmetic datasets, and that the RealNPU can discover the governing equations of a dynamical system only from data.
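The remark in the abstract above about using complex arithmetic without complex-valued networks can be illustrated by a standard identity: for a real x < 0, log x = log|x| + iπ, so the real part of x^w is exp(w log|x|) cos(πw). The following is a minimal scalar sketch of that identity only. The function names and the fixed weights are hypothetical, inputs are assumed to be non-zero, and the code is not the NPU layer from the paper, which learns its weights.

```python
import math


def real_power(x, w):
    """Real part of x**w for a non-zero real x, via the complex logarithm."""
    k = 0.0 if x > 0 else 1.0  # k = 1 marks a negative input
    return math.exp(w * math.log(abs(x))) * math.cos(math.pi * w * k)


def real_power_unit(xs, weights):
    """Real part of the product of x_i ** w_i over all inputs."""
    magnitude = sum(w * math.log(abs(x)) for x, w in zip(xs, weights))
    phase = sum(math.pi * w * (0.0 if x > 0 else 1.0) for x, w in zip(xs, weights))
    return math.exp(magnitude) * math.cos(phase)


cube = real_power(-2.0, 3.0)                         # (-2)**3
ratio = real_power_unit([-2.0, 3.0], [3.0, -1.0])    # (-2)**3 / 3
print(cube, ratio)
```

For integer weights the cosine factor is exactly ±1 up to rounding, so the result matches ordinary multiplication and division, including for negative inputs.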

Sum-Product-Transform Networks: Exploiting Symmetries using Invertible Transformations

  • Department: Artificial Intelligence Center
  • Abstract:
    We propose Sum-Product-Transform Networks (SPTN), an extension of sum-product networks that uses invertible transformations as additional internal nodes. The type and placement of transformations determine properties of the resulting SPTN, with many interesting special cases. Importantly, in SPTNs with Gaussian leaves and affine transformations, the same inference tasks that can be computed efficiently in SPNs remain tractable. We propose to store and optimize affine transformations in their SVD decompositions using an efficient parametrization of unitary matrices by a set of Givens rotations. Last but not least, we demonstrate that G-SPTNs push the state of the art in density estimation on the datasets used.
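The Givens-rotation parametrization mentioned in the abstract above can be sketched for the real-valued (orthogonal) case as follows. This is an illustrative sketch under assumptions rather than the authors' code: the helper names, the ordering of the rotation planes, and the example angles are hypothetical.

```python
import math


def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def givens(n, i, j, theta):
    """n x n identity with a rotation by theta in the (i, j) plane."""
    g = identity(n)
    c, s = math.cos(theta), math.sin(theta)
    g[i][i], g[i][j] = c, -s
    g[j][i], g[j][j] = s, c
    return g


def matmul(a, b):
    return [
        [sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
        for r in range(len(a))
    ]


def orthogonal_from_angles(n, angles):
    """Product of one Givens rotation per index pair i < j."""
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if len(angles) != len(pairs):
        raise ValueError("expected n * (n - 1) / 2 angles")
    q = identity(n)
    for (i, j), theta in zip(pairs, angles):
        q = matmul(q, givens(n, i, j, theta))
    return q


Q = orthogonal_from_angles(3, [0.3, -1.1, 2.0])
Qt = [list(col) for col in zip(*Q)]  # transpose of Q
QtQ = matmul(Qt, Q)                  # should be close to the identity
```

Because every factor is a rotation, the product stays orthogonal for any unconstrained choice of angles. With an affine map stored as W = U diag(d) Vᵀ and U, V built this way, |det W| is simply the product of the |d_i|.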

Classification with Costly Features Using Deep Reinforcement Learning

  • DOI: 10.1609/aaai.v33i01.33013959
  • Link: https://doi.org/10.1609/aaai.v33i01.33013959
  • Department: Department of Computer Science, Artificial Intelligence Center
  • Abstract:
    We study a classification problem where each feature can be acquired for a cost and the goal is to optimize a trade-off between the expected classification error and the feature cost. We revisit a former approach that has framed the problem as a sequential decision-making problem and solved it by Q-learning with a linear approximation, where individual actions are either requests for feature values or terminate the episode by providing a classification decision. On a set of eight problems, we demonstrate that by replacing the linear approximation with neural networks the approach becomes comparable to the state-of-the-art algorithms developed specifically for this problem. The approach is flexible, as it can be improved with any new reinforcement learning enhancement, it allows inclusion of a pre-trained high-performance classifier, and unlike prior art, its performance is robust across all evaluated datasets.

Exploiting Adversarial Embeddings for Better Steganography

  • Authors: Bernard, S., doc. Ing. Tomáš Pevný, Ph.D., Bas, P., Klein, J.
  • Publication: Proceedings of the ACM Workshop on Information Hiding and Multimedia Security. New York: ACM, 2019. p. 216-221. ISBN 978-1-4503-6821-6.
  • Year: 2019
  • DOI: 10.1145/3335203.3335737
  • Link: https://doi.org/10.1145/3335203.3335737
  • Department: Artificial Intelligence Center
  • Abstract:
    This work proposes a protocol to iteratively build a distortion function for adaptive steganography while increasing its practical security after each iteration. It relies on prior art on targeted attacks and iterative design of steganalysis schemes. It combines targeted attacks on a given detector with a min-max strategy, which dynamically selects the most difficult stego content associated with the best classifier at each iteration. We theoretically prove the convergence, which is confirmed by the practical results. Applied to J-Uniward, this new protocol increases the error probability P_err from 7% to 20% as estimated by Xu-Net, and from 10% to 23% for a non-targeted steganalysis by a linear classifier with GFR features.

Joint Detection of Malicious Domains and Infected Clients

  • Authors: Prasse, P., Knaebel, R., Machlica, L., doc. Ing. Tomáš Pevný, Ph.D., Scheffer, T.
  • Publication: Machine Learning. 2019, 108(8-9), 1353-1368. ISSN 0885-6125.
  • Year: 2019
  • DOI: 10.1007/s10994-019-05789-z
  • Link: https://doi.org/10.1007/s10994-019-05789-z
  • Department: Artificial Intelligence Center
  • Abstract:
    Detection of malware-infected computers and detection of malicious web domains based on their encrypted HTTPS traffic are challenging problems, because only addresses, timestamps, and data volumes are observable. The detection problems are coupled, because infected clients tend to interact with malicious domains. Traffic data can be collected at a large scale, and antivirus tools can be used to identify infected clients in retrospect. Domains, by contrast, have to be labeled individually after forensic analysis. We explore transfer learning based on sluice networks; this allows the detection models to bootstrap each other. In a large-scale experimental study, we find that the model outperforms known reference models and detects previously unknown malware, previously unknown malware families, and previously unknown malicious domains.

Orthogonal Approximation of Marginal Likelihood of Generative Models

  • Authors: Šmídl, V., Bím, J., doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: Proceedings of the Bayesian Deep Learning. Amsterdam: University of Amsterdam, 2019.
  • Year: 2019
  • Department: Artificial Intelligence Center
  • Abstract:
    This paper presents a new approximation of the marginal likelihood of generative models which is used as a score for anomaly detection. The score is motivated by the shortcoming of the popular reconstruction error that it can behave arbitrarily outside the known samples. The proposed score corrects this by an orthogonal combination of the reconstruction error and the likelihood in the latent space. As experimentally shown on benchmark problems from anomaly detection and illustrated on a toy problem, this combination lends the score robustness to outliers. Generative models evaluated with this score outperformed the competing methods, especially in tasks of learning a distribution from data corrupted by anomalies. Finally, the score is compatible with contemporary generative models, namely variational auto-encoders and generative adversarial networks.

Rodent: Relevance determination in ODE

  • Department: Department of Computer Science, Artificial Intelligence Center
  • Abstract:
    From a set of observed trajectories of a partially observed system, we aim to learn its underlying (physical) process without having to make too many assumptions about the generating model. We start with a very general, over-parameterized ordinary differential equation (ODE) of order N and learn the minimal complexity of the model, by which we mean both the order of the ODE as well as the minimum number of non-zero parameters that are needed to solve the problem. The minimal complexity is found by applying a combination of the Variational Auto-Encoder (VAE) and Automatic Relevance Determination (ARD) to the problem of learning the parameters of an ODE, an approach which we call Rodent. We show that it is possible to learn not only one specific model for a single process, but a manifold of models representing harmonic signals in general.

Exploring Non-Additive Distortion in Steganography

  • Authors: doc. Ing. Tomáš Pevný, Ph.D., Ker, A.D.
  • Publication: Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security. New York: ACM, 2018. p. 109-114. ISBN 978-1-4503-5625-1.
  • Year: 2018
  • DOI: 10.1145/3206004.3206015
  • Link: https://doi.org/10.1145/3206004.3206015
  • Department: Artificial Intelligence Center
  • Abstract:
    Leading steganography systems make use of the Syndrome-Trellis Code (STC) algorithm to minimize a distortion function while encoding the desired payload, but this constrains the distortion function to be additive. The Gibbs Embedding algorithm works for a certain class of non-additive distortion functions, but has its own limitations and is highly complex. In this short paper we show that it is possible to modify the STC algorithm in a simple way, to minimize a non-additive distortion function suboptimally. We use it for two examples. First, applying it to the S-UNIWARD distortion function, we show that it does indeed reduce distortion, compared with minimizing the additive approximation currently used in image steganography, but that it makes the payload more -- not less -- detectable. This parallels research attempting to use Gibbs Embedding for the same task. Second, we apply it to distortion defined by the output of a specific detector, as a counter-move in the steganography game. However, unless the Warden is forced to move first (by fixing the detector) this is highly detectable.

Multiple instance learning for malware classification

  • Authors: Stiborek, J., doc. Ing. Tomáš Pevný, Ph.D., Rehák, M.
  • Publication: Expert Systems with Applications. 2018, 2018(93), 346-357. ISSN 0957-4174.
  • Year: 2018
  • DOI: 10.1016/j.eswa.2017.10.036
  • Link: https://doi.org/10.1016/j.eswa.2017.10.036
  • Department: Artificial Intelligence Center
  • Abstract:
    This work addresses classification of unknown binaries executed in sandbox by modeling their interaction with system resources (files, mutexes, registry keys and communication with servers over the network) and error messages provided by the operating system, using vocabulary-based method from the multiple instance learning paradigm. It introduces similarities suitable for individual resource types that combined with an approximative clustering method efficiently group the system resources and define features directly from data. This approach effectively removes randomization often employed by malware authors and projects samples into low-dimensional feature space suitable for common classifiers. An extensive comparison to the state of the art on a large corpus of binaries demonstrates that the proposed solution achieves superior results using only a fraction of training samples. Moreover, it makes use of a source of information different than most of the prior art, which increases the diversity of tools detecting the malware, hence making detection evasion more difficult.

Network traffic fingerprinting based on approximated kernel two-sample test

  • Authors: Kohout, J., doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: IEEE Transactions on Information Forensics and Security. 2018, 13(3), 788-801. ISSN 1556-6013.
  • Year: 2018
  • DOI: 10.1109/TIFS.2017.2768018
  • Link: https://doi.org/10.1109/TIFS.2017.2768018
  • Department: Artificial Intelligence Center
  • Abstract:
    Many applications and communication protocols exhibit unique communication patterns that can be exploited to identify them in network traffic. This work proposes a method to represent these patterns compactly such that they can be used in different analytical tasks. The method treats each communication as a set of observations of a random variable with an unknown probability distribution. This view allows deriving the representation from a distance between two probability distributions used in Maximum Mean Discrepancy, a non-parametric kernel test. The representation (and distance) can then be easily used in various algorithms for identification of the communicating application and data analysis, independently of the specific type of input data.
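The distance underlying the Maximum Mean Discrepancy test named in the abstract above can be estimated directly from two sets of observations. Below is a minimal sketch of the plain (biased) estimator with a Gaussian kernel. It does not reproduce the approximated representation proposed in the paper, and the feature vectors, the `gamma` value, and the function names are made up for illustration.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    def mean_k(a, b):
        total = sum(gaussian_kernel(u, v, gamma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


# Hypothetical per-flow feature vectors for three communications.
flows_a = [(0.1, 1.0), (0.2, 1.1), (0.0, 0.9)]
flows_b = [(0.1, 1.0), (0.2, 1.1), (0.0, 0.9)]
flows_c = [(3.0, -2.0), (3.1, -2.2), (2.9, -1.9)]

same = mmd2(flows_a, flows_b)       # identical samples
different = mmd2(flows_a, flows_c)  # clearly separated samples
print(same, different)
```

Identical samples give an estimate of zero, while well-separated samples give a value close to 2, the maximum for a kernel bounded by 1. The quadratic cost in the number of observations is what makes an approximated, fixed-size representation attractive at scale.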

Probabilistic analysis of dynamic malware traces

  • DOI: 10.1016/j.cose.2018.01.012
  • Link: https://doi.org/10.1016/j.cose.2018.01.012
  • Department: Artificial Intelligence Center
  • Abstract:
    We propose a method to automatically group unknown binaries executed in a sandbox according to their interaction with system resources (files on the filesystem, mutexes, registry keys, network communication with remote servers and error messages generated by the operating system) such that each group corresponds to a malware family. The method uses a probabilistic generative model (a Bernoulli mixture model), which allows human-friendly prioritization of identified clusters and extraction of readable behavioral indicators to maximize interpretability. We compare it to relevant prior art on a large set of malware binaries, where the quality of cluster prioritization and of automatic extraction of indicators of compromise is demonstrated. The proposed approach therefore implements a complete pipeline that has the potential to significantly speed up the analysis of unknown samples.

Malware Detection by Analysing Encrypted Network Traffic with Neural Networks

  • Autoři: Prasse, P., Machlica, L., doc. Ing. Tomáš Pevný, Ph.D., Havelka, J., Scheffer, T.
  • Publikace: Machine Learning and Knowledge Discovery in Databases. Cham: Springer International Publishing AG, 2017. p. 73-88. vol. I, II, III. ISSN 0302-9743. ISBN 978-3-319-71245-1.
  • Rok: 2017
  • DOI: 10.1007/978-3-319-71246-8_5
  • Odkaz: https://doi.org/10.1007/978-3-319-71246-8_5
  • Pracoviště: Centrum umělé inteligence
  • Anotace:
    We study the problem of detecting malware on client computers based on the analysis of HTTPS traffic. Here, malware has to be detected based on the host address, timestamps, and data volume information of the computer’s network traffic. We develop a scalable protocol that allows us to collect network flows of known malicious and benign applications as training data and derive a malware-detection method based on a neural embedding of domain names and a long short-term memory network that processes network flows. We study the method’s ability to detect new malware in a large-scale empirical study.

Optimal Strategies for Detecting Data Exfiltration by Internal and External Attackers

  • DOI: 10.1007/978-3-319-68711-7_10
  • Odkaz: https://doi.org/10.1007/978-3-319-68711-7_10
  • Pracoviště: Katedra počítačů, Centrum umělé inteligence
  • Anotace:
    We study the problem of detecting data exfiltration in computer networks. We focus on the performance of optimal defense strategies with respect to an attacker’s knowledge about typical network behavior and his ability to influence the standard traffic. Internal attackers know the typical upload behavior of the compromised host and may be able to discontinue standard uploads in favor of the exfiltration. External attackers do not immediately know the behavior of the compromised host, but they can learn it from observations. We model the problem as a sequential game of imperfect information, where the network administrator selects the thresholds for the detector, while the attacker chooses how much data to exfiltrate in each time step. We present novel algorithms for approximating the optimal defense strategies in the form of Stackelberg equilibria. We analyze the scalability of the algorithms and efficiency of the produced strategies in a case study based on real-world uploads of almost six thousand users to Google Drive. We show that with the computed defense strategies, the attacker exfiltrates 2–3 times less data than with simple heuristics; randomized defense strategies are up to 30% more effective than deterministic ones, and substantially more effective defense strategies are possible if the defense is customized for groups of hosts with similar behavior.

Reducing False Positives of Network Anomaly Detection by Local Adaptive Multivariate Smoothing

  • Autoři: Grill, M., doc. Ing. Tomáš Pevný, Ph.D., Rehák, M.
  • Publikace: Journal of Computer and System Sciences. 2017, 83(1), 43-57. ISSN 0022-0000.
  • Rok: 2017
  • DOI: 10.1016/j.jcss.2016.03.007
  • Odkaz: https://doi.org/10.1016/j.jcss.2016.03.007
  • Pracoviště: Centrum umělé inteligence
  • Anotace:
    Network intrusion detection systems based on the anomaly detection paradigm have high false alarm rate making them difficult to use. To address this weakness, we propose to smooth the outputs of anomaly detectors by online Local Adaptive Multivariate Smoothing (LAMS). LAMS can reduce a large portion of false positives introduced by the anomaly detection by replacing the anomaly detector's output on a network event with an aggregate of its output on all similar network events observed previously. The arguments are supported by extensive experimental evaluation involving several anomaly detectors in two domains: NetFlow and proxy logs. Finally, we show how the proposed solution can be efficiently implemented to process large streams of non-stationary data.

Using Neural Network Formalism to Solve Multiple-Instance Problems

  • Autoři: doc. Ing. Tomáš Pevný, Ph.D., Somol, P.
  • Publikace: Advances in Neural Networks - ISNN 2017. Wien: Springer, 2017. p. 135-142. LNCS. vol. 10261. ISSN 0302-9743. ISBN 978-3-319-59071-4.
  • Rok: 2017
  • DOI: 10.1007/978-3-319-59072-1_17
  • Odkaz: https://doi.org/10.1007/978-3-319-59072-1_17
  • Pracoviště: Centrum umělé inteligence
  • Anotace:
    Many objects in the real world are difficult to describe by means of a single numerical vector of a fixed length, whereas describing them by means of a set of vectors is more natural. Therefore, multiple instance learning (MIL) techniques have been constantly gaining in importance in recent years. The MIL formalism assumes that each object (sample) is represented by a set (bag) of feature vectors (instances) of fixed length, where knowledge about objects (e.g., the class label) is available on the bag level but not necessarily on the instance level. Many standard tools, including supervised classifiers, have already been adapted to the MIL setting since the problem was formalized in the late nineties. In this work we propose a neural network (NN) based formalism that intuitively bridges the gap between the MIL problem definition and the vast existing knowledge base of standard models and classifiers. We show that the proposed NN formalism is effectively optimizable by a back-propagation algorithm and can reveal unknown patterns inside bags. Comparison to 14 types of classifiers from the prior art on a set of 20 publicly available benchmark datasets confirms the advantages and accuracy of the proposed solution.
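    The bag-level idea in the annotation can be sketched as a forward pass: every instance goes through the same small layer, the results are pooled into one fixed-length vector, and a final layer produces the bag-level output. The layer sizes, the ReLU and mean-pooling choices, the weights, and all names below are assumptions made for illustration, not the configuration used in the paper.

```python
import math


def relu_layer(weights, bias, x):
    """Apply one dense layer with ReLU to a single instance vector."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def bag_forward(bag, inst_w, inst_b, out_w, out_b):
    """Score a bag (a list of instance vectors of equal length)."""
    # 1) Instance level: the same layer is applied to every instance.
    embedded = [relu_layer(inst_w, inst_b, x) for x in bag]
    # 2) Pooling: mean over instances gives one fixed-length bag vector.
    pooled = [sum(column) / len(embedded) for column in zip(*embedded)]
    # 3) Bag level: logistic output on the pooled representation.
    logit = sum(w * p for w, p in zip(out_w, pooled)) + out_b
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical parameters: 2 input features, 3 hidden units.
INST_W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
INST_B = [0.0, 0.1, -0.1]
OUT_W = [1.0, -1.5, 0.7]
OUT_B = 0.05

small_bag = [[1.0, 2.0], [0.5, -1.0]]
large_bag = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]
p_small = bag_forward(small_bag, INST_W, INST_B, OUT_W, OUT_B)
p_large = bag_forward(large_bag, INST_W, INST_B, OUT_W, OUT_B)
```

    Because pooling averages over instances, bags of different sizes map to vectors of the same length, and the output does not depend on the order of instances within a bag.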

Discriminative Models for Multi-instance Problems with Tree Structure

  • Autoři: doc. Ing. Tomáš Pevný, Ph.D., Somol, Petr
  • Publikace: Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security. New York: ACM, 2016. pp. 83-91. ISBN 978-1-4503-4573-6.
  • Rok: 2016
  • DOI: 10.1145/2996758.2996761
  • Odkaz: https://doi.org/10.1145/2996758.2996761
  • Pracoviště: Centrum umělé inteligence
  • Anotace:
    Modelling network traffic is gaining importance to counter modern security threats of ever increasing sophistication. It is, however, surprisingly difficult and costly to construct reliable classifiers on top of telemetry data due to the variety and complexity of signals that no human can manage to interpret in full. Obtaining training data with a sufficiently large and variable body of labels can thus be seen as a prohibitive problem. The goal of this work is to detect infected computers by observing their HTTP(S) traffic collected from network sensors, which are typically proxy servers or network firewalls, while relying on only minimal human input in the model training phase. We propose a discriminative model that makes decisions based on all of a computer's traffic observed during a predefined time window (5 minutes in our case). The model is trained on traffic samples collected over equally-sized time windows for a large number of computers, where the only labels needed are (human) verdicts about the computer as a whole (presumed infected vs. presumed clean). As part of training, the model itself learns discriminative patterns in traffic targeted to individual servers and constructs the final high-level classifier on top of them. We show the classifier to perform with very high precision, and demonstrate that the learned traffic patterns can be interpreted as Indicators of Compromise. We implement the discriminative model as a neural network with a special structure reflecting two stacked multi-instance problems. The main advantages of the proposed configuration include not only improved accuracy and the ability to learn from gross labels, but also automatic learning of server types (together with their detectors) that are typically visited by infected computers.

Feature Extraction and Malware Detection on Large HTTPS Data Using MapReduce

  • Autoři: Čech, Přemysl, Kohout, J., Lokoč, Jakub, Komárek, T., Maroušek, Jakub, doc. Ing. Tomáš Pevný, Ph.D.
  • Publikace: Similarity Search and Applications. Basel: Springer, 2016. p. 311-324. 9939. ISSN 0302-9743. ISBN 978-3-319-46758-0.
  • Rok: 2016
  • DOI: 10.1007/978-3-319-46759-7_24
  • Odkaz: https://doi.org/10.1007/978-3-319-46759-7_24
  • Pracoviště: Centrum umělé inteligence
  • Anotace:
    Secure HTTP network traffic represents a challenging immense data source for machine learning tasks. The tasks usually try to learn and identify infected network nodes, given only limited traffic features available for secure HTTP data. In this paper, we investigate the performance of grid histograms that can be used to aggregate traffic features of network nodes considering just 5-min batches for snapshots. We compare the representation using linear and k-NN classifiers. We also demonstrate that all presented feature extraction and classification tasks can be implemented in a scalable way using the MapReduce approach.

k-NN Classification of Malware in HTTPS Traffic Using the Metric Space Approach

  • Autoři: Lokoč, J., Kohout, J., Čech, P., Skopal, T., doc. Ing. Tomáš Pevný, Ph.D.
  • Publikace: Intelligence and Security Informatics. Düsseldorf: Springer VDI Verlag, 2016. p. 131-145. ISSN 0302-9743. ISBN 978-3-319-31862-2.
  • Rok: 2016
  • DOI: 10.1007/978-3-319-31863-9_10
  • Odkaz: https://doi.org/10.1007/978-3-319-31863-9_10
  • Pracoviště: Centrum umělé inteligence
  • Anotace:
    In this paper, we present detection of malware in HTTPS traffic using k-NN classification. We focus on the metric space approach for approximate k-NN searches over a dataset of sparse high-dimensional descriptors of network traffic. We show that classification based on approximate k-NN search using a metric index exhibits a false positive rate reduced by an order of magnitude compared to the state-of-the-art method, while keeping the classification fast enough.

Learning Combination of Anomaly Detectors for Security Domain

  • DOI: 10.1016/j.comnet.2016.05.021
  • Odkaz: https://doi.org/10.1016/j.comnet.2016.05.021
  • Pracoviště: Centrum umělé inteligence
  • Anotace:
    This paper presents a novel technique of finding a convex combination of outputs of anomaly detectors maximizing the accuracy in τ-quantile of most anomalous samples. Such an approach better reflects the needs in the security domain in which subsequent analysis of alarms is costly and can be done only on a small number of alarms. An extensive experimental evaluation and comparison to prior art on real network data using sets of anomaly detectors of two existing intrusion detection systems shows that the proposed method not only outperforms prior art, it is also more robust to noise in training data labels, which is another important feature for deployment in practice.

Loda: Lightweight on-line detector of anomalies

  • DOI: 10.1007/s10994-015-5521-0
  • Odkaz: https://doi.org/10.1007/s10994-015-5521-0
  • Pracoviště: Katedra počítačů
  • Anotace:
    In supervised learning it has been shown that a collection of weak classifiers can result in a strong classifier with error rates similar to those of more sophisticated methods. In unsupervised learning, namely in anomaly detection such a paradigm has not yet been demonstrated despite the fact that many methods have been devised as counterparts to supervised binary classifiers. This work partially fills the gap by showing that an ensemble of very weak detectors can lead to a strong anomaly detector with a performance equal to or better than state of the art methods. The simplicity of the proposed ensemble system (to be called Loda) is particularly useful in domains where a large number of samples need to be processed in real-time or in domains where the data stream is subject to concept drift and the detector needs to be updated on-line. Besides being fast and accurate, Loda is also able to operate and update itself on data with missing variables. Loda is thus practical in domains with sensor outages. Moreover, Loda can identify features in which the scrutinized sample deviates from the majority. This capability is useful when the goal is to find out what has caused the anomaly. It should be noted that none of these favorable properties increase Loda’s low time and space complexity. We compare Loda to several state of the art anomaly detectors in two settings: batch training and on-line training on data streams. The results on 36 datasets from UCI repository illustrate the strengths of the proposed system, but also provide more insight into the more general questions regarding batch-vs-on-line anomaly detection.
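    To make the "ensemble of very weak detectors" concrete, here is a minimal sketch in which each weak detector is a one-dimensional histogram built on a sparse random projection of the data, and the anomaly score is the average negative log-density across the ensemble. The class name, the number of projections and bins, the square-root sparsity rule, and the add-one smoothing are assumptions for illustration and may differ from the configuration evaluated in the paper.

```python
import math
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


class LodaSketch:
    """Ensemble of 1-D histograms on sparse random projections (illustrative)."""

    def __init__(self, n_projections=50, n_bins=10, seed=0):
        self.n_projections = n_projections
        self.n_bins = n_bins
        self.rng = random.Random(seed)
        self.models = []
        self.n_samples = 0

    def _bin(self, v, lo, width):
        # Clamp so that the maximum training value lands in the last bin.
        return min(int((v - lo) / width), self.n_bins - 1)

    def fit(self, data):
        dim = len(data[0])
        n_nonzero = max(1, round(math.sqrt(dim)))  # assumed sparsity rule
        self.n_samples = len(data)
        self.models = []
        for _ in range(self.n_projections):
            w = [0.0] * dim
            for j in self.rng.sample(range(dim), n_nonzero):
                w[j] = self.rng.gauss(0.0, 1.0)
            z = [dot(w, x) for x in data]
            lo, hi = min(z), max(z)
            width = (hi - lo) / self.n_bins
            if width == 0.0:
                width = 1.0  # degenerate projection: all values identical
            counts = [0] * self.n_bins
            for v in z:
                counts[self._bin(v, lo, width)] += 1
            self.models.append((w, lo, hi, width, counts))
        return self

    def score(self, x):
        """Average negative log-density; higher means more anomalous."""
        total = 0.0
        for w, lo, hi, width, counts in self.models:
            v = dot(w, x)
            count = counts[self._bin(v, lo, width)] if lo <= v <= hi else 0
            # Add-one smoothing keeps the density strictly positive.
            density = (count + 1.0) / ((self.n_samples + self.n_bins) * width)
            total -= math.log(density)
        return total / len(self.models)


data_rng = random.Random(1)
train = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
detector = LodaSketch(seed=0).fit(train)
inlier_score = detector.score([0.0, 0.0, 0.0, 0.0])
outlier_score = detector.score([8.0, 8.0, 8.0, 8.0])
```

    Each histogram is cheap to build and to query, which is what keeps the time and space cost of the whole ensemble low; a point far from the training data falls into empty or sparse bins in most projections and therefore receives a higher score.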

Malicons: Detecting Payload in Favicons

  • Autoři: doc. Ing. Tomáš Pevný, Ph.D., Kopp, M., Křoustek, J., Ker, A.D.
  • Publikace: Media Watermarking, Security, and Forensics 2016. Society for Imaging Science and Technology, 2016. ISSN 2470-1173.
  • Rok: 2016
  • DOI: 10.2352/ISSN.2470-1173.2016.8.MWSF-079
  • Odkaz: https://doi.org/10.2352/ISSN.2470-1173.2016.8.MWSF-079
  • Pracoviště: Katedra počítačů
  • Anotace:
    A recent version of the "Vawtrak" malware used steganography to hide the addresses of the command and control channels in favicons: small images automatically downloaded by the web browser. Since almost all research in steganalysis focuses on natural images, we study how well these methods can detect secret messages in favicons. The study is performed on a large corpus of favicons downloaded from the internet and applies a number of state-of-art steganalysis techniques, as well as proposing very simple novel features that exploit flat areas in favicons. The ultimate question is whether we can detect Vawtrak's steganographic favicons with a sufficiently low false positive rate.

Passive NAT Detection Using HTTP Access Logs

  • Autoři: Komárek, T., Grill, M., doc. Ing. Tomáš Pevný, Ph.D.
  • Publikace: 2016 IEEE International Workshop on Information Forensics and Security. IEEE, 2016. ISSN 2157-4774. ISBN 978-1-5090-1138-4.
  • Rok: 2016
  • DOI: 10.1109/WIFS.2016.7823896
  • Odkaz: https://doi.org/10.1109/WIFS.2016.7823896
  • Pracoviště: Centrum umělé inteligence
  • Anotace:
    Network devices performing Network Address Translation (NAT) overcome the shortage of IPv4 addresses, but they also introduce a vulnerability to the network through possibly insecure configurations. Detection of unauthorized NAT devices is therefore an important task in the network security domain. In this paper, a novel passive NAT detection algorithm is proposed that identifies NAT devices in the network using statistical behavior analysis. We model the behavior of network hosts using eight features extracted from HTTP access logs. These features are collected within consecutive non-overlapping time windows covering the last 24 hours. A pre-trained linear classifier is used to decide whether a host is a NAT device or an end host (non-NAT device). Since labeled data for training purposes is hard to obtain, we also propose a way to generate the training data from unlabeled traffic logs.

Rethinking Optimal Embedding

  • Autoři: Ker, Andrew D., doc. Ing. Tomáš Pevný, Ph.D., Bas, Patrick
  • Publikace: Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security. New York: ACM, 2016. pp. 93-102. ISBN 978-1-4503-4290-2.
  • Rok: 2016
  • DOI: 10.1145/2909827.2930797
  • Odkaz: https://doi.org/10.1145/2909827.2930797
  • Pracoviště: Centrum umělé inteligence
  • Anotace:
    At present, almost all leading steganographic techniques for still images use a distortion minimization paradigm, where each potential change is assigned a cost $c_i$ and the change probabilities $\pi_i$ are chosen to minimize the average total cost $\sum_i \pi_i c_i$. However, some detectors have exploited knowledge of this adaptivity and the embedding cannot be considered optimal. In this work we prove a theoretical result suggesting that, against a knowing attacker, the embedder should simply minimize $\sum_i \pi_i^2 c_i$ instead, for the same costs $c_i$, which is the minimax and equilibrium strategy. This aligns with some special case results that have appeared in recent literature. We then test some simple steganographic methods in theoretical and real settings, showing that naive (average cost) adaptivity is exploitable, but the equilibrium probabilities cannot be exploited. However, it is essential to determine statistically well-founded costs $c_i$.

Using Behavioral Similarity for Botnet Command-and-Control Discovery

  • Autoři: Jusko, J., Rehák, M., Stiborek, J., Kohout, J., doc. Ing. Tomáš Pevný, Ph.D.
  • Publikace: IEEE Intelligent Systems. 2016, 31(5), 16-22. ISSN 1541-1672.
  • Rok: 2016
  • DOI: 10.1109/MIS.2016.88
  • Odkaz: https://doi.org/10.1109/MIS.2016.88
  • Pracoviště: Centrum umělé inteligence
  • Anotace:
    Malware authors and operators typically collaborate to achieve the optimal profit. They also frequently change their behavior and resources to avoid detection. The authors propose a social similarity metric that exploits these relationships to improve the effectiveness and stability of the threat propagation algorithm typically used to discover malicious collaboration. Furthermore, they propose behavioral modeling as a way to group similarly behaving servers, enabling extension of the ground truth that is so expensive to obtain in the field of network security. The authors also show that seeding the threat propagation algorithm from a set of coherently behaving servers (instead of from a single known malicious server identified by threat intelligence) makes the algorithm far more effective and significantly more robust, without compromising the precision of findings.

Automatic Discovery of Web Servers Hosting Similar Applications

  • Autoři: Kohout, J., doc. Ing. Tomáš Pevný, Ph.D.
  • Publikace: Proceedings of the 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM). Piscataway: IEEE, 2015. p. 1310-1315. ISBN 978-3-901882-76-0.
  • Rok: 2015
  • DOI: 10.1109/INM.2015.7140487
  • Odkaz: https://doi.org/10.1109/INM.2015.7140487
  • Pracoviště: Katedra počítačů
  • Anotace:
    Increasingly popular cloud services often consist of many functional parts, which makes their structure rather complex. Understanding this structure nevertheless improves network monitoring for security purposes, traffic routing, etc. Since the structure of third-party services is typically unknown, automated tools for its discovery are greatly needed. In this work, we propose such a tool, relying only on high-level statistics of servers' usage, such as volumes and times of interactions with the servers. Because it does not look into the communication contents, the method works for encrypted channels as well, which is experimentally demonstrated on the Dropbox service and the Windows Live platform.

Finding New Malicious Domains Using Variational Bayes on Large-Scale Computer Network Data

  • Autoři: Létal, V., doc. Ing. Tomáš Pevný, Ph.D., Somol, Petr, Smidl, Vasek
  • Publikace: Proceedings of NIPS workshop on Advances in Approximate Bayesian Inference. Montreal: Neural Information Processing Society, 2015, Available from: http://www.approximateinference.org/accepted/LetalEtAl2015.pdf
  • Rok: 2015
  • Pracoviště: Katedra počítačů
  • Anotace:
    The common limitation in computer network security is the reactive nature of defenses. A new type of infection typically needs to be first observed live before defensive measures can be taken. To improve the pro-active measures, we have developed a method utilizing the WHOIS database (a database of entities that have registered a particular domain) to model relations between domains, even those not yet used. The model estimates the probability of a domain name being used for malicious purposes from observed connections to other related domains. The parameters of the model are inferred by a Variational Bayes method, and its effectiveness is demonstrated on large-scale network data with millions of domains and trillions of connections to them.

Is Ensemble Classifier Needed for Steganalysis in High-Dimensional Feature Spaces?

  • Autoři: Cogranne, Remi, Sedighi, Vahid, Fridrich, Jessica, doc. Ing. Tomáš Pevný, Ph.D.
  • Publikace: Proceedings of the 7th International Workshop on Forensics and Security. New Jersey: IEEE Signal Processing Society, 2015. pp. 1-6. ISBN 978-1-4673-6802-5.
  • Rok: 2015
  • DOI: 10.1109/WIFS.2015.7368597
  • Odkaz: https://doi.org/10.1109/WIFS.2015.7368597
  • Pracoviště: Katedra počítačů
  • Anotace:
    The ensemble classifier, based on Fisher Linear Discriminant (FLD) base learners, was introduced specifically for steganalysis of digital media, which currently uses high-dimensional feature spaces. Presently it is probably the most used method to design a supervised classifier for steganalysis of digital images because of its good detection accuracy and small computational cost. It has been assumed by the community that the classifier implements a non-linear boundary through pooling the binary decisions of individual classifiers within the ensemble. This paper challenges this assumption by showing that a linear classifier obtained by various regularizations of the FLD can perform as well as the ensemble. Moreover, it demonstrates that, using state-of-the-art solvers, linear classifiers can be trained more efficiently and offer certain potential advantages over the original ensemble, leading to much lower computational complexity than the ensemble classifier. All claims are supported experimentally on a wide spectrum of stego schemes operating in both the spatial and JPEG domains with a multitude of rich steganalysis feature sets.

Optimizing pooling function for pooled steganalysis

  • Autoři: doc. Ing. Tomáš Pevný, Ph.D., Nikolaev, I.
  • Publikace: Proceedings of the 7th International Workshop on Forensics and Security. New Jersey: IEEE Signal Processing Society, 2015. pp. 1-6. ISBN 978-1-4673-6802-5.
  • Rok: 2015
  • DOI: 10.1109/WIFS.2015.7368555
  • Odkaz: https://doi.org/10.1109/WIFS.2015.7368555
  • Pracoviště: Katedra počítačů
  • Anotace:
    Pooled steganalysis combines evidence from multiple objects to achieve higher accuracy in detecting hidden messages at the expense of granularity, as the decision is provided on the set of objects instead of a single one. Although it was introduced almost a decade ago, very little work has been done since then. This work builds upon recent advances in machine learning to show how an optimal function combining outputs of a single object detector on a set of objects can be learned. Although experiments demonstrate that learned combining functions are superior to the prior art, more importantly they reveal many interesting phenomena and point to directions of further research.

Towards dependable steganalysis

  • Autoři: doc. Ing. Tomáš Pevný, Ph.D., Ker, Andrew D.
  • Publikace: Proceedings of SPIE Media Watermarking, Security, and Forensics 2015. Bellingham (Washington State): SPIE, 2015. Proceedings of SPIE. ISSN 0277-786X. ISBN 978-1-62841-499-8.
  • Rok: 2015
  • DOI: 10.1117/12.2083216
  • Odkaz: https://doi.org/10.1117/12.2083216
  • Pracoviště: Katedra počítačů
  • Anotace:
    This paper considers the research goal of dependable steganalysis: where false positives occur once in a million or less, and this rate is known with high precision. Despite its importance for real-world application, there has been almost no study of steganalysis which produces very low false positives. We test existing and novel classifiers for their low false-positive performance, using millions of images from Flickr. Experiments on such a scale require considerable engineering. Standard steganalysis classifiers do not perform well in a low false-positive regime, and we make new proposals to penalize false positives more than false negatives.

Towards Scalable Network Host Simulation

  • Autoři: Stiborek, J., Rehák, M., doc. Ing. Tomáš Pevný, Ph.D.
  • Publikace: Autonomous Agents and Multiagent Systems. County of Richland: IFAAMAS, 2015, 14. ISSN 1548-8403. ISBN 978-1-4503-3771-7. Available from: http://www.lancaster.ac.uk/staff/suchj/acyse2015-proceedings/ACySe2015_submission_Stiborek.pdf
  • Rok: 2015
  • Pracoviště: Katedra počítačů
  • Anotace:
    Anomaly detection techniques in network security face significant challenges in configuration and evaluation, as collecting data for accurate analysis is difficult or nearly impossible. One viable approach is to avoid live data collection and replace it by agent-based simulation of the network traffic with models of users' behavior. In this paper we propose three approaches differing by the level of detail with which user behavior is modeled.

Unsupervised Detection of Malware in Persistent Web Traffic

  • Autoři: Kohout, J., doc. Ing. Tomáš Pevný, Ph.D.
  • Publikace: Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2015. pp. 1757-1761. ISSN 1520-6149. ISBN 978-1-4673-6997-8.
  • Rok: 2015
  • DOI: 10.1109/ICASSP.2015.7178272
  • Odkaz: https://doi.org/10.1109/ICASSP.2015.7178272
  • Pracoviště: Katedra počítačů
  • Anotace:
    Persistent network communication can be found in many instances of malware. In this paper, we analyse the possibility of leveraging the low variability of persistent malware communication for its detection. We propose a new method for capturing statistical fingerprints of connections and employ outlier detection to identify the malicious ones. Emphasis is put on using as little information as possible to make our method very lightweight and easy to deploy. Anomaly detection is commonly used in network security, yet to the best of our knowledge, there are not many works focusing on the persistent communication itself, without making further assumptions about its purpose.

A Memory Efficient Privacy Preserving Representation of Connection Graphs

  • Autoři: Rehák, M., Jusko, J., doc. Ing. Tomáš Pevný, Ph.D.
  • Publikace: ACySE '14 Proceedings of the 1st International Workshop on Agents and CyberSecurity. New York: ACM, 2014. ISBN 978-1-4503-2728-2.
  • Rok: 2014
  • DOI: 10.1145/2602945.2602947
  • Odkaz: https://doi.org/10.1145/2602945.2602947
  • Pracoviště: Katedra počítačů
  • Anotace:
    Connection graphs are often used for network traffic classification and P2P networks analysis. With the appearance of Software Defined Networks (SDN), a novel approach to proactive distributed network management based on multi-agent paradigm, there is a need to develop specialized graph representations. Once transmitted between elements of SDN network, they provide answers to specific queries while protecting other information about the graph. In this paper we propose one such graph representation based on Bloom Filters and show that it provides considerable reduction of required memory and strong privacy while keeping low false positive rate that does not have negative impact on its intended use.
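    A minimal sketch of how a Bloom filter can hold the edges of a connection graph is shown below. It answers "was this connection observed?" without storing the endpoints in recoverable form. The class name, the filter size, the number of hash functions, and the use of SHA-256 are all assumptions for illustration; the representation proposed in the paper may be organized differently.

```python
import hashlib


class EdgeBloomFilter:
    """Bloom filter over directed edges (src, dst) of a connection graph."""

    def __init__(self, n_bits=4096, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, src, dst):
        # Derive n_hashes bit positions by salting one hash with an index.
        key = f"{src}->{dst}".encode("utf-8")
        for i in range(self.n_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add_edge(self, src, dst):
        for pos in self._positions(src, dst):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def has_edge(self, src, dst):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(src, dst))


graph = EdgeBloomFilter()
graph.add_edge("10.0.0.1", "10.0.0.2")
graph.add_edge("10.0.0.2", "10.0.0.7")
seen = graph.has_edge("10.0.0.1", "10.0.0.2")
```

    The memory footprint is fixed by `n_bits` regardless of how many edges are added, and the false positive rate grows as the filter fills, so the size has to be chosen with the expected number of edges in mind.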

A mishmash of methods for mitigating the model mismatch mess

  • Autoři: Ker, Andrew D., doc. Ing. Tomáš Pevný, Ph.D.
  • Publikace: Proceedings of SPIE Media Watermarking, Security, and Forensics 2014. Washington: SPIE, 2014, ISSN 0277-786X. ISBN 978-0-8194-9945-5. Available from: http://dx.doi.org/10.1117/12.2038908
  • Rok: 2014
  • DOI: 10.1117/12.2038908
  • Odkaz: https://doi.org/10.1117/12.2038908
  • Pracoviště: Katedra počítačů
  • Anotace:
    The model mismatch problem occurs in steganalysis when a binary classifier is trained on objects from one cover source and tested on another: an example of domain adaptation. It is highly realistic because a steganalyst would rarely have access to much or any training data from their opponent, and its consequences can be devastating to classifier accuracy. This paper presents an in-depth study of one particular instance of model mismatch, in a set of images from Flickr using one fixed steganography and steganalysis method, attempting to separate different effects of mismatch in feature space and find methods of mitigation where possible. We also propose new benchmarks for accuracy, which are more appropriate than mean error rates when there are multiple actors and multiple images, and consider the case of 3-valued detectors which also output "don't know". This pilot study demonstrates that some simple feature-centering and ensemble methods can reduce the mismatch penalty considerably, but not completely remove it.

Explaining Anomalies with Sapling Random Forests

  • Autoři: doc. Ing. Tomáš Pevný, Ph.D., Kopp, M.
  • Publikace: Proceedings of the 14th conference ITAT 2014 – Workshops and Posters. Praha: Institute of Computer Science AS CR, 2014. pp. 71-78. ISBN 978-80-87136-19-5.
  • Rok: 2014
  • Pracoviště: Katedra počítačů, Centrum umělé inteligence
  • Anotace:
    The main objective of anomaly detection algorithms is finding samples deviating from the majority. Although a vast number of algorithms designed for this already exist, almost none of them explain why a particular sample was labelled as an anomaly. To address this issue, we propose an algorithm called Explainer, which returns the explanation of a sample's differentness in disjunctive normal form (DNF), which is easy for humans to understand. Since Explainer treats anomaly detection algorithms as black boxes, it can be applied in many domains to simplify investigation of anomalies. The core of Explainer is a set of specifically trained trees, which we call sapling random forests. Since their training is fast and memory efficient, the whole algorithm is lightweight and applicable to large databases, data streams, and real-time problems. The correctness of Explainer is demonstrated on a wide range of synthetic and real-world datasets.

Interpreting and clustering outliers with sapling random forests

  • Authors: Kopp, M., doc. Ing. Tomáš Pevný, Ph.D., Holeňa, M.
  • Publication: Proceedings of the 14th conference ITAT 2014 – Workshops and Posters. Praha: Institute of Computer Science AS CR, 2014. pp. 61-67. ISBN 978-80-87136-19-5.
  • Year: 2014
  • Affiliation: Department of Computer Science; Artificial Intelligence Center
  • Abstract:
    The main objective of outlier detection is finding samples considerably deviating from the majority. Such outliers, often referred to as anomalies, are nowadays more and more important, because they help to uncover interesting events within data. Consequently, a considerable number of statistical and data mining techniques to identify anomalies have been proposed in the last few years, but only a few works at least mention why a sample was labelled as an anomaly. Therefore, we propose a method based on specifically trained decision trees, called sapling random forest. Our method is able to interpret the output of an arbitrary anomaly detector. The explanation is given as a subset of features in which the sample deviates most, or as conjunctions of atomic conditions, which can be viewed as antecedents of logical rules easily understandable by humans. To simplify the investigation of suspicious samples even more, we propose two methods of clustering anomalies into groups. Such clusters can be investigated at once, saving time and human effort. The feasibility of our approach is demonstrated on several synthetic datasets and one real-world dataset.

Randomized Operating Point Selection in Adversarial Classification

  • DOI: 10.1007/978-3-662-44851-9_16
  • Link: https://doi.org/10.1007/978-3-662-44851-9_16
  • Affiliation: Department of Computer Science
  • Abstract:
    Security systems for email spam filtering, network intrusion detection, steganalysis, and watermarking frequently use classifiers to separate malicious behavior from legitimate behavior. Typically, they use a fixed operating point minimizing the expected cost or error. This allows a rational attacker to deliver invisible attacks just below the detection threshold. We model this situation as a non-zero-sum normal-form game capturing the attacker's expected payoffs for detected and undetected attacks, and the detector's costs for false positives and false negatives computed from the Receiver Operating Characteristic (ROC) curve of the classifier. The analysis of Nash and Stackelberg equilibria reveals that using a randomized strategy over multiple operating points forces the rational attacker to design less efficient attacks and substantially lowers the expected cost of the detector. We present the equilibrium strategies for sample ROC curves from a network intrusion detection system and evaluate the corresponding benefits.

Steganographic key leakage through payload metadata

  • Authors: doc. Ing. Tomáš Pevný, Ph.D., Ker, Andrew D.
  • Publication: Proceedings of the 2nd ACM workshop on Information hiding and multimedia security. New York: ACM, 2014. pp. 109-114. ISBN 978-1-4503-2647-6.
  • Year: 2014
  • DOI: 10.1145/2600918.2600921
  • Link: https://doi.org/10.1145/2600918.2600921
  • Affiliation: Department of Computer Science
  • Abstract:
    The only steganalysis attack which can provide absolute certainty about the presence of payload is one which finds the embedding key. In this paper we consider refined versions of the key exhaustion attack exploiting metadata such as message length or decoding matrix size, which must be stored along with the payload. We show simple errors of implementation lead to leakage of key information and powerful inference attacks; furthermore, complete absence of information leakage seems difficult to avoid. This topic has been somewhat neglected in the literature for the last ten years, but must be considered in real-world implementations.

The Steganographer is the Outlier: Realistic Large-Scale Steganalysis

  • Authors: Ker, Andrew D., doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: IEEE Transactions on Information Forensics and Security. 2014, 9(9), 1424-1435. ISSN 1556-6013.
  • Year: 2014
  • DOI: 10.1109/TIFS.2014.2336380
  • Link: https://doi.org/10.1109/TIFS.2014.2336380
  • Affiliation: Department of Computer Science
  • Abstract:
    We present a method for a completely new kind of steganalysis to determine who, out of a large number of actors each transmitting a large number of objects, is hiding payload inside some of them. It has significant challenges, including unknown embedding parameters and natural deviation between innocent cover sources, which are usually avoided in steganalysis tested under laboratory conditions. Our method uses standard steganalysis features, the maximum mean discrepancy measure of distance, and ranks the actors by their degree of deviation from the rest: we show that it works reliably, completely unsupervised, when tested against some of the standard steganography methods available to nonexperts. We also determine good parameters for the detector and show that it creates a two-player game between the guilty actor and the steganalyst.
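
The ranking idea in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: a Gaussian kernel with a hand-picked `gamma`, the biased estimator of squared MMD, and the mean MMD to all other actors as the deviation score (a simple stand-in, not the detector evaluated in the paper). The function names and toy feature vectors are hypothetical.

```python
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(X, Y, gamma=1.0):
    """Biased estimate of the squared MMD between two samples of vectors."""
    def mean_k(A, B):
        total = sum(gaussian_kernel(a, b, gamma) for a in A for b in B)
        return total / (len(A) * len(B))
    return mean_k(X, X) + mean_k(Y, Y) - 2.0 * mean_k(X, Y)


def rank_actors(actors, gamma=1.0):
    """Return actor names sorted from most to least deviating.

    actors: dict mapping a name to a list of feature vectors, one vector
    per transmitted object. The deviation of an actor is the mean squared
    MMD to every other actor (an assumed, simplified outlier score).
    """
    names = list(actors)
    score = {
        a: sum(mmd2(actors[a], actors[b], gamma) for b in names if b != a)
        / (len(names) - 1)
        for a in names
    }
    return sorted(names, key=score.get, reverse=True)


# Toy data: actors "a" and "b" share a source, "c" is shifted.
toy_actors = {
    "a": [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.0]],
    "b": [[0.0, -0.1], [0.1, 0.1], [0.0, 0.0]],
    "c": [[2.0, 2.1], [2.1, 2.0], [1.9, 2.0]],
}
print(rank_actors(toy_actors))
```

No classifier is trained anywhere in the sketch: each actor is compared only with the other actors, which is what makes the approach unsupervised.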

Anomaly detection by bagging

  • Authors: doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: Solving Complex Machine Learning Problems with Ensemble Methods. 2013, pp. 25-40. Available from: http://ama.imag.fr/COPEM/copem2013_proceedings.pdf
  • Year: 2013
  • Affiliation: Department of Computer Science
  • Abstract:
    Many contemporary domains, e.g. network intrusion detection, fraud detection, etc., call for an anomaly detector processing a continuous stream of data. This need is driven by the high rate of data acquisition, by limited resources for storing the data, or by privacy issues. The data can also be non-stationary, requiring the detector to continuously adapt to their change. A good detector for these domains should therefore have a low training and classification complexity, an on-line training algorithm, and, of course, good detection accuracy. This paper proposes a detector trying to meet all these criteria. The detector consists of multiple weak detectors, each implemented as a one-dimensional histogram. The one-dimensional histogram was chosen because it can be efficiently created on-line, and probability estimates can be efficiently retrieved from it. This construction gives the detector linear complexity of training and classification with respect to the input dimension, number of samples, and number of weak detectors. The accuracy of the detector is compared to seven anomaly detectors from the prior art on 36 classification problems from the UCI database. Results show that despite the detector's simplicity, its accuracy is competitive with that of more complex detectors with a substantially higher computational complexity.
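
An ensemble of one-dimensional histograms, as described in the abstract above, might look like the following sketch. Several details are assumptions, since the abstract does not specify them: each weak detector sees a Gaussian random projection of the input, the histograms use equal-width bins fitted in batch rather than on-line, and the score is the average negative log of a smoothed bin probability. All names are hypothetical.

```python
import math
import random


class HistogramDetector:
    """One weak detector: a 1-D equal-width histogram of a random projection."""

    def __init__(self, dim, n_bins, rng):
        # Assumed choice: a Gaussian random projection vector.
        self.w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        self.n_bins = n_bins

    def _project(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def _bin(self, v):
        idx = int((v - self.lo) / self.width)
        return min(max(idx, 0), self.n_bins - 1)

    def fit(self, data):
        z = [self._project(x) for x in data]
        self.lo, self.hi = min(z), max(z)
        # Fall back to width 1.0 when all projections coincide.
        self.width = (self.hi - self.lo) / self.n_bins or 1.0
        self.counts = [0] * self.n_bins
        for v in z:
            self.counts[self._bin(v)] += 1
        self.total = len(z)
        return self

    def neg_log_prob(self, x):
        v = self._project(x)
        # Values outside the training range fall into an empty region.
        count = 0 if (v < self.lo or v > self.hi) else self.counts[self._bin(v)]
        # Additive smoothing keeps the probability strictly positive.
        p = (count + 1) / (self.total + self.n_bins + 1)
        return -math.log(p)


def fit_ensemble(data, n_detectors=50, n_bins=10, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    return [HistogramDetector(dim, n_bins, rng).fit(data) for _ in range(n_detectors)]


def anomaly_score(ensemble, x):
    """Average negative log-probability over all weak detectors."""
    return sum(d.neg_log_prob(x) for d in ensemble) / len(ensemble)
```

Both fitting and scoring cost one projection and one bin lookup per sample and detector, which matches the linear complexity claimed in the abstract.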

Attacking the IDS learning processes

  • Authors: doc. Ing. Tomáš Pevný, Ph.D., Rehák, M., Komon, Martin
  • Publication: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. Piscataway: IEEE, 2013. pp. 8687-8691. ISSN 1520-6149. ISBN 9781479903566.
  • Year: 2013
  • DOI: 10.1109/ICASSP.2013.6639362
  • Link: https://doi.org/10.1109/ICASSP.2013.6639362
  • Affiliation: Department of Computer Science
  • Abstract:
    We study the problem of directed attacks on the learning process of an anomaly-based Intrusion Detection System (IDS). We assume that the attack is performed by a knowledgeable attacker with access to the system's inputs, outputs, and all internal states. The attacker uses his knowledge of the IDS (implemented as an ensemble of anomaly detection algorithms) and its internal states to design the strongest undetectable attack of a particular type. We have experimented with different attacks against several anomaly detection algorithms individually, and against their combination. We show that while the individual anomaly detection algorithms can be easily avoided by the worst-case attacker that we assume, it is nearly impossible to avoid them simultaneously. These results were achieved during experiments performed on university network traffic and are consistent with a theoretical hypothesis grounded in steganalysis and watermarking.

Moving Steganography and Steganalysis from the Laboratory into the Real World

  • Authors: Ker, Andrew, Bas, Patrick, Bohme, Rainer, Cogranne, Remi, Craver, Scott, Filler, Tomas, Fridrich, Jessica, doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: Proceedings of the first ACM workshop on Information hiding and multimedia security. New York: ACM Press, 2013. pp. 45-58. ISBN 978-1-4503-2081-8.
  • Year: 2013
  • DOI: 10.1145/2482513.2482965
  • Link: https://doi.org/10.1145/2482513.2482965
  • Affiliation: Department of Computer Science
  • Abstract:
    There has been an explosion of academic literature on steganography and steganalysis in the past two decades. With a few exceptions, such papers address abstractions of the hiding and detection problems, which arguably have become disconnected from the real world. Most published results, including by the authors of this paper, apply "in laboratory conditions" and some are heavily hedged by assumptions and caveats; significant challenges remain unsolved in order to implement good steganography and steganalysis in practice. This position paper sets out some of the important questions which have been left unanswered, as well as highlighting some that have already been addressed successfully, for steganography and steganalysis to be used in the real world.

The Challenges of Rich Features in Universal Steganalysis

  • Authors: doc. Ing. Tomáš Pevný, Ph.D., Ker, Andrew D.
  • Publication: Media Watermarking, Security, and Forensics 2013. Washington: SPIE, 2013. ISSN 0277-786X. ISBN 9780819494382.
  • Year: 2013
  • DOI: 10.1117/12.2006790
  • Link: https://doi.org/10.1117/12.2006790
  • Affiliation: Department of Computer Science
  • Abstract:
    Contemporary steganalysis is driven by new steganographic rich feature sets, which consist of large numbers of weak features. Although extremely powerful when applied to supervised classification problems, they are not compatible with unsupervised universal steganalysis, because the unsupervised method cannot separate the signal (evidence of steganographic embedding) from the noise (cover content). This work tries to alleviate the problem, by means of feature extraction algorithms. We focus on linear projections informed by embedding methods, and propose a new method which we call calibrated least squares with the specific aim of making the projections sensitive to stego content yet insensitive to cover variation. Different projections are evaluated by their application to the anomaly detector from Ref. 1, and we are able to retain both the universality and the robustness of the method, while increasing its performance substantially.

Batch steganography in the real world

  • Authors: Ker, Andrew, doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: Proceedings of ACM Workshop on Multimedia and Security. New York: ACM Press, 2012, pp. 1-10. ISBN 978-1-4503-1417-6.
  • Year: 2012
  • DOI: 10.1145/2361407.2361409
  • Link: https://doi.org/10.1145/2361407.2361409
  • Affiliation: Department of Computer Science
  • Abstract:
    We examine the universal pooled steganalyzer in two respects. First, we confirm that the method is applicable to a number of different steganographic embedding methods. Second, we consider the converse problem of how to spread payload between multiple covers, by testing different payload allocation strategies against the universal steganalyzer. We focus on practical options which can be implemented without new software or expert knowledge, and we test on real-world data. Concentration of payload into the minimal number of covers is consistently the least detectable option. We present additional investigations which explain this phenomenon, uncovering a nonlinear relationship between embedding distortion and payload. We conjecture that this is an unavoidable consequence of blind steganalysis. This is significant for both batch steganography and pooled steganalysis.

Co-occurrence steganalysis in high dimensions

  • Authors: doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: Proceedings of SPIE, Volume 8303. Pennsylvania State University. University Park: SPIE - International Society for Optical Engineering, 2012. ISSN 0277-786X. ISBN 9780819489500.
  • Year: 2012
  • DOI: 10.1117/12.908914
  • Link: https://doi.org/10.1117/12.908914
  • Affiliation: Department of Computer Science
  • Abstract:
    The state-of-the-art steganalytic features for the spatial domain, and to some extent for transform domains (DCT) as well, are based on histograms of co-occurrences of neighboring elements. The rationale is that neighboring pixels in digital images are correlated, which is caused by the smoothness of our world and by the usual image processing. The limitation of histogram-based features is that they do not scale well with respect to the number of modeled neighboring elements, since the number of histogram bins (and hence the number of features) depends exponentially on this quantity. The remedy adopted by the prior art is to sum the values of neighboring bins together, which can be seen as a vector quantization controlled by the position of the quantization centers. So far, the quantization centers have been determined manually by the steganalyst. Here we propose to use the Linde, Buzo, and Gray algorithm to automatically find quantization centers maximizing the detection accuracy of the resulting features. The quantization centers found by the proposed algorithm are experimentally compared to the ones used by the prior art in the steganalysis of the HUGO algorithm. The results show a non-negligible improvement in accuracy, especially when more complicated filters and higher-order histograms are used.

Detecting anomalous network hosts by means of PCA

  • Authors: doc. Ing. Tomáš Pevný, Ph.D., Grill, M., Rehák, M.
  • Publication: Proceedings of IEEE Workshop on Information Forensics and Security. Piscataway: IEEE, 2012, pp. 103-106. ISBN 978-1-4673-2285-0.
  • Year: 2012
  • Affiliation: Department of Computer Science
  • Abstract:
    This paper focuses on the identification of anomalous hosts within a computer network, with the motivation to detect attacks and other unwanted or suspicious traffic. The proposed detection method does not use the content of packets, which enables the method to be used on encrypted networks. Moreover, the method has very low computational complexity, allowing the fast detection and response that are important for limiting potential damage. The proposed method uses entropies of IP addresses and ports to build two complementary models of a host's traffic based on principal component analysis. These two models are coupled with two orthogonal anomaly definitions, giving four different detectors. The methods are evaluated and compared to prior art on a one-week-long capture of traffic on a university network. The experiments reveal that no single detector can detect all types of anomalies, which is expected and stresses the importance of the ensemble approach to intrusion detection.
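
The entropy features mentioned in the abstract above can be illustrated with a short sketch. It covers only the feature computation, not the PCA models or the anomaly definitions. The flow record layout `(dst_ip, src_port, dst_port)` and the function names are assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def host_features(flows):
    """Entropy of each field over the flows of a single host.

    flows: list of (dst_ip, src_port, dst_port) tuples (hypothetical layout).
    Returns one entropy value per field, in the same order.
    """
    return [shannon_entropy(column) for column in zip(*flows)]


# A host contacting four different addresses on one port from one source port.
scan_like = [("10.0.0.%d" % i, 40000, 22) for i in range(4)]
print(host_features(scan_like))
```

Only packet headers enter the computation, which is why such features remain available on encrypted traffic.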

From Blind to Quantitative Steganalysis

  • Authors: doc. Ing. Tomáš Pevný, Ph.D., Fridrich, Jessica, Ker, Andrew
  • Publication: IEEE Transactions on Information Forensics and Security. 2012, 7(2), 445-454. ISSN 1556-6013.
  • Year: 2012
  • DOI: 10.1109/TIFS.2011.2175918
  • Link: https://doi.org/10.1109/TIFS.2011.2175918
  • Affiliation: Department of Computer Science
  • Abstract:
    A quantitative steganalyzer is an estimator of the number of embedding changes introduced by a specific embedding operation. Since for most algorithms the number of embedding changes correlates with the message length, quantitative steganalyzers are important forensic tools. In this paper, a general method for constructing quantitative steganalyzers from features used in blind detectors is proposed. The core of the method is a support vector regression, which is used to learn the mapping between a feature vector extracted from the investigated object and the embedding change rate. To demonstrate the generality of the proposed approach, quantitative steganalyzers are constructed for a variety of steganographic algorithms in both JPEG transform and spatial domains. The estimation accuracy is investigated in detail and compares favorably with state-of-the-art quantitative steganalyzers.

Identifying a steganographer in realistic and heterogeneous data sets

  • Authors: Ker, Andrew, doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: Proceedings of SPIE, Volume 8303. Pennsylvania State University. University Park: SPIE - International Society for Optical Engineering, 2012. ISSN 0277-786X. ISBN 9780819489500.
  • Year: 2012
  • DOI: 10.1117/12.910565
  • Link: https://doi.org/10.1117/12.910565
  • Affiliation: Department of Computer Science
  • Abstract:
    We consider the problem of universal pooled steganalysis, in which we aim to identify a steganographer who sends many images (some of them innocent) in a network of many other innocent users. The detector must deal with multiple users and multiple images per user, and particularly the differences between cover sources used by different users. Despite being posed for five years, this problem has only previously been addressed by our 2011 paper. We extend our prior work in two ways. First, we present experiments in a new, highly realistic, domain: up to 4000 actors each transmitting up to 200 images, real-world data downloaded from a social networking site. Second, we replace hierarchical clustering by the method called local outlier factor (LOF), giving greater accuracy of detection, and allowing a guilty actor sending moderate payloads to be detected, even amongst thousands of other actors sending hundreds of thousands of images.
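
The local outlier factor (LOF) named in the abstract above can be sketched as follows. This is a minimal version of the standard LOF definition, not the authors' implementation: it works on a precomputed distance matrix (for example, pairwise distances between actors), ignores ties in the k-distance neighbourhood, and assumes no two items are at distance zero. The function name and the toy points are hypothetical.

```python
import math


def lof_scores(dist, k=2):
    """Local outlier factor of every item, given a full distance matrix."""
    n = len(dist)
    # Indices of the k nearest neighbours of each item, excluding itself.
    knn = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # Distance from each item to its k-th nearest neighbour.
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in knn[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbours relative to the item's own density.
    return [
        sum(lrd[j] for j in knn[i]) / (len(knn[i]) * lrd[i]) for i in range(n)
    ]


# Toy example with Euclidean distances: four clustered points and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
dist = [[math.dist(p, q) for q in points] for p in points]
print(lof_scores(dist, k=2))
```

Scores close to 1 indicate an item whose neighbourhood is as dense as that of its neighbours; markedly larger scores indicate an outlier.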

"Break Our Steganographic System" --- the ins and outs of organizing BOSS

  • Authors: Bas, P., Filler, T., doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: Proceedings of the 13th international conference on Information hiding. Heidelberg: Springer, 2011, pp. 59-70. Lecture notes in computer science. ISBN 978-3-642-24177-2.
  • Year: 2011
  • Affiliation: Department of Computer Science
  • Abstract:
    This paper summarizes the first international challenge on steganalysis, called BOSS. We explain the motivations behind the organization of the contest, its rules together with the reasons for them, and the steganographic algorithm developed for the contest. Since the image databases created for the contest significantly influenced its development, they are described in great detail. The paper also presents a detailed analysis of the results submitted to the challenge. One of the main difficulties of the contest was the discrepancy between the training and testing sources of images -- the so-called cover-source mismatch, which forced the participants to design steganalyzers robust w.r.t. a specific source of images. We also point to other practical issues related to designing steganographic systems and give several suggestions for future contests in steganalysis.

A New Paradigm for Steganalysis via Clustering

  • Authors: Ker, A.D., doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: Proceedings of SPIE Volume: 7880. Bellingham: SPIE, 2011. pp. 78800U-78813U. ISSN 0277-786X. ISBN 978-0-8194-8417-8.
  • Year: 2011
  • Affiliation: Department of Computer Science
  • Abstract:
    We propose a new paradigm for blind, universal, steganalysis in the case when multiple actors transmit multiple objects, with guilty actors including some stego objects in their transmissions. The method is based on clustering rather than classification, and it is the actors which are clustered rather than their individual transmitted objects. This removes the need for training a classifier, and the danger of training model mismatch. It effectively judges the behaviour of actors by assuming that most of them are innocent: after performing agglomerative hierarchical clustering, the guilty actor(s) are clustered separately from the innocent majority. A case study shows that this works in the case of JPEG images. Although it is less sensitive than steganalysis based on specifically-trained classifiers, it requires no training, no knowledge of the embedding algorithm, and attacks the pooled steganalysis problem.

Detecting messages of unknown length

  • Authors: doc. Ing. Tomáš Pevný, Ph.D.
  • Publication: Proceedings of SPIE Volume: 7880. Bellingham: SPIE, 2011. pp. 78800T-78812T. ISSN 0277-786X. ISBN 978-0-8194-8417-8.
  • Year: 2011
  • Affiliation: Department of Computer Science
  • Abstract:
    This work focuses on the problem of developing a blind steganalyzer (a steganalyzer relying on a machine learning algorithm and steganalytic features) for detecting stego images with different payloads. This problem is highly relevant for practical forensic analysis, since in practice the knowledge about the steganographic channel is very limited, and the length of the hidden message is generally unknown. This paper demonstrates that the discrepancy between the payload in training and testing / application images can significantly decrease the accuracy of the steganalysis. Two fundamentally different approaches to mitigate this problem are then proposed. The first solution relies on a quantitative steganalyzer. The second solution transforms a one-sided hypothesis test (unknown message length) into a simple hypothesis test by assuming a probability distribution on the length of messages, which can be efficiently solved by many machine-learning tools, e.g. by Support Vector Machines. The experimental section of the paper (a) compares both solutions on steganalysis of the F5 algorithm with shrinkage removed by wet paper codes for JPEG images and LSB matching for raw (uncompressed) images, (b) investigates the effect of the assumed distribution of the message length on the accuracy of the steganalyzer, and (c) shows how the accuracy of steganalysis depends on Eve's knowledge about details of the steganographic channel.

Modern Steganalysis Can Detect YASS

  • Authors: Kodovský, J., doc. Ing. Tomáš Pevný, Ph.D., Fridrich, J.
  • Publication: Media Forensics and Security II. Washington: SPIE, 2010. pp. 1-11. ISSN 0277-786X. ISBN 978-0-8194-7934-1.
  • Year: 2010
  • DOI: 10.1117/12.838768
  • Link: https://doi.org/10.1117/12.838768
  • Affiliation: Department of Cybernetics
  • Abstract:
    YASS is a steganographic algorithm for digital images that hides messages robustly in a key-dependent transform domain so that the stego image can be subsequently compressed and distributed as JPEG. Given the fact that state-of-the-art blind steganalysis methods of 2007, when YASS was proposed, were unable to reliably detect YASS, in this paper we steganalyze YASS using several recently proposed general-purpose steganalysis feature sets. The focus is on blind attacks that do not capitalize on any weakness of a specific implementation of the embedding algorithm. We demonstrate experimentally that twelve different settings of YASS can be reliably detected even for small embedding rates and in small images. Since none of the steganalysis feature sets is in any way targeted to the embedding of YASS, future modifications of YASS will likely be detectable by them as well.

Steganalysis by Subtractive Pixel Adjacency Matrix

  • Authors: doc. Ing. Tomáš Pevný, Ph.D., Bas, P., Fridrich, J.
  • Publication: IEEE Transactions on Information Forensics and Security. 2010, 5(2), 215-224. ISSN 1556-6013.
  • Year: 2010
  • DOI: 10.1109/TIFS.2010.2045842
  • Link: https://doi.org/10.1109/TIFS.2010.2045842
  • Affiliation: Department of Cybernetics
  • Abstract:
    This paper presents a method for detection of steganographic methods that embed in the spatial domain by adding a low-amplitude independent stego signal, an example of which is least significant bit (LSB) matching. First, arguments are provided for modeling the differences between adjacent pixels using first-order and second-order Markov chains. Subsets of sample transition probability matrices are then used as features for a steganalyzer implemented by support vector machines. The major part of experiments, performed on four diverse image databases, focuses on evaluation of detection of LSB matching. The comparison to prior art reveals that the presented feature set offers superior accuracy in detecting LSB matching. Even though the feature set was developed specifically for spatial domain steganalysis, by constructing steganalyzers for ten algorithms for JPEG images, it is demonstrated that the features detect steganography in the transform domain as well.
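
The first-order Markov model of pixel differences described in the abstract above can be sketched as follows. The sketch makes several simplifying assumptions that the abstract does not state: only the left-to-right horizontal direction is used, difference pairs with a value outside `[-T, T]` are skipped, and the transition probabilities are normalised over the retained pairs only. The function name and the default `T` are hypothetical.

```python
def spam_first_order(image, T=2):
    """Empirical transition matrix of horizontal pixel differences.

    image: list of rows, each a list of integer pixel values.
    Returns a dict {(u, v): P(next difference = v | current difference = u)}
    for all u, v in [-T, T], giving (2T + 1)**2 features.
    """
    pair_counts = {}
    row_totals = {}
    for row in image:
        # Differences between horizontally adjacent pixels.
        diffs = [row[j] - row[j + 1] for j in range(len(row) - 1)]
        for u, v in zip(diffs, diffs[1:]):
            if abs(u) > T or abs(v) > T:
                continue  # assumed truncation: ignore large differences
            pair_counts[(u, v)] = pair_counts.get((u, v), 0) + 1
            row_totals[u] = row_totals.get(u, 0) + 1
    values = range(-T, T + 1)
    return {
        (u, v): pair_counts.get((u, v), 0) / row_totals[u] if row_totals.get(u) else 0.0
        for u in values
        for v in values
    }


# One image row with differences 0, -1, 0, -1.
features = spam_first_order([[10, 10, 11, 11, 12]], T=2)
print(features[(0, -1)], features[(-1, 0)])
```

A second-order model would condition on two preceding differences instead of one, which raises the feature count to `(2T + 1)**3` per direction.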

Using High-Dimensional Image Models to Perform Highly Undetectable Steganography

  • Authors: doc. Ing. Tomáš Pevný, Ph.D., Filler, T., Bas, P.
  • Publication: Information Hiding, Lecture Notes in Computer Science. Berlin: Springer, 2010. pp. 161-177. ISSN 0302-9743. ISBN 978-3-642-16434-7.
  • Year: 2010
  • DOI: 10.1007/978-3-642-16435-4_13
  • Link: https://doi.org/10.1007/978-3-642-16435-4_13
  • Affiliation: Department of Cybernetics
  • Abstract:
    This paper presents a complete methodology for designing practical and highly undetectable stegosystems for real digital media. The main design principle is to minimize a suitably defined distortion by means of an efficient coding algorithm. The distortion is defined as a weighted difference of extended state-of-the-art feature vectors already used in steganalysis. This allows us to "preserve" the model used by the steganalyst and thus be undetectable even for large payloads. This framework can be efficiently implemented even when the dimensionality of the feature set used by the embedder is larger than 10^7. The high-dimensional model is necessary to avoid known security weaknesses. Although high-dimensional models might be a problem in steganalysis, we explain why they are acceptable in steganography. As an example, we introduce HUGO, a new embedding algorithm for spatial-domain digital images, and we contrast its performance with LSB matching.
