We present ChunkyGAN, a novel paradigm for modeling and editing images using generative adversarial networks. Unlike previous techniques that seek a global latent representation of the input image, our approach subdivides the input image into a set of smaller components (chunks), specified either manually or automatically using a pre-trained segmentation network. For each chunk, the latent code of a generative network is estimated locally with greater accuracy thanks to the smaller number of constraints. Moreover, during the optimization of the latent codes, the segmentation can be further refined to improve matching quality. This process enables a high-quality projection of the original image with a spatial disentanglement that previous methods would find challenging to achieve. To demonstrate the advantages of our approach, we evaluate it quantitatively and qualitatively in various image editing scenarios that benefit from the higher reconstruction quality and the local nature of the approach. Our method is flexible enough to manipulate even out-of-domain images that would be hard to reconstruct using global techniques.
We propose a CNN-based approach to classifying ten genres of ballroom dance from audio recordings, five Latin and five standard, namely Cha Cha Cha, Jive, Paso Doble, Rumba, Samba, Quickstep, Slow Foxtrot, Slow Waltz, Tango and Viennese Waltz. We compute a spectrogram of the audio signal and treat it as an image that forms the input of the CNN. Classification is performed independently on 5-second spectrogram segments in a sliding-window fashion, and the results are then aggregated. The method was tested on the following datasets: the publicly available Extended Ballroom dataset collected by Marchand and Peeters (2016), and two YouTube datasets collected by us, one in studio quality and the other, more challenging, recorded on mobile phones. The method achieved accuracies of 93.9%, 96.7% and 89.8%, respectively. The method runs in real time. We implemented a web application to demonstrate the proposed method.
We propose a neural network which takes two inputs, a hair image and a face image, and produces an output image in which the hair of the hair image is seamlessly merged with the inner face of the face image. Our architecture consists of neural networks mapping the input images into the latent code of a pretrained StyleGAN2, which generates the high-definition output image. We propose an algorithm for training the parameters of the architecture solely from synthetic images generated by the StyleGAN2 itself, without the need for any annotations or an external dataset of hairstyle images. We empirically demonstrate the effectiveness of our method in applications including hairstyle transfer, hair generation for 3D morphable models, and hairstyle interpolation. The fidelity of the generated images is verified by a user study and by a novel hairstyle metric proposed in the paper.
Self-Supervised Learning of Camera-based Drivable Surface Friction
A visual predictor of drivable-surface friction ahead of the vehicle is presented. The image-recognition neural network is trained in a self-supervised fashion, as an alternative to tedious, error-prone, and subjective human annotation. The training images are labelled automatically by surface-friction estimates obtained from the vehicle response during ordinary driving. An Unscented Kalman Filter is used to estimate the tire-to-road interface friction parameters, taking into account the highly nonlinear nature of tire dynamics. Finally, the overall toolchain was validated using an experimental subscale platform and real-world driving scenarios. The resulting visual predictor was trained on about 3,000 images and validated on an unseen set of 800 test images, achieving a cross-correlation of 0.98 between the visually predicted and the estimated surface friction.
Self-Supervised Learning of Camera-based Drivable Surface Roughness
A self-supervised method to train a visual predictor of drivable surface roughness in front of a vehicle is proposed. A convolutional neural network taking a single camera image is trained on a dataset labeled automatically by cross-modal supervision. The dataset is collected by driving a vehicle on various surfaces while synchronously recording images and accelerometer data. The surface images are labeled by the local roughness measured from the time-aligned accelerometer signal. Our experiments show that the proposed training scheme results in an accurate visual predictor: the correlation coefficient between the visually predicted roughness and the true roughness (measured by the accelerometer) is 0.9 on our independent test set of about 1,000 images. The proposed method clearly outperforms a baseline based on surface-texture strength without any training, which achieves a correlation of only 0.3. Moreover, we show a coarse map of local surface roughness, obtained by scanning an input image with the trained convolutional network. The proposed method provides automatic and objective road-condition assessment, offering a cheap and reliable alternative to manual data annotation, which is infeasible at a large scale.
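The cross-modal labeling step can be sketched as follows; the RMS-based roughness measure, the time window, and the function name are illustrative assumptions, not the paper's exact definition:

```python
import numpy as np

def label_images_by_roughness(img_times, acc_times, acc_z, half_window=0.5):
    """For each image timestamp, take the vertical-acceleration samples within
    +/- half_window seconds and use their RMS deviation from the local mean
    as the automatic roughness label for that image."""
    labels = []
    for t in img_times:
        seg = acc_z[np.abs(acc_times - t) <= half_window]
        labels.append(np.sqrt(np.mean((seg - seg.mean()) ** 2)))
    return np.asarray(labels)
```

The labeled (image, roughness) pairs would then serve as regression targets when training the convolutional network, with no human annotation involved.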
We developed an autonomous driving system that can chase another vehicle using only images from a single RGB camera. At the core of the system is a novel dual-task convolutional neural network simultaneously performing object detection and coarse semantic segmentation. The system was first tested in CARLA simulations. We created a new challenging, publicly available CARLA Car Chasing Dataset, collected by manually driving the chased car. Using the dataset, we showed that the version of the system that uses semantic segmentation was able to chase the pursued car on average 16% longer than the other versions. Finally, we integrated the system into a sub-scale vehicle platform built on a high-speed RC car and demonstrated its capabilities by autonomously chasing another RC car.
The concept of a visually-assisted anti-lock braking system (ABS) is presented. The road conditions in front of the vehicle are assessed in real time based on camera data. The surface type classification (tarmac, gravel, ice, etc.) and related road friction properties are provided to the braking control algorithm in order to adjust the vehicle response accordingly. The system was implemented and tested in simulations as well as on an instrumented sub-scale vehicle. Simulations and experimental results quantitatively demonstrate the benefits of the proposed system in critical maneuvers, such as emergency braking and collision avoidance.
In situations when potentially costly decisions are being made, people's faces tend to reflect their level of certainty about the appropriateness of the chosen decision; this fact is known from the psychological literature. In this paper, we propose a method that uses facial images for automatic detection of the decision-ambiguity state of a subject. To train and test the method, we collected a large-scale dataset from "Who Wants to Be a Millionaire?", a popular TV game show. The videos provide examples of various mental states of contestants, including uncertainty, doubt and hesitation. The annotation of the videos is done automatically from on-screen graphics. The problem of detecting decision ambiguity is formulated as binary classification: video clips in which a contestant asks for help (audience, friend, 50:50) are positive samples, and clips in which the contestant answers directly are negative ones. We propose a baseline method combining a deep convolutional neural network with an SVM. The method has an error rate of 24%; the error of human volunteers on the same dataset is 45%, close to chance.
Learning convolutional neural networks (CNNs) to perform a face recognition task requires a large set of facial images, each annotated with the label to be predicted. In this paper we propose a method for learning CNNs from weakly annotated images. Weak annotation in our setting means that a pair of an attribute label and a person-identity label is assigned to the set of faces automatically detected in an image; the challenge is to link the annotation with the correct face. Weakly annotated images of this type can be collected by an automated process requiring no human labor. We formulate learning from weakly annotated images as maximum-likelihood (ML) estimation of a parametric distribution describing the weakly annotated images. The ML problem is solved by an instance of the EM algorithm, which in its inner loop learns a CNN to predict the attribute label from facial images. Experiments on age and gender estimation show that the proposed algorithm significantly outperforms the existing heuristic approach for dealing with this type of data. A practical outcome of our paper is a new annotation of the IMDB database containing 300k faces, each annotated with biological age, gender and identity labels.
Non-contact reflectance photoplethysmography: Progress, limitations, and myths
Photoplethysmography (PPG) is a non-invasive method of measuring changes of blood volume in human tissue. The literature on non-contact reflectance PPG related to cardiovascular activity is extensively reviewed. We identify the key factors limiting the performance of PPG methods and the reproducibility of the research: a lack of publicly available datasets and incomplete descriptions of the data used in published experiments (missing details on video compression, lighting setup and subjects' skin types), the use of unreliable pulse-oximeter devices for ground-truth reference, and missing standard experimental protocols. Two experiments with 5 participants are presented, showing that the quality of the reconstructed signal (1) is adversely affected by a reduction of spatial resolution, which also amplifies the effects of H.264 video compression, and (2) is improved by precise pixel-to-pixel stabilization.
Visual Heart Rate Estimation with Convolutional Neural Network
We propose a novel two-step convolutional neural network to estimate heart rate from a sequence of facial images. The network is trained end-to-end by alternating optimization and validated on three publicly available datasets, yielding state-of-the-art results against three baseline methods. On a newly collected dataset, the network outperforms the state-of-the-art method by a 40% margin. We introduce a challenging dataset of 204 fitness-themed videos, designed to test the robustness of heart-rate estimation methods to illumination changes and subject motion. 17 subjects perform 4 activities (talking, rowing, exercising on a stationary bike and on an elliptical trainer) in 3 lighting setups. Each activity is captured by two RGB web cameras, one placed on a tripod and the other attached to the fitness machine, which vibrates significantly. The subjects' ages range from 20 to 53 years; the mean heart rate is ≈ 110 bpm with a standard deviation of ≈ 25 bpm.
Learning CNNs for face recognition from weakly annotated images
Supervised learning of convolutional neural networks (CNNs) for face recognition requires a large set of facial images, each annotated with a single attribute label to be predicted. In this paper we propose a method for learning CNNs from weakly annotated images. Weak annotation in our setting means that a pair of an attribute label and a person-identity label is assigned to the set of faces automatically detected in an image; the challenge is to link the annotation with the correct face. Weakly annotated images of this type can be collected by an automated process requiring no human labor. We formulate learning from weakly annotated images as maximum-likelihood estimation of a parametric distribution describing the data. The ML problem is solved by an instance of the EM algorithm, which in its inner loop learns a CNN to perform the given face recognition task. Experiments on age and gender estimation show that the proposed EM-CNN algorithm significantly outperforms the state-of-the-art approach for dealing with this type of data.
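A minimal sketch of the EM loop described above, assuming attribute labels indexed 0..K-1 and any classifier exposing `fit(X, y, sample_weight)` and `predict_proba(X)` as a stand-in for the CNN; all names and the exact update rule are illustrative, not the paper's formulation:

```python
import numpy as np

def em_link_labels(images_faces, labels, model, n_iter=5):
    """images_faces[i] is a list of face feature vectors detected in image i;
    labels[i] is the single attribute label known to belong to exactly one
    of them.  EM alternates between fitting the classifier and re-estimating
    which face carries the annotation."""
    # responsibilities: probability that each detected face is the annotated one
    resp = [np.full(len(f), 1.0 / len(f)) for f in images_faces]
    for _ in range(n_iter):
        # M-step: fit on all faces, each example weighted by its responsibility
        X = np.vstack([np.vstack(f) for f in images_faces])
        y = np.concatenate([[lab] * len(f)
                            for f, lab in zip(images_faces, labels)])
        model.fit(X, y, sample_weight=np.concatenate(resp))
        # E-step: responsibilities proportional to the model's probability
        # of the annotated label for each face (labels index proba columns)
        resp = []
        for faces, label in zip(images_faces, labels):
            p = model.predict_proba(np.vstack(faces))[:, label]
            resp.append(p / p.sum() if p.sum() > 0
                        else np.full(len(faces), 1.0 / len(faces)))
    return model, resp
```

In the paper the inner-loop learner is a CNN; here any weighted classifier demonstrates the linking mechanism.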
Micro-expressions are quick facial motions appearing in high-stakes and stressful situations, typically when a subject tries to hide his or her emotions. Two attributes characterize them: short duration and low intensity. A simple detection method is proposed which determines the instants of micro-expressions in a video. The method is based on analyzing image intensity differences over a registered face sequence; the specific temporal pattern is detected by an SVM classifier. The results are evaluated on the standard micro-expression datasets SMIC-E and CASME II, where the proposed method outperformed competing methods in detection accuracy. Further, we collected a new real micro-expression dataset of mostly poker-game videos downloaded from YouTube. We achieved an average cross-validation AUC of 0.88 on SMIC and 0.81 on the new, challenging "in the wild" database.
Visual Descriptors in Methods for Video Hyperlinking
In this paper, we survey different state-of-the-art visual processing methods and utilize them in video hyperlinking. Visual information, computed using Feature Signatures, SIMILE descriptors and convolutional neural networks (CNNs), is used as a similarity measure between video frames to find similar faces, objects and settings. Visual concepts in frames are also automatically recognized, and the textual output of the recognition is combined with search based on subtitles and transcripts. All presented experiments were performed in the Search and Hyperlinking 2014 MediaEval task and the Video Hyperlinking 2015 TRECVid task.
Visual Language Identification from Facial Landmarks
Automatic Visual Language IDentification (VLID), i.e. the problem of identifying the language being spoken from visual information only, with no audio, is studied. The proposed method employs facial landmarks automatically detected in a video. A convex optimisation problem is formulated to find jointly both the discriminative representation (a soft histogram over a set of lip shapes) and the classifier. A 10-fold cross-validation on a dataset of 644 videos collected from youtube.com yields an accuracy of 73% in pairwise discrimination between English and French (chance being 50%). A study in which 10 videos were used suggests that the proposed method performs better than the average human in discriminating between the languages.
Multi-view facial landmark detection by using a 3D shape model
An algorithm for accurate localization of facial landmarks coupled with head-pose estimation from a single monocular image is proposed. The algorithm is formulated as an optimization problem in which the sum of individual landmark scoring functions is maximized with respect to the camera pose by fitting a parametric 3D shape model. The landmark scoring functions are trained by a structured-output SVM classifier that takes the distance to the true landmark position into account during learning. The optimization criterion is non-convex, and we propose a robust initialization scheme that employs a global method to detect a raw but reliable initial landmark position. Self-occlusions causing landmark invisibility are handled explicitly by excluding the corresponding contributions from the data term, which allows the algorithm to operate correctly over a large range of viewing angles. Experiments on standard ``in-the-wild'' datasets demonstrate that the proposed algorithm outperforms several state-of-the-art landmark detectors, especially for non-frontal face images. The algorithm achieves an average relative landmark localization error below 10% of the interocular distance on 98.3% of the 300W dataset test images.
Real-Time Eye Blink Detection using Facial Landmarks
A real-time algorithm to detect eye blinks in a video sequence from a standard camera is proposed. Recent landmark detectors, trained on in-the-wild datasets, exhibit excellent robustness to head orientation with respect to the camera, varying illumination and facial expressions. We show that the landmarks are detected precisely enough to reliably estimate the level of eye opening. The proposed algorithm therefore estimates the landmark positions and extracts a single scalar quantity, the eye aspect ratio (EAR), characterizing the eye opening in each frame. Finally, an SVM classifier detects eye blinks as a pattern of EAR values in a short temporal window. The simple algorithm outperforms state-of-the-art results on two standard datasets.
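The EAR quantity can be computed from the six eye landmarks; this minimal sketch assumes the standard ordering p1..p6 with p1 and p4 being the horizontal eye corners, p2, p3 the upper lid and p6, p5 the lower lid:

```python
import numpy as np

def eye_aspect_ratio(eye):
    """EAR = (|p2-p6| + |p3-p5|) / (2 |p1-p4|): roughly constant while the
    eye is open, and dropping toward zero as the eye closes."""
    p1, p2, p3, p4, p5, p6 = np.asarray(eye, dtype=float)
    return (np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)) / \
           (2.0 * np.linalg.norm(p1 - p4))
```

Because the ratio normalizes the vertical lid distances by the eye width, it is largely invariant to face scale and in-plane rotation, which is what makes a fixed-pattern SVM over a short EAR window workable.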
Continuous Action Recognition Based on Sequence Alignment
Continuous action recognition is more challenging than isolated recognition because classification and segmentation must be carried out simultaneously. We build on the well-known dynamic time warping framework and devise a novel visual alignment technique, namely dynamic frame warping (DFW), which performs isolated recognition based on a per-frame representation of videos and on aligning a test sequence with a model sequence. Moreover, we propose two extensions that enable recognition concomitant with segmentation, namely one-pass DFW and two-pass DFW. These two methods have their roots in the domain of continuous speech recognition and, to the best of our knowledge, their extension to continuous visual action recognition has been overlooked. We test and illustrate the proposed techniques on a recently released dataset (RAVEL) and on two public-domain datasets widely used in action recognition (Hollywood-1 and Hollywood-2). We also compare the performance of the proposed isolated and continuous recognition algorithms with several recently published methods.
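The dynamic-time-warping alignment that DFW builds on can be sketched as a textbook dynamic program (this is the generic DTW recurrence over per-frame features, not the paper's DFW itself):

```python
import numpy as np

def dtw_distance(x, y, dist=lambda a, b: np.linalg.norm(a - b)):
    """DTW distance between two sequences of per-frame feature vectors.
    D[i, j] is the cost of the best alignment of x[:i] with y[:j], built
    from match, insertion, and deletion moves."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(x[i - 1], y[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Isolated recognition then amounts to aligning a test sequence against each model sequence and picking the smallest alignment cost; the one-pass and two-pass extensions additionally segment a continuous stream during the alignment.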
Similarity Search and Applications, Proc. 8th International Conference, SISAP 2015. Cham: Springer, 2015. p. 237-243. Lecture Notes in Computer Science. ISSN 0302-9743. ISBN 978-3-319-25086-1.
We present an efficiency evaluation of similarity-search techniques applied to visual features from deep neural networks. Our test collection consists of 20 million 4096-dimensional descriptors (320 GB of data). We test approximate k-NN search using several techniques, specifically the FLANN library (a popular in-memory implementation of a k-d tree forest), the M-Index (which uses recursive Voronoi partitioning of a metric space), and PPP-Codes, which work with memory codes of metric objects and use disk storage for candidate refinement. Our evaluation shows that as long as the data fit in main memory, FLANN and the M-Index have practically the same ratio between precision and response time. The PPP-Codes identify candidate sets ten times smaller than the other techniques, with response times around 500 ms for the whole 20M dataset stored on disk. The visual search with this index is available as an online demo application. The collection of 20M descriptors is provided as a public dataset to the academic community.
Erratum to: Continuous Action Recognition Based on Sequence Alignment
A real-time algorithm for accurate localization of facial landmarks in a single monocular image is proposed. The algorithm is formulated as an optimization problem in which the sum of responses of local classifiers is maximized with respect to the camera pose by fitting a generic (not person-specific) 3D model. The algorithm simultaneously estimates the head position and orientation and detects the facial landmarks in the image. Despite being local, we show that the basin of attraction is large enough that the algorithm can be initialized by a scanning-window face detector. Further experiments on standard datasets demonstrate that the proposed algorithm outperforms a state-of-the-art landmark detector, especially for non-frontal face images, and that it is capable of reliable and stable tracking over a large range of viewing angles.
Audio-Video Speaker Diarization for Unsupervised Speaker and Face Model Creation
Text, Speech, and Dialogue. 17th International Conference, TSD 2014. Heidelberg: Springer, 2014. p. 465-472. Lecture Notes in Artificial Intelligence. ISSN 0302-9743. ISBN 978-3-319-10815-5.
Our goal is to create speaker models in the audio domain and face models in the video domain from a set of videos in an unsupervised manner. Such models can later be used for speaker identification in the audio domain (answering the question "Who was speaking and when") and/or for face recognition ("Who was seen and when") in videos that contain speaking persons. The proposed system is based on an audio-video diarization system that tries to resolve the disadvantages of the individual modalities. Experiments on broadcasts of Czech parliament meetings show that the proposed combination of individual audio and video diarization systems yields an improvement in the diarization error rate (DER).
Active-Speaker Detection and Localization with Microphones and Cameras Embedded into a Robotic Head
Authors: Ing. Jan Čech, Ph.D., Mittal, R., Deleforge, A., Sanchez-Riera, J., Alameda-Pineda, X., Horaud, R.
Proc. Humanoids 2013: IEEE International Conference on Humanoid Robots. Piscataway: IEEE Robotics and Automation Society, 2013. p. 203-210. ISBN 978-1-4799-2618-3.
In this paper we present a method for detecting and localizing an active speaker, i.e., a speaker that emits a sound, through the fusion of visual reconstruction with a stereoscopic camera pair and sound-source localization with several microphones. Both the cameras and the microphones are embedded into the head of a humanoid robot. The proposed statistical fusion model associates 3D faces of potential speakers with 2D sound directions. The paper has two contributions: (i) a method that discretizes the two-dimensional space of all possible sound directions and accumulates evidence for each direction by estimating the time difference of arrival (TDOA) over all the microphone pairs, such that all the microphones are used simultaneously and symmetrically, and (ii) an audio-visual alignment method that maps 3D visual features onto 2D sound directions and onto TDOAs between microphone pairs. This allows both sensing modalities to be implicitly represented in a common audiovisual coordinate frame. Using simulated as well as real data, we quantitatively assess the robustness of the method against noise and reverberation, and we compare it with several other methods. Finally, we describe a real-time implementation using the proposed technique and a humanoid head embedding four microphones and two cameras: this enables natural human-robot interactive behavior.
RAVEL: an annotated corpus for training robots with audiovisual abilities
We introduce Ravel (Robots with Audiovisual Abilities), a publicly available data set which covers examples of Human Robot Interaction (HRI) scenarios. These scenarios are recorded using the audio-visual robot head POPEYE, equipped with two cameras and four microphones, two of which being plugged into the ears of a dummy head. All the recordings were performed in a standard room with no special equipment, thus providing a challenging indoor scenario. This data set provides a basis to test and benchmark methods and algorithms for audio-visual scene analysis with the ultimate goal of enabling robots to interact with people in the most natural way. The data acquisition setup, sensor calibration, data annotation and data content are fully detailed. Moreover, three examples of using the recorded data are provided, illustrating its appropriateness for carrying out a large variety of HRI experiments. The Ravel data are publicly available at: http://ravel.humavips.eu/.
Efficient Sequential Correspondence Selection by Cosegmentation
In object recognition and wide-baseline stereo methods, correspondences of interest points (distinguished regions) are commonly established by matching compact descriptors such as SIFTs. We show that a subsequent cosegmentation process coupled with a quasi-optimal sequential decision process leads to a correspondence verification procedure that (i) has high precision, (ii) has good recall and (iii) is fast. The sequential decision on the correctness of a correspondence is based on simple statistics of a modified dense stereo matching algorithm. The statistics are projected onto a prominent discriminative direction by an SVM. Wald's sequential probability ratio test is performed on the SVM projection computed on progressively larger cosegmented regions. We show experimentally that the proposed Sequential Correspondence Verification (SCV) algorithm significantly outperforms the correspondence selection method based on SIFT distance ratios.
Languages for Constrained Binary Segmentation Based on Maximum A Posteriori Probability Labeling
An MRF with asymmetric pairwise compatibility constraints between direct pixel neighbors solves a constrained binary image segmentation task. The model constrains the shape and alignment of individual contiguous binary segments by introducing auxiliary labels and their pairwise interactions. Such a representation is not necessarily unique. We study several ad-hoc labeling models for binary images consisting of non-overlapping rectangular contiguous regions. We observed a noticeable increase in performance even in cases where the differences between the models were seemingly insignificant. We use the proposed models for segmentation of windowpanes and windows in orthographically rectified facade images. We show experimentally that even a very weak data model in the MAP formulation of the optimal segmentation problem gives very good segmentation results.
Efficient Sequential Correspondence Selection by Cosegmentation
CVPR 2008: Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Medison: Omnipress, 2008. pp. 1020-1027. ISSN 1063-6919. ISBN 978-1-4244-2242-5.
In many retrieval, object recognition and wide-baseline stereo methods, correspondences of interest points (distinguished regions, transformation-covariant points) are established, possibly sublinearly, by matching a compact descriptor such as SIFT. We show that a subsequent cosegmentation process coupled with a quasi-optimal sequential decision process leads to a correspondence verification procedure that has (i) high precision (is highly discriminative), (ii) good recall and (iii) is fast. The sequential decision on the correctness of a correspondence is based on simple attributes of a modified dense stereo matching algorithm. The attributes are projected onto a prominent discriminative direction by an SVM. Wald's sequential probability ratio test is performed on the SVM projection computed on progressively larger cosegmented regions. Experimentally we show that the process significantly outperforms the standard correspondence selection process based on SIFT distance ratios on challenging matching problems.
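Wald's sequential test at the core of the verification can be sketched as follows; the thresholds follow the standard SPRT formulas, and the stream of log-likelihood ratios (one per progressively larger cosegmented region, here an assumed input) would come from the SVM projection:

```python
import math

def sprt(llr_stream, alpha=0.01, beta=0.01):
    """Wald's sequential probability ratio test: accumulate log-likelihood
    ratios of 'correct' vs 'incorrect' correspondence and stop as soon as
    one of the two thresholds is crossed (alpha, beta bound the error rates)."""
    A = math.log((1 - beta) / alpha)   # accept 'correct' when sum >= A
    B = math.log(beta / (1 - alpha))   # accept 'incorrect' when sum <= B
    s = 0.0
    for llr in llr_stream:
        s += llr
        if s >= A:
            return "correct"
        if s <= B:
            return "incorrect"
    return "undecided"
```

The sequential nature is what makes the verification fast: clearly right or wrong correspondences are decided after inspecting only a small cosegmented region, and effort is spent only on ambiguous ones.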
Performance Evaluation of Building Detection and Digital Surface Model Extraction Algorithms: Outcomes of the PRRS 2008 Algorithm Performance Contest
Aksoy, S., Özdemir, B., Eckert, S., Kayitakire, F., Pesaresi, M., Aytekin, O., Borel, C.C., Ing. Jan Čech, Ph.D., Christophe, E., Düzgün, S., Erener, A., Ertugay, K., Hussain, E., Inglada, J., Lefévre, S., Ok, Ö., Koc, D., doc. Dr. Ing. Radim Šára, Shan, J., Soman, J., Ulusoy, I., Witz, R.
PRRS 2008: Proceedings of the 5th IAPR Workshop on Pattern Recognition in Remote Sensing. Piscataway: IEEE, 2008. p. 37-48. ISBN 978-1-4244-2653-9.
This paper presents the initial results of the Algorithm Performance Contest that was organized as part of the 5th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS 2008). The focus of the 2008 contest was automatic building detection and digital surface model (DSM) extraction. A QuickBird data set with manual ground truth was used for building detection evaluation, and a stereo Ikonos data set with a highly accurate reference DSM was used for DSM extraction evaluation. Nine submissions were received for the building detection task, and three submissions were received for the DSM extraction task. We provide an overview of the data sets, the summaries of the methods used for the submissions, the details of the evaluation criteria, and the results of the initial evaluation.
Windowpane Detection based on Maximum Aposteriori Probability Labeling
Image Analysis - From Theory to Applications, Proceedings of the 12th International Workshop on Combinatorial Image Analysis (IWCIA'08). Singapore: Research Publishing Services, 2008. pp. 3-11. ISBN 978-3-540-78274-2.
Segmentation of windowpanes in images of building facades is formulated as a task of maximum a posteriori probability labeling. Assuming orthographic rectification of the image, the windowpanes are always axis-parallel rectangles of relatively low variability in appearance. Every image pixel has one of 10 possible labels, and the labels of adjacent pixels are constrained to allowed configurations, such that the image labels represent a set of non-overlapping rectangles. Finding the most probable labeling of a given image leads to an NP-hard discrete optimization problem. However, we find an approximate solution using a general solver suitable for such problems and obtain promising results, which we demonstrate in several experiments. A substantial difference between the presented paper and the state-of-the-art papers on segmentation based on Markov random fields is that we have a strong structure model forcing the labels to form rectangles, while other methods do not model such structure.
Efficient Sampling of Disparity Space for Fast and Accurate Matching
A simple stereo matching algorithm is proposed that visits only a small fraction of the disparity space in order to find a semi-dense disparity map. It works by growing from a small set of correspondence seeds. Unlike known seed-growing algorithms, it guarantees matching accuracy and correctness even in the presence of repetitive patterns, thanks to the fact that it solves a global optimization task. The algorithm can recover from wrong initial seeds to the extent that they can even be random. The quality of the correspondence seeds influences computing time, not the quality of the final disparity map. We show that the proposed algorithm achieves results similar to an exhaustive disparity space search but is two orders of magnitude faster, very unlike the existing growing algorithms, which are fast but erroneous.
Feasibility Boundary in Dense and Semi-Dense Stereo Matching
In the stereo literature, there is no standard method for evaluating semi-dense stereo matching algorithms. Moreover, existing evaluations of dense methods require a fixed parameter setting for the tested algorithms. In this paper, we propose a method that overcomes these drawbacks and is still able to compare algorithms using a simple numerical value, so that reporting results does not take up much space in a paper. We propose an evaluation of stereo algorithms based on Receiver Operating Characteristics (ROC), which captures both errors and sparsity. By comparing the ROC curves of all tested algorithms we obtain the Feasibility Boundary, the best possible performance achieved by the set of tested stereo algorithms, which allows stereo-algorithm users to select the proper method and parameter setting for a given application.
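The ROC-style evaluation can be sketched as follows; the input format and the Pareto-envelope construction are illustrative assumptions, not the paper's exact procedure:

```python
import math

def roc_points(results):
    """Each result is (n_matched, n_errors, n_total) for one parameter
    setting of a stereo algorithm; return (sparsity, error_rate) pairs."""
    pts = []
    for n_matched, n_errors, n_total in results:
        sparsity = 1.0 - n_matched / n_total
        error = n_errors / n_matched if n_matched else 0.0
        pts.append((sparsity, error))
    return pts

def feasibility_boundary(points):
    """Lower envelope over all algorithms' ROC points: keep only the
    operating points not dominated (in both sparsity and error) by another."""
    best, env = math.inf, []
    for s, e in sorted(points):
        if e < best:       # strictly better error than any sparser point
            env.append((s, e))
            best = e
    return env
```

A point on the envelope tells a user the lowest error any tested algorithm and parameter setting achieved at a given sparsity, which is exactly the trade-off a semi-dense method needs to report.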
Complex Correlation Statistic for Dense Stereoscopic Matching
A traditional solution of area-based stereo uses some kind of windowed pixel-intensity correlation. We introduce a new correlation statistic which is completely invariant to image sampling and, moreover, naturally provides the position of the correlation maximum between pixels. Thereby we can obtain sub-pixel disparity directly from sampling-invariant and highly discriminable measurements, without any postprocessing of the discrete disparity map.
Theory and Robust Algorithm of Trinocular Rectification
The main contributions are two-fold. Firstly, theoretical analyses are carried out on trinocular rectification, including the relationship among the three rectified images and their three fundamental matrices, and a geometric interpretation of the 6 free parameters involved in the rectification process. Such results can be used as a theoretical guide to reduce the induced projective distortion. Secondly, under the RANSAC (random sample consensus) paradigm, a robust trinocular rectification algorithm is proposed. Unlike traditional approaches, in which only the fundamental matrices are used to rectify the images, this algorithm uses corresponding points directly for the rectification.
A Linear Trinocular Rectification Method for Accurate Stereoscopic Matching
In this paper we propose and study a simple trinocular rectification method in which stratification into projective and affine components gives the rectifying homographies in closed form. The class of trinocular rectifications, which has 6 DOF, is parametrized by an independent set of parameters with a geometric meaning. This offers the possibility to minimize rectification distortion in a natural way. It is shown experimentally on real data that our algorithm performs the rectification task correctly. As shown on ground-truth data using Confidently Stable Matching, trinocular matching significantly improves disparity-map density and mismatch error, both depending on texture strength. Matching results on real complex scenes are reported.
Dense Stereomatching Algorithm Performance for View Prediction and Structure Reconstruction
The knowledge of stereo matching algorithm properties and behaviour under varying conditions is crucial for the selection of a proper method for the desired application. In this paper we study the behaviour of four representative matching algorithms under varying signal-to-noise ratio in six types of error statistics. The errors are focused on basic matching failure mechanisms and their definition observes the principles of independence, symmetry and completeness. A ground truth experiment shows that the best choice for view prediction is the Graph Cuts algorithm and for structure reconstruction it is the Confidently Stable Matching.
Estimation of the Temporomandibular Joint Position