Persons

doc. Ing. Petr Pollák, CSc.

All publications

Automatic Phonetic Segmentation and Pronunciation Detection with Various Approaches of Acoustic Modeling

  • Authors: Mizera, P., doc. Ing. Petr Pollák, CSc.
  • Publication: Speech and Computer. Basel: Springer, 2018. p. 419-429. LNAI. vol. 11096. ISSN 0302-9743. ISBN 978-3-319-99578-6.
  • Year: 2018
  • DOI: 10.1007/978-3-319-99579-3_44
  • Link: https://doi.org/10.1007/978-3-319-99579-3_44
  • Department: Department of Circuit Theory
  • Annotation:
    The paper describes HMM-based phonetic segmentation implemented with the KALDI toolkit, focusing on the accuracy of various acoustic modeling approaches: GMM-HMM vs. DNN-HMM, monophone vs. triphone, and speaker-independent vs. speaker-dependent. The analysis was performed on the TIMIT database and demonstrated the contribution of advanced acoustic modeling, especially for choosing the proper pronunciation variant. For this purpose, a lexicon covering the pronunciation variability among TIMIT speakers was created from the phonetic transcriptions available in the TIMIT corpus. Once the proper sequence of phones is recognized by the DNN-HMM system, more precise boundary placement can be obtained using basic monophone acoustic models.

Dithering techniques in automatic recognition of speech corrupted by MP3 compression: Analysis, solutions and experiments

  • Authors: Borský, M., Mizera, P., doc. Ing. Petr Pollák, CSc., Nouza, J.
  • Publication: Speech Communication. 2017, vol. 86, pp. 75-84. ISSN 0167-6393.
  • Year: 2017
  • DOI: 10.1016/j.specom.2016.11.007
  • Link: https://doi.org/10.1016/j.specom.2016.11.007
  • Department: Department of Circuit Theory
  • Annotation:
    A large portion of the audio files distributed over the Internet or stored in personal and corporate media archives is in compressed form. Several compression techniques and algorithms exist, but it is MPEG Layer-3 (known as MP3) that has achieved wide popularity in general audio coding, and in speech, too. However, the algorithm is lossy by nature and introduces distortion into the spectral and temporal characteristics of a signal. In this paper we study its impact on automatic speech recognition (ASR). We show that with decreasing MP3 bitrates the major source of ASR performance degradation is deep spectral valleys (i.e. bins with almost zero energy) caused by the masking effect of the MP3 algorithm. We demonstrate that these unnatural gaps in the spectrum can be effectively compensated by adding a certain amount of noise to the distorted signal. We provide theoretical background for this approach, showing that the added noise affects mainly the spectral valleys: they are filled by the noise while the spectral bins with speech remain almost unchanged. This helps to restore a more natural shape of the log spectrum and cepstrum, and consequently has a positive impact on ASR performance. In our previous work, we proposed two types of the signal dithering (noise addition) technique, one applied globally, the other in a more selective way. In this paper, we offer a more detailed insight into their performance. We provide results from many experiments in which we test them in various scenarios, using a large vocabulary continuous speech recognition (LVCSR) system, acoustic models based on Gaussian mixture models (GMM) as well as on deep neural networks (DNN), and multiple speech databases in three languages (Czech, English and German). Our results show that both proposed techniques, and the selective dithering method in particular, yield consistent compensation of the negative impact of MP3 compression on ASR performance.
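
    The claim that added noise fills the spectral valleys while leaving speech bins almost unchanged can be illustrated with a small numerical sketch. It is not the authors' implementation: the power values, the dither level, and the function names below are assumptions chosen for illustration, and the noise is modelled simply as a constant power added to every bin.

```python
import math

def to_db(power):
    """Convert a linear power value to decibels."""
    return 10.0 * math.log10(power)

def dithered_log_spectrum(power_bins, noise_power):
    """Expected log power per bin after adding uncorrelated noise of the given power."""
    return [to_db(p + noise_power) for p in power_bins]

# Hypothetical power spectrum: speech bins plus two near-empty valley bins.
power_bins = [1.0, 0.5, 1e-9, 1e-9, 0.8]
noise_power = 1e-3  # assumed dither level, far below the speech bins

before = [to_db(p) for p in power_bins]
after = dithered_log_spectrum(power_bins, noise_power)
change_db = [a - b for a, b in zip(after, before)]

for p, delta in zip(power_bins, change_db):
    print(f"bin power {p:g}: log spectrum raised by {delta:.2f} dB")
```

    In this toy setting the valley bins rise by tens of decibels, while the speech bins move by a small fraction of a decibel, which is the behaviour the annotation describes for the log spectrum.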

Improving of LVCSR for Casual Czech Using Publicly Available Language Resources

  • Authors: Mizera, P., doc. Ing. Petr Pollák, CSc.
  • Publication: Speech and Computer. Heidelberg: Springer, 2017. p. 427-437. Lecture Notes in Artificial Intelligence. vol. LNAI 10458. ISSN 0302-9743. ISBN 978-3-319-66428-6.
  • Year: 2017
  • DOI: 10.1007/978-3-319-66429-3_42
  • Link: https://doi.org/10.1007/978-3-319-66429-3_42
  • Department: Department of Circuit Theory
  • Annotation:
    The paper presents the design of a Czech casual speech recognition system, which is part of wider research focused on understanding very informal speaking styles. The study was carried out using the NCCCz corpus, and the contributions of optimized acoustic and language models as well as pronunciation lexicon optimization were analyzed. Special attention was paid to the impact of publicly available corpora suitable for language model (LM) creation. On the casual speech recognition task, our final DNN-HMM system achieved a WER of 30-60%, depending on the LM used. Recognition results for other speaking styles are also presented for comparison. The system was built using the KALDI toolkit, and the created recipes are available to the research community.

KALDI Recipes for the Czech Speech Recognition Under Various Conditions

  • Authors: Mizera, P., Fiala, J., Brich, A., doc. Ing. Petr Pollák, CSc.
  • Publication: Text, Speech, and Dialogue. 19th International Conference, TSD 2016. Heidelberg: University of Heidelberg, 2016. Lecture Notes in Artificial Intelligence. ISSN 0302-9743. ISBN 978-3-319-45510-5.
  • Year: 2016
  • DOI: 10.1007/978-3-319-45510-5_45
  • Link: https://doi.org/10.1007/978-3-319-45510-5_45
  • Department: Department of Circuit Theory
  • Annotation:
    The paper presents the implementation of a Czech ASR system under various conditions using the KALDI speech recognition toolkit in two standard state-of-the-art architectures (GMM-HMM and DNN-HMM). We present recipes for building LVCSR systems using the SpeechDat, SPEECON, CZKCC, and NCCCz corpora, together with a new update of the feature extraction tool CtuCopy, which now supports the KALDI format. All presented recipes, as well as the CtuCopy tool, are publicly available under the Apache license v2.0. Finally, an extension of the KALDI toolkit is described that supports running the described LVCSR recipes on MetaCentrum computing facilities (the Czech National Grid Infrastructure operated by CESNET). The experimental part presents the baseline performance of both GMM-HMM and DNN-HMM LVCSR systems on the given Czech corpora. These results also demonstrate the behaviour of the designed LVCSR systems under various acoustic conditions as well as various speaking styles.

Advanced Acoustic Modelling Techniques in MP3 Speech Recognition

  • Authors: Borský, M., doc. Ing. Petr Pollák, CSc., Mizera, P.
  • Publication: EURASIP Journal on Audio Speech and Music Processing. 2015, 2015:20. ISSN 1687-4722.
  • Year: 2015
  • DOI: 10.1186/s13636-015-0064-7
  • Link: https://doi.org/10.1186/s13636-015-0064-7
  • Department: Department of Circuit Theory
  • Annotation:
    The automatic recognition of MP3 compressed speech presents a challenge to the current systems due to the lossy nature of compression which causes irreversible degradation of the speech wave. This article evaluates the performance of a recognition system optimized for MP3 compressed speech with current state-of-the-art acoustic modelling techniques and one specific front-end compensation method. The article concentrates on acoustic model adaptation, discriminative training and additional dithering as a prominent means of compensating for the described distortion in the task of phoneme and large vocabulary continuous speech recognition (LVCSR). The experiments presented on the phoneme task show a dramatic increase of the recognition error for unvoiced speech units as a direct result of compression. The application of acoustic model adaptation has proved to yield the highest relative contribution while the gain of discriminative training diminished with decreasing bit-rate. The application of additional dithering yielded a consistent improvement only for the MFCC features, but the overall results were still worse than those for the PLP features.

Analysis and automatic recognition of compressed speech

  • Authors: Borský, M., doc. Ing. Petr Pollák, CSc.
  • Publication: Tackling the Complexity in Speech. Praha: Filozofická fakulta Univerzity Karlovy v Praze, 2015. p. 205-221. Opera Facultatis philosophicae Universitatis Carolinae Pragensis. vol. 14. ISBN 978-80-7308-558-2.
  • Year: 2015
  • Department: Department of Circuit Theory
  • Annotation:
    The deployment of automatic speech recognition (ASR) systems in real life is often hampered by diverse acoustic conditions. This diversity makes it necessary to build robust systems that perform reliably regardless of the conditions. MP3 compression represents one such condition, where lossy encoding degrades the quality of the extracted features and therefore the recognition. Optimized settings for MP3 recognition have been studied by various authors and different solutions have been proposed. This work presents the analysis of an optimized setup focused on the feature extraction and acoustic modeling blocks. The work summarizes the effects of methods proposed by the author and by other authors, all tested to determine the potential contribution of each method separately as well as in combination. The main goal of the optimization was to find the proper segmentation and to determine the importance of feature normalization, dithering, and acoustic model adaptation. The experiments were performed on signals of very good quality which were artificially compressed to simulate the effect of the spectral distortion. PLP features were extracted and normalized using CMVN, and various levels of noise were added in order to reduce the effect of the spectral distortion introduced by compression. Context-dependent AMs were trained for RAW data and for data compressed at 160kbit, 32kbit, 24kbit, and 16kbit. The final AMs were adapted by the CMLLR and MAP techniques. The goal of adaptation was to further improve the AM quality and to test model interchangeability. Recognition was performed on a 1-hour LVCSR task with a trigram LM.

Improved Estimation of Articulatory Features Based on Acoustic Features with Temporal Context

  • Authors: Mizera, P., doc. Ing. Petr Pollák, CSc.
  • Publication: Text, Speech, and Dialogue. 18th International Conference, TSD 2015. Heidelberg: Springer, 2015. pp. 560-568. Lecture Notes in Artificial Intelligence. ISSN 0302-9743. ISBN 978-3-319-24032-9.
  • Year: 2015
  • DOI: 10.1007/978-3-319-24033-6_63
  • Link: https://doi.org/10.1007/978-3-319-24033-6_63
  • Department: Department of Circuit Theory
  • Annotation:
    The paper deals with neural network-based estimation of articulatory features for Czech, intended for use in automatic phonetic segmentation or automatic speech recognition. In our current approach, we use multi-layer perceptron networks to extract the articulatory features through a non-linear mapping from standard acoustic features extracted from the speech signal. The suitability of various acoustic features and the optimum length of temporal context at the network input were analysed. The temporal context is represented by a context window created from stacked feature vectors. The optimum length of this temporal context was analysed and identified for context windows in the range from 9 to 21 frames. We obtained 90.5% frame-level accuracy on average across all the articulatory feature classes for mel-log filter-bank features. The highest classification rate of 95.3% was achieved for the voicing class.
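
    A context window built from stacked feature vectors can be sketched as follows. This is an illustration only, not the paper's code: the function name, the toy two-dimensional features, and the choice to repeat the first and last frame at the utterance edges are assumptions.

```python
def stack_context(frames, context):
    """Stack each frame with `context` neighbours on each side.

    Frames beyond the utterance edges are replaced by the first or last
    frame (an assumed padding strategy).
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp index to valid range
            window.extend(frames[idx])
        stacked.append(window)
    return stacked

# Toy input: 6 frames of 2-dimensional features (hypothetical values).
frames = [[float(i), float(i) + 0.5] for i in range(6)]

# context=4 gives a 9-frame window, the shortest length named in the annotation.
stacked = stack_context(frames, context=4)
print(len(stacked), "frames, each of dimension", len(stacked[0]))
```

    With `context=4` each network input has 9 times the original feature dimension; `context=10` would give the 21-frame window at the other end of the analysed range.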

Phonetic Segmentation Using KALDI and Reduced Pronunciation Detection in Causal Czech Speech

  • Authors: Patč, Z., Mizera, P., doc. Ing. Petr Pollák, CSc.
  • Publication: Text, Speech, and Dialogue. 18th International Conference, TSD 2015. Heidelberg: Springer, 2015. p. 433-441. Lecture Notes in Artificial Intelligence. ISSN 0302-9743. ISBN 978-3-319-24032-9.
  • Year: 2015
  • DOI: 10.1007/978-3-319-24033-6_49
  • Link: https://doi.org/10.1007/978-3-319-24033-6_49
  • Department: Department of Circuit Theory
  • Annotation:
    The paper describes the implementation of phonetic segmentation using tools from the KALDI toolkit. Its use is motivated by the active development and support of current ASR techniques available in KALDI. The presented work is related to research on pronunciation variability in casual Czech speech. For this purpose, we use automatic phonetic segmentation to analyze particular phone boundaries, deletions, etc. We also present a tool for pronunciation detection. Both tools can be used for processing large databases as well as for interactive work within the Praat environment. An illustrative analysis of the segmentation accuracy and the design of a new environment for phonetic segmentation in Praat are also presented.

Spectrally Selective Dithering for Distorted Speech Recognition

  • Authors: Borský, M., Mizera, P., doc. Ing. Petr Pollák, CSc.
  • Publication: INTERSPEECH 2015. Bochum: ISCA - International Speech Communication Association, 2015. ISSN 2308-457X.
  • Year: 2015
  • Department: Department of Circuit Theory
  • Annotation:
    The performance of speech recognition systems can be significantly degraded if the speech spectrum is distorted. This includes situations such as the usage of an improper recording device, enhancement technique or speech coder. This paper presents a front-end compensation method called spectrally selective dithering aimed at reconstructing the spectral characteristics of nonlinearly distorted speech. The technique is designed to detect the suppressed frequency bands in the speech signal and add a weighted amount of additive noise. The detection algorithm is based on the smoothness of the excitation signal spectrum obtained through analyzing LPC filtration. The gain of the added noise is estimated from the unaffected frequency bands. The practical usability of the algorithm has been studied in the task of MP3 speech recognition for very low bit-rates. The obtained results have demonstrated the advantage of using the proposed technique. We achieved up to 1.85% absolute WER reduction using the standard HMM-GMM architecture in LVCSR task.

Estimation of Articulatory Features for Czech Language

  • Authors: Mizera, P., doc. Ing. Petr Pollák, CSc.
  • Publication: 22nd Czech-German Workshop on Speech Communication. Book of Abstracts. 2014. pp. 25-26.
  • Year: 2014
  • Department: Department of Circuit Theory
  • Annotation:
    Automatic speech recognition (ASR) for the Czech language has been intensively studied in the past decades. Researchers have successfully developed several practical applications such as dictation programs, automatic broadcast transcription (subtitling) and others. The accuracy of these ASR systems is generally satisfactorily high; however, it is significantly lower if the signal is corrupted, e.g. in the case of high-level background noise, spontaneous speech, or when speech is masked and pronounced in a reduced form. These issues are still an obstacle to wider usage of voice recognition technology under such conditions, because the commonly achieved WER (Word Error Rate) of spontaneous speech recognition is above 50% on average. A possible way to overcome this deficiency is to use speech production knowledge within ASR systems. Consequently, speech production knowledge based on articulatory features (AFs) is increasingly used at the feature level, with the main purpose of improving the recognition of spontaneous or casual speech. The aim of our research is to analyse the possible contribution of articulatory features to the description of spontaneous or casual speech in the Czech language.

Impact of Irregular Pronunciation on Phonetic Segmentation of Nijmegen Corpus of Casual Czech

  • Authors: Mizera, P., doc. Ing. Petr Pollák, CSc., Kolman, A., Ernestus, M.
  • Publication: Text, Speech, and Dialogue. 17th International Conference, TSD 2014. Heidelberg: Springer, 2014. pp. 499-507. Lecture Notes in Artificial Intelligence. ISSN 0302-9743. ISBN 978-3-319-10815-5.
  • Year: 2014
  • DOI: 10.1007/978-3-319-10816-2_60
  • Link: https://doi.org/10.1007/978-3-319-10816-2_60
  • Department: Department of Circuit Theory
  • Annotation:
    This paper describes a pilot study of phonetic segmentation applied to the Nijmegen Corpus of Casual Czech (NCCCz). This corpus contains informal speech of a strongly spontaneous nature, which influences the character of the produced speech at various levels. This work is part of wider research related to the analysis of pronunciation reduction in such informal speech. We present an analysis of the accuracy of phonetic segmentation when canonical or reduced pronunciation is used. The achieved segmentation accuracy provides information about the general accuracy of the acoustic modelling that is supposed to be applied in spontaneous speech recognition. As a byproduct of the presented spontaneous speech segmentation, this paper also describes the created lexicon with canonical pronunciations of words in NCCCz, a tool supporting the pronunciation check of lexicon items, and a mini-database of selected utterances from NCCCz manually labelled at the phonetic level, suitable for evaluation purposes.

Recognition of Spectrally Distorted Speech after MP3 Compression

  • Authors: Borský, M., doc. Ing. Petr Pollák, CSc.
  • Publication: 22nd Czech-German Workshop on Speech Communication. Book of Abstracts. 2014. pp. 3-4.
  • Year: 2014
  • Department: Department of Circuit Theory
  • Annotation:
    The deployment of automatic speech recognition (ASR) systems in real life is often hampered by diverse acoustic conditions. This diversity makes it necessary to build robust systems that perform reliably regardless of the conditions. MP3 compression represents one such condition, where lossy encoding degrades the quality of the extracted features and therefore the recognition. Optimized settings for MP3 recognition have been studied by various authors and different solutions have been proposed. This work presents the analysis of an optimized setup focused on the feature extraction and acoustic modeling blocks. The work summarizes the effects of methods proposed by the author and by other authors, all tested to determine the potential contribution of each method separately as well as in combination.

Robust Neural Network-Based Estimation of Articulatory Features for Czech

  • DOI: 10.14311/NNW.2014.24.027
  • Link: https://doi.org/10.14311/NNW.2014.24.027
  • Department: Department of Circuit Theory
  • Annotation:
    The article describes neural network-based articulatory feature (AF) estimation for Czech speech. First, the relationship between AFs and a Czech phone inventory is defined, and then the estimation is performed using MLP neural networks. Several speech representations at the input of the MLP classifiers are proposed in order to obtain a robust AF estimation. The experiments showed that ANN-based AF estimation works very reliably, especially in a low-noise environment. Moreover, when the number of neurons in the hidden layer is increased and temporal-context DCT-TRAP features are used at the input of the MLP network, the AF classification also works accurately for signals collected in environments with high background noise.

Speech reduction in Czech

  • Authors: Kolman, A., doc. Ing. Petr Pollák, CSc.
  • Publication: LabPhon 14. The 14th Conference on Laboratory Phonology. Tokyo: National Institute for Japanese Linguistics in Tokyo, 2014. Available from: http://www.ninjal.ac.jp/labphon14/LP14_FINAL_20140708.pdf
  • Year: 2014
  • Department: Department of Circuit Theory
  • Annotation:
    The present study contributes to this research by investigating speech reduction in Czech. Our study is based on the Nijmegen Corpus of Casual Czech, which was recorded in Prague in November 2008, and consists of 39 hours of casual conversations between 26 groups of three friends. We studied speech reduction in this corpus by focusing on a number of frequent words and frequent phoneme sequences. First, we see patterns that have also been observed in other languages. Second, Czech also shows clear effects of morphology, which has not been attested for other languages so far. A third interesting topic concerns syllabic consonants and we find that the segment's probability to be absent is modulated by the complexity of the resulting consonant cluster. Our study of Czech clearly shows that it is worthwhile to extend the study of reduction to typologically different languages.

The Nijmegen Corpus of Casual Czech

  • Authors: Ernestus, M., Kockova-Amortova, L., doc. Ing. Petr Pollák, CSc.
  • Publication: Proceedings of the 9th Language Resources and Evaluation Conference. Paris: ELRA - European Language Resources Association, 2014. ISBN 978-2-9517408-8-4.
  • Year: 2014
  • Department: Department of Circuit Theory
  • Annotation:
    This article introduces a new speech corpus, the Nijmegen Corpus of Casual Czech (NCCCz), which contains more than 30 hours of high-quality recordings of casual conversations in Common Czech, among ten groups of three male and ten groups of three female friends. All speakers were native speakers of Czech, raised in Prague or in the region of Central Bohemia, and were between 19 and 26 years old. Every group of speakers consisted of one confederate, who was instructed to keep the conversations lively, and two speakers naive to the purposes of the recordings. The naive speakers were engaged in conversations for approximately 90 minutes, while the confederate joined them for approximately the last 72 minutes. The corpus was orthographically annotated by experienced transcribers and this orthographic transcription was aligned with the speech signal. In addition, the conversations were videotaped. This corpus can form the basis for all types of research on casual conversations in Czech, including phonetic research and research on how to improve automatic speech recognition. The corpus will be freely available.

The optimization of PLP feature extraction for LVCSR recognition of MP3 data

  • Authors: Borský, M., doc. Ing. Petr Pollák, CSc.
  • Publication: 19th International Conference on Applied Electronics 2014. Pilsen: University of West Bohemia, 2014. p. 55-58. ISSN 1803-7232. ISBN 978-80-261-0276-2.
  • Year: 2014
  • Department: Department of Circuit Theory
  • Annotation:
    This paper analyses the contribution of an optimized PLP feature extraction setup and of feature normalization to the performance of an automatic speech recognition system for data compressed by the MP3 algorithm. The experimental study, performed on loop-digit recognition and large vocabulary continuous speech recognition tasks, showed that a proper setup can negate the effect of lower compression rates, which can then achieve results comparable with higher rates. The second finding is that the normalization techniques contribute significantly to overall performance, especially for shorter windows/shifts and lower compression rates. The acoustic models trained on 160 kbit/s, 32 kbit/s and 16 kbit/s data achieved 34.17%, 41.88% and 36.4% WER, respectively, on the LVCSR task. In comparison, the acoustic models trained on uncompressed data achieved 28.56% WER.

Accuracy of HMM-Based Phonetic Segmentation Using Monophone or Triphone Acoustic Model

  • Authors: Mizera, P., doc. Ing. Petr Pollák, CSc.
  • Publication: Applied Electronics - 2013 International Conference on Applied Electronics. Pilsen: University of West Bohemia, 2013. pp. 181-184. ISSN 1803-7232. ISBN 978-80-261-0166-6.
  • Year: 2013
  • Department: Department of Circuit Theory
  • Annotation:
    The paper compares the accuracy of HMM-based automatic phonetic segmentation using various signal representations as well as acoustic models of various complexity, i.e. acoustic models of monophones or word-internal triphones with various numbers of mixtures. The precision of automatic phonetic segmentation was measured by comparison with manually segmented speech data. The analysis showed that segmentation with acoustic models of word-internal triphones yielded better accuracy. The best results were attained for acoustic models of word-internal triphones with four mixtures. In this case, the average shift of phone boundaries and the average change of phone length were about 5.9 ms and 0.2 ms, respectively.

Noise and Channel Normalized Cepstral Features for Far-Speech Recognition

  • Authors: Borský, M., Mizera, P., doc. Ing. Petr Pollák, CSc.
  • Publication: Speech and Computer. Cham: Springer International Publishing AG, 2013. pp. 241-248. Lecture Notes in Artificial Intelligence. ISSN 0302-9743. ISBN 978-3-319-01930-7.
  • Year: 2013
  • DOI: 10.1007/978-3-319-01931-4_32
  • Link: https://doi.org/10.1007/978-3-319-01931-4_32
  • Department: Department of Circuit Theory
  • Annotation:
    The paper analyses suitable features for distorted speech recognition. The aim is to explore the application of a command ASR system when the speech is recorded with far-distance microphones, possibly with strong additive and convolutional noise. The paper analyses the contribution of basic spectral subtraction coupled with cepstral mean normalization to minimizing the influence of the distortion present in such a far-talk channel. The results are compared with a reference close-talk speech recognition system and show an improvement in WER for channels with low or medium SNR. Using the combination of these basic techniques, a WERR of 55.6% was obtained for the medium-distance channel and a WERR of 22.5% for the far-distance channel.

Optimized State-Tying for Triphone-Based HMMs under Training Data Deficiency

  • Authors: Borský, M., doc. Ing. Petr Pollák, CSc.
  • Publication: Applied Electronics - 2013 International Conference on Applied Electronics. Pilsen: University of West Bohemia, 2013. pp. 45-48. ISSN 1803-7232. ISBN 978-80-261-0166-6.
  • Year: 2013
  • Department: Department of Circuit Theory
  • Annotation:
    This paper deals with the optimization of state-tying for triphone-based HMMs in the case of training data deficiency. The main goal is to analyse the importance of the stopping threshold for the criterion function in tree-based clustering. The log-likelihood measure was used as the criterion function, and a varying threshold was evaluated with different sizes of the training set. Tied-state triphone HMMs with multiple Gaussian mixtures were trained under various setups. The experiments showed that more complex AMs with fewer mixtures could achieve better results than less complex models with more mixtures. The same conclusion held even for a significantly reduced amount of training data.

Various Approaches of Small Vocabulary Speech Recognizer Implementation Using HTK Toolkit

  • Authors: Borský, M., doc. Ing. Petr Pollák, CSc.
  • Publication: POSTER 2013 - 17th International Student Conference on Electrical Engineering. Prague: Czech Technical University, 2013. pp. 1-5. ISBN 978-80-01-05242-6.
  • Year: 2013
  • Department: Department of Circuit Theory
  • Annotation:
    This paper presents the construction of a small vocabulary recognizer using the publicly available HTK toolkit. Two decoders are available, HVite and HDecoder, and the approaches to recognizer creation are described for both, together with the creation of proper acoustic models, because these two tools require slightly different kinds of subword acoustic models. In the experimental part, both decoders were evaluated on the loop-digit recognition task with word and cross-word triphone based AMs. The computational costs of the described approaches are compared as well.

ANALYSIS OF ACOUSTIC ECHO CANCELLATION AND DOUBLE TALK DETECTION IN SMARTPHONE DEVICES

  • Authors: Klapuch, J., doc. Ing. Petr Pollák, CSc.
  • Publication: 20th Annual Conference Proceeding's Technical Computing Bratislava 2012. Praha: Humusoft, 2012. pp. 1-8. ISBN 978-80-970519-4-5.
  • Year: 2012
  • Department: Department of Circuit Theory
  • Annotation:
    The article is concerned with testing algorithms for acoustic echo suppression and double talk detection in the smartphone environment. Acoustic echo combined with environmental noise is generally very uncomfortable in telecommunication, especially in mobile phones. The design of current mobile phones and their usage cause a high level of echo and distortion. The main task was to create a suitable double talk detector (DTD), which is an important component of echo suppression. The detector was analyzed based on the average coherence of the speaker and microphone signals with cepstral speech detection in both channels. For this and other purposes, a GUI application was created in MATLAB that enables analysis and testing of the algorithms. Simulations were performed on a newly created speech database representing mobile communication. DTD evaluation was carried out as the statistical rate of erroneous detection over the entire signal and separately as a measure of false detection in critical segments of the signal, which are important for the proper running of AEC.

Estimation of fundamental frequency with localization of pitch marks and pitch-synchronous segmentation

  • Authors: Mizera, P., doc. Ing. Petr Pollák, CSc.
  • Publication: 20th Annual Conference Proceeding's Technical Computing Bratislava 2012. Praha: Humusoft, 2012. pp. 1-8. ISBN 978-80-970519-4-5.
  • Year: 2012
  • Department: Department of Circuit Theory
  • Annotation:
    This paper analyses the results of standard fundamental frequency estimation algorithms (PDA) and their influence on the accuracy of pitch mark localization in the PMA. The analysis is focused on algorithms based on the autocorrelation function (ACF), the average magnitude difference function (AMDF) and the cepstrum. The article presents a procedure for pitch-synchronous segmentation and subsequent resynthesis of segments allowing changes in prosodic features.
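
    Of the three estimator families named above, the autocorrelation-based one is the simplest to sketch. The following is a minimal illustration under stated assumptions, not the procedure evaluated in the paper: the function names, the 50-400 Hz search range, the 8 kHz sample rate, and the synthetic sine test signal are all chosen for the example.

```python
import math

def autocorrelation(frame, lag):
    """Short-time autocorrelation of one frame at a single lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

def estimate_f0_acf(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Return an F0 estimate in Hz from the lag with the highest autocorrelation."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return sample_rate / best_lag

# Synthetic 50 ms frame: a pure 100 Hz sine sampled at 8 kHz (assumed test signal).
sample_rate = 8000
true_f0 = 100.0
frame = [math.sin(2.0 * math.pi * true_f0 * n / sample_rate) for n in range(400)]

f0 = estimate_f0_acf(frame, sample_rate)
print(f"estimated F0: {f0:.1f} Hz")
```

    Real speech would additionally need a voicing decision and some protection against octave errors, which this sketch leaves out.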

Knowledge-Based and Automated Clustering in MLLR Adaptation of Acoustic Models for LVCSR

  • Authors: Borský, M., doc. Ing. Petr Pollák, CSc.
  • Publication: 2012 International Conference on Applied Electronics. Pilsen: University of West Bohemia, 2012. pp. 33-36. ISSN 1803-7232. ISBN 978-80-261-0038-6.
  • Year: 2012
  • Department: Department of Circuit Theory
  • Annotation:
    This paper analyses the performance of MLLR-based speaker adaptation in a large vocabulary continuous speech recognition system. Two approaches to clustering in MLLR adaptation with multiple regression classes were analysed: knowledge-based clustering and automatic clustering. The contribution of acoustic model adaptation using these two clustering approaches was compared based on the word error rate ratio (WERR) of the target LVCSR system. The study showed that knowledge-based clustering may bring an improvement comparable to tree-based clustering when only a few transformation classes are manually defined.

Small and Large Vocabulary Speech Recognition of MP3 Data under Real-Word Conditions: Experimental Study

  • Authors: doc. Ing. Petr Pollák, CSc., Borský, M.
  • Publication: Communications in Computer and Information Science. 2012, vol. 314, pp. 409-419. ISSN 1865-0929.
  • Year: 2012
  • DOI: 10.1007/978-3-642-35755-8_29
  • Link: https://doi.org/10.1007/978-3-642-35755-8_29
  • Department: Department of Circuit Theory
  • Annotation:
    This paper presents a study of speech recognition accuracy for both small and large vocabulary tasks with respect to different levels of MP3 compression of the processed data. The motivation behind the work was to evaluate the usage of an ASR system for off-line automatic transcription of recordings collected from standard present-day MP3 devices under different levels of background noise and channel distortion. Although MP3 may not be an optimal compression algorithm, the experiments showed that it does not distort the speech signal significantly for higher compression rates. The experiments also showed that the accuracy of speech recognition (both small- and large-vocabulary) decreased very slowly for bit rates of 24 kbps and higher. However, a slightly different setup of speech feature computation is necessary for MP3 speech data; in particular, PLP features give significantly better results than MFCC.

Accuracy of MP3 Speech Recognition Under Real-World Conditions. Experimental Study

  • Authors: doc. Ing. Petr Pollák, CSc., Běhunek, M.
  • Publication: Proceedings of SIGMAP 2011 - International Conference on Signal Processing and Multimedia Applications. Sevilla: University of Seville, 2011. pp. 5-10. ISBN 978-989-8425-72-0.
  • Year: 2011
  • Department: Department of Circuit Theory
  • Annotation:
    This paper presents a study of speech recognition accuracy with respect to different levels of MP3 compression. Special attention is paid to the processing of speech signals of different quality, i.e. with different levels of background noise and channel distortion. The work was motivated by the possible use of ASR for offline automatic transcription of audio recordings collected by common, widespread MP3 devices. The experiments have shown that the MP3 format does not distort speech significantly, especially for high or moderate bit rates and high-quality source data. The accuracy of connected-digit ASR decreased very slowly down to a bit rate of 24 kbps. For the best case of PLP parameterization in the close-talk channel, just a 3% decrease in recognition accuracy was observed, while the size of the compressed file was approximately 10% of the original size. All results were slightly worse in the presence of additive background noise and channel distortion.

ASR systems in Noisy Environment: Analysis and Solutions for Increasing Noise Robustness

  • Department: Department of Circuit Theory
  • Annotation:
    This paper deals with the analysis of Automatic Speech Recognition (ASR) suitable for use in noisy environments and suggests an optimum configuration under various noisy conditions. The behavior of standard parameterization techniques was analyzed from the viewpoint of robustness against background noise. This was done for Mel-frequency cepstral coefficients (MFCC), Perceptual linear predictive (PLP) coefficients, and their modified forms combining the main blocks of PLP and MFCC. The second part is devoted to the analysis and contribution of modified techniques containing frequency-domain noise suppression and voice activity detection. The above-mentioned techniques were tested with signals from real noisy environments within a Czech digit recognition task and with the AURORA databases. Finally, the contribution of special VAD-selective training and MLLR adaptation of acoustic models was studied for various signal features.

Coverage of Spontaneous Conversational Speech from Nijmegen Corpus of Casual Czech by General ASR Language Models

  • Authors: Procházka, V., doc. Ing. Petr Pollák, CSc.,
  • Publication: Workshop Production and Comprehension of Conversational Speech. Radboud University Nijmegen, 2011. pp. 34-35.
  • Year: 2011
  • Department: Department of Circuit Theory
  • Annotation:
    Large Vocabulary Continuous Speech Recognition (LVCSR), one of the most frequent applications of speech technology, is nowadays being used in a growing number of applications in everyday life. Consequently, the need for spontaneous speech recognition also arises; however, such speech has a strongly different character in comparison to non-spontaneous speech, and its specific phenomena are not expected to be covered by a standard general Language Model (LM). In this contribution we analyze the Nijmegen Corpus of Casual Czech (NCCCz) from the point of view of several publicly available LMs. We analyze the rate of Out-Of-Vocabulary (OOV) words, the rate of word fragments, repetitions, and repeated starts, the perplexity computed at the text level on the transcriptions of NCCCz, and the LVCSR performance on the recordings using the above-mentioned LMs.
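    As an illustration of the OOV measure mentioned in this annotation, the following is a minimal sketch, not the authors' implementation. It assumes whitespace-tokenized text and a plain set as the LM vocabulary; the function name `oov_rate` and the sample tokens are hypothetical.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of running words (tokens) that are not in the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


# Hypothetical example: 7 running words, 2 of them missing from the vocabulary.
tokens = "no tak jo to je fakt hustý".split()
vocabulary = {"no", "tak", "to", "je", "fakt"}
rate = oov_rate(tokens, vocabulary)
print(f"OOV rate: {rate:.3f}")
```

    The rate here is counted over running words (tokens); counting over distinct word types instead would generally give a different figure.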

Performance of Czech Speech Recognition with Language Models Created from Public Resources

  • Authors: Procházka, V., doc. Ing. Petr Pollák, CSc., Žďánský, J., Nouza, J.
  • Publication: Radioengineering. 2011, 40(4), 1002-1008. ISSN 1210-2512.
  • Year: 2011
  • Department: Department of Circuit Theory
  • Annotation:
    In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LMs) applicable to Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets: the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus created from the Czech National Corpus. We also tested an LM built from a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared via their perplexity rates and when employed in large vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made from smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as a highly inflective language on the perplexity, OOV, and recognition accuracy rates.

Analysis of Czech Web 1T 5-gram corpus and its comparison with Czech National Corpus Data

  • Authors: Procházka, V., doc. Ing. Petr Pollák, CSc.,
  • Publication: Lecture Notes in Artificial Intelligence. 2010, 6231(2010933819), 181-188. ISSN 0302-9743.
  • Year: 2010
  • DOI: 10.1007/978-3-642-15760-8_24
  • Link: https://doi.org/10.1007/978-3-642-15760-8_24
  • Department: Department of Circuit Theory
  • Annotation:
    In this paper, the newly issued Czech Web 1T 5-gram corpus created by Google and LDC is analysed and compared with a reference n-gram corpus obtained from the Czech National Corpus. The original 5-grams from both corpora were post-processed, and statistical trigram language models of various vocabulary sizes and parameters were created. A comparison of various corpus statistics, such as unique and total word and n-gram counts before and after post-processing, is presented and discussed, with a particular focus on clearing the Web 1T data of invalid tokens. The tools from the HTK Toolkit were used for the evaluation, and accuracy, OOV rates and perplexity were measured using sentence transcriptions from the Czech SPEECON database.

Creation of Czech continuous speech recognizer using HTK Toolkit

  • Authors: Rajnoha, J., Procházka, V., doc. Ing. Petr Pollák, CSc.,
  • Publication: Akustické listy. 2010, 16(1), 5-10. ISSN 1212-4702.
  • Year: 2010
  • Department: Department of Circuit Theory
  • Annotation:
    This paper describes the construction of an LVCSR system for the Czech language based on basic tools from the HMM Toolkit (HTK). The standard LVCSR design explained in the HTK documentation is supplemented by peculiarities specific to the Czech language. The paper gives an overview of the particular steps required for the creation of a system which can be used as a first step in LVCSR research. Although it is not an optimal solution, especially from the point of view of achieved speed and accuracy, the use of HTK tools provides high flexibility in testing different modifications of particular LVCSR modules. The paper also describes the training of context-dependent cross-word triphone HMMs and statistical language model generation with possible optimization of its performance. Finally, experiments on parameter settings for balancing recognition time and accuracy are presented. The proposed system currently achieves a real-time factor between 1.5 and 2 with acceptable accuracy for a medium-sized vocabulary recognition task.

HMM and GMM Based Voice Activity Detectors

  • Department: Department of Circuit Theory
  • Annotation:
    This article describes several solutions for voice activity detection, which represents an important subpart of more general research in the field of speech processing and which is the subject of many contemporary research activities and many applications of speech technology. Approaches based on Gaussian mixture models and hidden Markov models are presented in this article, together with a study of using different speech parametrizations in GMM- and HMM-based VADs. The presented detectors were compared with reference heuristic algorithms based on energy and cepstral analysis, and with the VAD according to the ITU-T G.729 recommendation. The suggested algorithms were tested using data from the CZKCC signal database recorded in a running car, and the contribution of the proposed statistical detectors based on GMM and HMM is evident, especially for speech signals collected in very noisy environments.

Multi-Channel Database of Spontaneous Czech with Synchronization of Channels Recorded by Independent Devices

  • Authors: doc. Ing. Petr Pollák, CSc., Rajnoha, J.
  • Publication: Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10). Paris: ELRA, 2010. ISBN 2-9517408-6-7.
  • Year: 2010
  • Department: Department of Circuit Theory
  • Annotation:
    This paper describes a Czech spontaneous speech database of lectures collected at the Czech Technical University in Prague, together with the procedure of its recording and annotation. Special attention is paid to the description of the time synchronization of signals recorded by two independent devices. This synchronization is based on cross-correlation analysis with simple automated selection of suitable short signal subparts. The database contains 21.7 hours of speech material recorded in 4 channels with 3 principally different microphones. The annotation of the database is composed of basic time segmentation, orthographic transcription, pronunciation lexicon, session and speaker information, and the documentation. The collection and annotation of this database is complete and its availability via ELRA is currently under preparation.
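    The cross-correlation idea behind the channel synchronization can be sketched as follows. This is only an illustrative sketch under simplifying assumptions, not the procedure from the paper: it searches a bounded range of integer lags on two short sample lists and omits the automated selection of suitable signal subparts. The function name `estimate_lag` and the toy signals are hypothetical.

```python
def estimate_lag(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A positive result means `other` is delayed relative to `ref`.
    The score for each lag is the plain (unnormalized) cross-correlation.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(ref):
            m = n + lag
            if 0 <= m < len(other):
                score += x * other[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy example: the second channel is the first one delayed by 3 samples.
ref = [0.0, 0.0, 1.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
delayed = [0.0] * 3 + ref[:-3]
lag = estimate_lag(ref, delayed, max_lag=5)
print("estimated delay in samples:", lag)
```

    For long recordings, an FFT-based correlation would be the usual choice; the direct loop is kept here only for readability.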

Preparation and analysis of Czech Web 1T 5-gram corpus for language model creation

  • Authors: Procházka, V., doc. Ing. Petr Pollák, CSc.,
  • Publication: Analýza a zpracování řečových a biologických signálů - sborník prací 2010. Praha: České vysoké učení technické v Praze, 2010. pp. 67-73. ISBN 978-80-01-04680-7.
  • Year: 2010
  • Department: Department of Circuit Theory
  • Annotation:
    This paper describes an approach to the analysis of the Czech Web 1T 5-gram corpus. The corpus was analyzed and its basic characteristics were evaluated. Various filtering methods were used during processing so that only meaningful words are included in the vocabulary. From this cleaned corpus, language models for Large Vocabulary Continuous Speech Recognition (LVCSR) were created and their perplexities were computed. For comparison, the same filtering methods were used for processing a 5-gram corpus based on the SYN2006PUB corpus, assembled by the Czech National Corpus (CNC).
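    For readers unfamiliar with the perplexity measure used here, the following minimal sketch shows the standard definition, the exponential of the negative mean log-probability per word. It assumes the per-word probabilities assigned by a language model are already available as a list; the function name `perplexity` and the sample values are hypothetical and are not results from the paper.

```python
import math


def perplexity(word_probs):
    """Perplexity of a word sequence given the probability the LM gave each word."""
    if not word_probs:
        raise ValueError("at least one word probability is required")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# A model that gives every word probability 1/4 is as uncertain as a
# uniform choice among 4 words, so its perplexity is 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(f"perplexity: {ppl:.2f}")
```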

Accuracy Analysis of Generalized Pronunciation Variant Selection in ASR Systems

  • Authors: Hanžl, V., doc. Ing. Petr Pollák, CSc.,
  • Publication: Lecture Notes in Artificial Intelligence. 2009, 5641(2009931057), 399-408. ISSN 0302-9743.
  • Year: 2009
  • DOI: 10.1007/978-3-642-03320-9_37
  • Link: https://doi.org/10.1007/978-3-642-03320-9_37
  • Department: Department of Circuit Theory
  • Annotation:
    Automated speech recognition (ASR) systems typically work with a pronunciation dictionary for generating the expected phonetic content of particular words in the recognized utterance. However, the pronunciation can vary in many situations. Besides the cases with multiple pronunciation variants specified manually in the dictionary, there are typically many other possible changes in pronunciation depending on word context or speaking style, which is very typical for our case of the Czech language. In this paper we have studied the accuracy of the selection of automatically predicted pronunciation variants in Czech HMM-based ASR systems. We have analyzed the correctness of pronunciation variant selection in forced alignment of known utterances. The selection of the proper pronunciation variant was intended mainly for more accurate training of acoustic HMM models. Finally, the accuracy of LVCSR results using different levels of automated pronunciation generation was tested.

Czech Spontaneous Speech Collection and Annotation: The Database of Technical Lectures

  • Authors: Rajnoha, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: Lecture Notes in Artificial Intelligence. 2009, 5641(2009931057), 377-385. ISSN 0302-9743.
  • Year: 2009
  • DOI: 10.1007/978-3-642-03320-9_35
  • Link: https://doi.org/10.1007/978-3-642-03320-9_35
  • Department: Department of Circuit Theory
  • Annotation:
    As speech recognition is applied in real working systems, spontaneous speech recognition is gaining importance. The need for a spontaneous speech database is therefore evident, and this paper describes the collection of Czech spontaneous data recorded during technical lectures. It should be used as material for the analysis of particular phenomena which appear in spontaneous speech, but also as extension material for the training of spontaneous speech recognizers. Speech signals are captured in two different channels with slightly different quality, and about 14 hours of speech from 15 different speakers have currently been collected and annotated. The first analyses of spontaneous-speech-related effects in the collected data have been performed, and a comparison with read speech databases is presented.

Design and Utilization of Testing Database for VAD Classification

  • Authors: Tatarinov, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: 19th Czech-German Workshop on Speech Processing. Prague: Institute of Photonics and Electronics AS CR, 2009. pp. 42-47. ISBN 978-80-86269-18-4.
  • Year: 2009
  • Department: Department of Circuit Theory
  • Annotation:
    Voice activity detection is one of the important issues addressed in current speech processing research. Different VADs are under development by many authors, and the need for their objective comparison and evaluation arises. This article presents the design of a VAD testing database together with a description of criteria that numerically describe the accuracy of a VAD. Recordings used for this database were selected as subsets of the CZKCC, CAR2ECS and SPEECON databases. As the source databases contain only transcriptions without time boundaries, these boundaries had to be added either manually or using HMM-based forced alignment. Our selection consists of different kinds of speech utterances such as isolated digits, short commands, names, phonetically rich sentences, etc. We have also selected signals recorded in different environments.

Long Recording Segmentation Based on Simple Power Voice Activity Detection with Adaptive Threshold and Post-Processing

  • Authors: doc. Ing. Petr Pollák, CSc., Rajnoha, J.
  • Publication: SPECOM 2009 Proceedings. St. Petersburg: Institute for Informatics and Automation of RAS (SPIIRAS), 2009. pp. 55-60. ISBN 978-5-8088-0442-5.
  • Year: 2009
  • Department: Department of Circuit Theory
  • Annotation:
    This paper describes a method of long recording segmentation based on Voice Activity Detection (VAD). Power-based detection using an adaptive threshold derived from power dynamics is the core of the presented approach. Simple post-processing based on long-time sub-segmentation is used for smoothing the primary VAD output to obtain the target start-point and end-point detection of particular utterances within long recordings. Because the algorithm is based on a simple power VAD, it can be implemented much more easily than approaches based on speech recognition. Although the presented approach is simple, it gives quite robust and satisfactory results for the pure segmentation task. This was confirmed both by tests with two different data types and by practical use during the creation of new speech corpora.
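    A power-based detector of this general kind can be sketched as follows. This is a simplified illustration and not the algorithm from the paper: the threshold rule (a fixed fraction of the observed power range in dB), the gap-filling post-processing, and all names and parameter values (`power_vad`, `fill_short_gaps`, `ratio`, `max_gap`) are assumptions made for the example.

```python
import math


def frame_power_db(signal, frame_len):
    """Mean power of consecutive non-overlapping frames, in dB."""
    powers = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        powers.append(10.0 * math.log10(mean_sq + 1e-12))  # avoid log(0)
    return powers


def power_vad(signal, frame_len=4, ratio=0.5):
    """Mark frames as speech when power exceeds a threshold placed at
    `ratio` of the way between the minimum and maximum frame power
    (assumed rule; needs at least one full frame)."""
    powers = frame_power_db(signal, frame_len)
    threshold = min(powers) + ratio * (max(powers) - min(powers))
    return [p > threshold for p in powers]


def fill_short_gaps(flags, max_gap=1):
    """Smooth the decision by filling non-speech gaps of at most
    `max_gap` frames that lie between two speech frames."""
    out = list(flags)
    i = 0
    while i < len(out):
        if out[i]:
            i += 1
            continue
        j = i
        while j < len(out) and not out[j]:
            j += 1
        if i > 0 and j < len(out) and (j - i) <= max_gap:
            for k in range(i, j):
                out[k] = True
        i = j
    return out


# Toy signal: silence, speech, a one-frame pause, speech, silence.
speech = [0.5, -0.5] * 4
signal = [0.001] * 8 + speech + [0.001] * 4 + speech + [0.001] * 8
raw = power_vad(signal)
smoothed = fill_short_gaps(raw)
print(raw)
print(smoothed)
```

    Deriving the threshold from the power range of the recording itself lets the same setting work across recordings with different gain, which is the main appeal of an adaptive threshold over a fixed one.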

Robust Speech Recognition in Car Environment Combining Noise Reduction and Acoustic Model Adaptation

  • Authors: Rajnoha, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: 19th Czech-German Workshop on Speech Processing. Prague: Institute of Photonics and Electronics AS CR, 2009. pp. 27-34. ISBN 978-80-86269-18-4.
  • Year: 2009
  • Department: Department of Circuit Theory
  • Annotation:
    This paper presents a study of proper front-end signal processing in combination with model adaptation, focused on ASR applications in the car environment. Firstly, we present the application of our noise suppression technique within standard feature extraction and an analysis of the results achieved with such features under different car conditions. The proposed technique significantly reduces the influence of the noisy background. A quite low WER for the digit recognition task can then be achieved for noisy data, especially in the close-talk channel. Secondly, further improvement is reached by applying the MLLR adaptation technique, studied mainly with respect to adaptation to the continuously changing background in the car environment.

The Dynamic Dimension of the Global Speech-Rhythm Attributes

  • Authors: Volín, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: Proceedings of Interspeech 2009. Grenoble: International Speech Communication Association, 2009. pp. 1543-1546. ISSN 1990-9772. ISBN 978-1-61567-692-7.
  • Year: 2009
  • Department: Department of Circuit Theory
  • Annotation:
    Recent years have revealed that certain global attributes of speech rhythm can be quite successfully captured with respect to consonantal and vocalic intervals in spoken texts. One of the problems of this approach lies in complex syllabic structures. Unless we make an a-priori phonological decision, sonorous consonants may contribute to either vocalic or consonantal part of the speech signal in post-initial and prefinal positions of syllabic onsets and codas. A procedure is offered to avoid phonological dilemmas together with tedious manual work. The method is tested on continuous Czech and English texts read out by several professionals.

HMM and EHMM Based Voice Activity Detectors and Design of Testing Platform for VAD Classification

  • Authors: Tatarinov, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: Digital Technologies 2008. Žilina: Žilinská universita, Elektrotechnická fakulta, 2008. pp. 1-4. ISBN 978-80-8070-953-2.
  • Year: 2008
  • Department: Department of Circuit Theory
  • Annotation:
    The use of LR and ergodic Markov models in voice activity detection and a VAD testing platform are presented in this article. Detectors based on HMMs and EHMMs reach better results than traditional energy or cepstral detectors. The suggested algorithms were tested with data recorded in a running car, and the contribution is evident especially in this very noisy environment. Together with the results of the experiments, the selection of the data and the design of the VAD testing platform are described in this paper. The speech recordings used consist of isolated digits, different commands, and names, and were recorded in a quiet car without the engine running, in a running car, or in a standing car with the engine running.

Phone Segmentation Tool with Integrated Pronunciation Lexicon and Czech Phonetically Labelled Reference Database

  • Authors: doc. Ing. Petr Pollák, CSc., Volín, J., Skarnitzl, R.
  • Publication: 6th International Conference on Language Resources and Evaluation. Paris: ELRA - European Language Resources Association, 2008. p. 1-5. ISBN 2-9517408-4-0.
  • Year: 2008
  • Department: Department of Circuit Theory
  • Annotation:
    Phonetic segmentation is a procedure used in many applications of speech processing, either as a subpart of automated systems or as a tool for interactive work. In this paper we present the latest development of our tool for automated phonetic segmentation. The tool is based on HMM forced alignment realized by the publicly available HTK toolkit. It is implemented in the environment of the Praat application and can be used with several optional settings. The tool is designed for the segmentation of utterances with known orthographic records, while the phonetic content is obtained from the pronunciation lexicon or, for new unknown words, from an orthoepic record generated by rules. The second part of this paper describes a small Czech reference database precisely labelled at the phonetic level, which is intended for the analysis of the accuracy of automatic phonetic segmentation.

Problems and Solutions in the Creation of Czech and Slovak Lexica for Speech Technology Applications: General Experiences and LC-Star2 Lexica

  • Authors: doc. Ing. Petr Pollák, CSc., Hanžl, V., Černocký, J., Smrž, P.
  • Publication: Digital Technologies 2008. Žilina: Žilinská universita, Elektrotechnická fakulta, 2008. pp. 1-5. ISBN 978-80-8070-953-2.
  • Year: 2008
  • Department: Department of Circuit Theory
  • Annotation:
    This paper presents results of interdisciplinary research which is devoted to design and collection of lexica for speech technology applications. Such lexica are required by automated speech recognizers (ASR), text-to-speech synthesis systems (TTS), or translation systems. For the design and creation of such lexica, linguistics or phonetics solutions are sometimes constrained by the nature of ASR or TTS systems. Within this paper, we would like to present our general experiences in this field and also some experiences from creation of Czech and Slovak LC-Star2 Lexica.

Speaker Non-Speech Event Modelling in Recognition of Read and Spontaneous Speech

  • Authors: Rajnoha, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: Digital Technologies 2008. Žilina: Žilinská universita, Elektrotechnická fakulta, 2008. pp. 1-6. ISBN 978-80-8070-953-2.
  • Year: 2008
  • Department: Department of Circuit Theory
  • Annotation:
    Modelling of non-speech events brings the necessary robustness to the recognition of natural or spontaneous utterances, which are usually full of such acoustic disfluencies. This paper presents a solution for speaker non-speech event modelling together with an analysis of how efficiently these events are modelled. Firstly, a procedure for efficient training of non-speech event models on read speech data is presented. Experiments with a simple ASR system showed a 26% decrease in word error rate and a significant decrease in insertion rate with these models. Secondly, the extension of the training data with a spontaneous speech collection is described. It contributes to the availability of more natural data for training purposes and mainly to better training of non-speech event models, which is demonstrated by an experiment on filled pause recognition.

Voice Activity Detection Based on Perceptual Cepstral Analysis

  • Authors: Rajnoha, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: Technical Computing Prague 2008. Praha: Humusoft, 2008. pp. 1-9. ISBN 978-80-7080-692-0.
  • Year: 2008
  • Department: Department of Circuit Theory
  • Annotation:
    This contribution deals with the description and implementation of a Voice Activity Detector (VAD) based on perceptual cepstral analysis of the speech signal. Cepstral detectors are more robust in noisy environments in comparison to simpler algorithms, e.g. energy-based systems. Moreover, by applying a filter bank for non-linear frequency scaling, perceptual analysis extracts speech features that better describe the signal for the purposes of speech detection. The paper describes the particular steps of the detection algorithm together with a more detailed description of the most important blocks and their implementation in MATLAB. The work compares the proposed algorithm with the standard detection procedure used in the G.729 voice codec. Possible utilization of detectors based on different algorithms is also discussed. Experiments using the proposed VAD algorithms in a speech recognition task led to a decrease in recognition error of approximately 50%.

Voice Activity Speech Detectors Based on Ergodic Markov Models

  • Authors: Tatarinov, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: Technical Computing Prague 2008. Praha: Humusoft, 2008. pp. 1-6. ISBN 978-80-7080-692-0.
  • Year: 2008
  • Department: Department of Circuit Theory
  • Annotation:
    The use of ergodic Markov models in voice activity detection is presented in this article. Traditional detectors usually utilize classifiers based on thresholding of suitable speech characteristics, whereas a statistical classifier is used in the presented paper. The classifier and voice activity detector were designed and tested using the CAR2CS database. Detectors based on ergodic hidden Markov models reach better results than traditional detectors. The most important contribution is the improvement of voice activity detection in very noisy environments.

Accuracy analysis of phonetic segmentation with multiple word-pronunciation variants and segmentation tool in Praat environment

  • Authors: doc. Ing. Petr Pollák, CSc., Volín, J., Skarnitzl, R.
  • Publication: Speech Processing. Prague: Institute of Photonics and Electronics AS CR, 2007. pp. 37-42. ISBN 978-80-86269-00-9.
  • Year: 2007
  • Department: Department of Circuit Theory
  • Annotation:
    The paper describes further activities in the research of automated phonetic segmentation of Czech speech. Within this research, we deal with finding phone boundaries in known utterances with a given orthographic record. The work presented in this paper focused on using multiple pronunciation variants related to the available orthography, which should lead to automated determination of the real phonetic content of the analyzed utterance, followed by the setting of phone boundaries. We also present a tool for phonetic segmentation realized in the Praat environment. The designed tool is now used for automated pre-segmentation before further precise labelling at the phonetic level.

Automated Phonetic Segmentation of Speech Based on HMM and its Implementation in Praat Environment

  • Authors: doc. Ing. Petr Pollák, CSc.,
  • Publication: Acta Universitatis Carolinae: Philologica. 2007, 48(2), 117-129. ISSN 0323-0767.
  • Year: 2007
  • Department: Department of Circuit Theory
  • Annotation:
    Phonetic segmentation is a task required in many applications of current speech technology. A typical solution used for this purpose is based on forced alignment of trained Hidden Markov Models (HMM) of particular phones. This approach is used in the described tool, which is built on the HTK Toolkit and integrated into the environment of the Praat program. The system is available for public use and can be used for automated phonetic segmentation of large data sets, but also as an interactive tool for pre-processing before further manual segmentation for the purposes of basic phonetic research. This is also the main reason why we started to solve this problem in co-operation with the Institute of Phonetics at Charles University in Prague. In co-operation with the phonetics experts from the group of prof. Palkova, we also analyzed in detail the precision of the segmentation algorithm, which was approximately 10 ms.

HMM-Based Phonetic Segmentation in Praat Environment

  • Authors: doc. Ing. Petr Pollák, CSc., Volín, J., Skarnitzl, R.
  • Publication: The XII International Conference Speech and Computer - SPECOM 2007. Moscow: Moskovskij gosudarstvennyj universitet im. M. V. Lomonosova, 2007. pp. 537-541. ISBN 6-7452-0110-X.
  • Year: 2007
  • Department: Department of Circuit Theory
  • Annotation:
    Phonetic segmentation is required in many applications of current speech technologies. One of the most frequently used methods is based on forced alignment of trained Hidden Markov models of phones. This approach is used in our phonetic segmentation tool which is constructed on the basis of HTK toolkit and integrated with the Praat environment. The system is currently used for Czech language and the required input is speech of known content, i.e. with its orthographic record. The system creates regular orthoepic transcription which is obtained by conversion rules. Exceptions from regular pronunciation can be marked by simple syntax so that forced alignment is finally provided on real phonetic contents of the utterance. The system is available for public usage.

Modified Feature Extraction Methods in Robust Speech Recognition

  • Authors: Rajnoha, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: Proceedings of 17th International Conference Radioelektronika 2007. Piscataway: Institute of Electrical and Electronic Engineers, 2007. pp. 521-524. ISBN 1-4244-0821-0.
  • Year: 2007
  • DOI: 10.1109/RADIOELEK.2007.371488
  • Link: https://doi.org/10.1109/RADIOELEK.2007.371488
  • Department: Department of Circuit Theory
  • Annotation:
    Speech recognisers use a parametric form of the signal to capture the features of speech that are most important for the recognition task. Mel-frequency cepstral coefficients (MFCC) and Perceptual linear prediction coefficients (PLP) are among the most commonly used methods. There is no rule for deciding which one is better to use; it depends mainly on the particular conditions. This paper presents tests on taking advantage of different parts of each parametrization process to get the best results in given conditions. A robust Hidden Markov model-based (HMM) Czech digit recogniser in a slightly noisy environment is used for this purpose. The experiments show that using Bark-frequency scaling, equal loudness pre-emphasis and the intensity-loudness power law in the original MFCC method can bring an improvement in white noise robustness for particular conditions. The results also revealed that the LP-based methods tend to generate insertion errors in the given environment.

Technology of Speech Communication

  • Department: Department of Circuit Theory
  • Annotation:
    The book brings complex information about digital processing of human speech in the fields of transmission, synthesis, and coding. It should serve as studying material for the subject Digital processing of speech signals from master program of structural study and also for the subject Phonetic signals and their coding from doctoral study program. The purpose of the book is also to present general overview of this research field for wide public or to present several research results obtained in this field in the research team at the Department of Circuit Theory at Faculty of Electrical Engineering at CTU in Prague.

Voice Activity Detection in Small Vocabulary Speech Recognition

  • Authors: Rajnoha, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: Speech Processing. Prague: Institute of Photonics and Electronics AS CR, 2007. pp. 43-48. ISBN 978-80-86269-00-9.
  • Year: 2007
  • Department: Department of Circuit Theory
  • Annotation:
    Experiments on using voice activity detection (VAD) as a part of the frame dropping method for suppressing the influence of background noise in speech recognition are presented in this work. A speaker-independent phoneme-based Czech digit sequence recogniser working in a real environment was used for this purpose. A parametrization-based VAD is used here and the results are compared under different conditions - noisy environment, distribution level and auditory-based signal parametrization. The experiments show that VAD-based frame dropping can improve recognition by decreasing the insertion error and increasing the precision of the speech models, with up to a 20% improvement in word error rate. However, the need for a universal setting of the detection algorithm for general environmental conditions introduces detection inaccuracy, which affects the recognition results.

Analysis of Glottal Stop Presence in Large Speech Corpus and Influence of Its Modelling on Segmentation Accuracy

  • Authors: doc. Ing. Petr Pollák, CSc., Volín, J., Skarnitzl, R.
  • Publication: Proceedings of the 16th Czech-German Workshop on Speech Processing. Praha: AV ČR, Ústav radiotechniky a elektroniky, 2006. pp. 98-104. ISBN 80-86269-15-9.
  • Year: 2006
  • Department: Department of Circuit Theory
  • Annotation:
    This work presents research on phonetic segmentation of real fluent Czech speech. The main goal was to overcome the segmentation inaccuracy caused by the missing model of the glottal stop. The HMM for the glottal stop was trained using an iterative procedure, because the training data contain no manually annotated information about glottal stop presence in particular utterances. The experiments confirmed that glottal stop modelling improves phonetic segmentation, and the localization of glottal stop presence was generally successful. Finally, experiments with changes in the HMM structure were carried out; however, the results confirm an improvement for several phone contexts only.

Data-Driven Design of Front-End Filter Bank for Lombard Speech Recognition

  • Authors: Bořil, H., Fousek, P., doc. Ing. Petr Pollák, CSc.,
  • Publication: Proceedings of Ninth International Conference on Spoken Language Processing. Rundle Mall: CAUSAL Production, 2006. pp. 381-384. ISSN 1990-9772.
  • Year: 2006
  • Department: Department of Circuit Theory
  • Annotation:
    Adverse environments not only corrupt the speech signal by additive and convolutional noise, which can be successfully addressed by a number of suppression algorithms, but also affect the way speech is produced. Speech production variations introduced by a speaker in reaction to a noisy background (Lombard effect) may result in a severe degradation of automatic speech recognition. This paper contributes to solving the Lombard speech recognition problem by providing a robust filter bank for use in front-ends. It is shown that cepstral features derived from the proposed filter bank significantly outperform conventional cepstral features.

Methodology of Lombard Speech Database Acquisition: Experiences with CLSD

  • Authors: Bořil, H., Bořil, T., doc. Ing. Petr Pollák, CSc.,
  • Publication: Proceedings of 5th International Conference on Language Resources and Evaluation. Paris: ELRA - European Language Resources Association, 2006. p. 1644-1647. ISBN 2-9517408-2-4.
  • Year: 2006
  • Department: Department of Circuit Theory
  • Annotation:
    The aim of this paper is to describe the hardware platform, scenarios and recording tool used for the acquisition of CLSD'05. A method for minimizing the speech attenuation introduced to the speaker by headphones is proposed. Finally, the contents and corpus of the database are presented to outline its suitability for analysis and modeling of the Lombard effect. The whole CLSD'05 database with detailed documentation is now released for public use.

Modelling of Speaker Non-speech Events in Robust Speech Recognition

  • Authors: Rajnoha, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: Proceedings of the 16th Czech-German Workshop on Speech Processing. Praha: AV ČR, Ústav radiotechniky a elektroniky, 2006. pp. 149-155. ISBN 80-86269-15-9.
  • Year: 2006
  • Department: Department of Circuit Theory
  • Annotation:
    This work presents experiments on modelling speaker non-speech events (SNE) using a robust speech recogniser based on hidden Markov models (HMM). A speaker-independent recogniser of spoken Czech digits, based on Czech phoneme modelling and working in a real environment, was used for this purpose. Only SNEs positioned between words are modelled, because they can easily be added to the recogniser grammar as if they were another word. The recognition results were analysed for two different testing datasets, each derived from the training sets (different in environmental conditions). The recognition score increased by about 22% and 11% for the two testing datasets compared with the results reached without modelling the events. The recogniser was also tested on data with unknown recording conditions. The low number of incorrectly inserted words shows that this modelling seems to be less dependent on recording conditions than the pure phoneme model case.

Analysis of Lombard Effect in Several Czech Databases

  • Authors: Bořil, H., doc. Ing. Petr Pollák, CSc.,
  • Publication: Proceedings of the 16th Conference Joined with the 15th Czech-German Workshop "Speech Processing". Dresden: TU Dresden, 2005. pp. 253-259. ISBN 3-938863-17-X.
  • Year: 2005

Comparison of Three Czech Speech Databases from the Standpoint of Lombard Effect Appearance

  • Authors: Bořil, H., doc. Ing. Petr Pollák, CSc.,
  • Publication: ASIDE 2005 - Applied Spoken Language Interaction in Distributed Environments - Book of Abstracts. Grenoble: International Speech Communication Association, 2005. ISSN 0908-1224. ISBN 87-90834-85-2.
  • Year: 2005

Confronting HMM-based Phone Labelling with Human Evaluation of Speech Production

  • Authors: Volín, J., Skarnitzl, R., doc. Ing. Petr Pollák, CSc.,
  • Publication: Interspeech Lisboa 2005. Grenoble: International Speech Communication Association, 2005. pp. 1541-1544. ISSN 1018-4074.
  • Year: 2005

Design and Collection of Czech Lombard Speech Database

  • Authors: Bořil, H., doc. Ing. Petr Pollák, CSc.,
  • Publication: Interspeech Lisboa 2005. Grenoble: International Speech Communication Association, 2005. pp. 1577-1580. ISSN 1018-4074.
  • Year: 2005

Design of Lombard Effect Speech Database

  • Authors: Bořil, H., Bořil, T., doc. Ing. Petr Pollák, CSc.,
  • Publication: Radioelektronika 2005 - Conference Proceedings. Brno: VUT v Brně, FEI, Ústav radioelektroniky, 2005. pp. 144-147. ISBN 80-214-2904-6.
  • Year: 2005
  • Department: Department of Circuit Theory
  • Annotation:
    Speech recognition efficiency decreases remarkably for speech uttered in adverse conditions. Besides the negative impact of speech signal corruption by ambient noise, the Lombard effect (LE), i.e. the speaker's modification of speech characteristics in an effort to increase communication intelligibility, results in significant degradation of clean speech recognizer performance. While a lot of attention has been given to noise suppression in speech signals, LE classification and elimination represents a relatively new task, promising further improvements in natural environment speech recognition accuracy. Some speech databases of the Czech language include recordings in noisy conditions, e.g. SPEECON and Temic, but in most cases the recorded utterances do not contain LE due to a low level of environmental noise and/or a lack of speaker effort to react to the actual noise. In this paper, the design of an LE speech database and a recording platform for its collection are presented.

HMM Based VAD Using Token Passing Algorithm and Generalized Speech and Silence Models

  • Authors: Tatarinov, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: Proceedings of the 16th Conference Joined with the 15th Czech-German Workshop "Speech Processing". Dresden: TU Dresden, 2005. pp. 316-322. ISBN 3-938863-17-X.
  • Year: 2005

Influence of HMM's Parameters on the Accuracy of Phone Segmentation - Evaluation Baseline

  • Authors: doc. Ing. Petr Pollák, CSc., Volín, J., Skarnitzl, R.
  • Publication: Proceedings of the 16th Conference Joined with the 15th Czech-German Workshop "Speech Processing". Dresden: TU Dresden, 2005. pp. 302-309. ISBN 3-938863-17-X.
  • Year: 2005

LexEdit: GUI to Czech Pronunciation Lexicon for Speech Recognition Purposes

  • Authors: Brada, M., doc. Ing. Petr Pollák, CSc.,
  • Publication: Proceedings of the 16th Conference Joined with the 15th Czech-German Workshop "Speech Processing". Dresden: TU Dresden, 2005. pp. 260-266. ISBN 3-938863-17-X.
  • Year: 2005

Methods for Speech SNR Estimation: Evaluation Tool and Analysis of VAD Dependency

  • Department: Department of Circuit Theory
  • Annotation:
    The tool can estimate the SNR of a noisy speech signal with or without a reference signal. The tool can also be used to create a speech and noise mixture with a required SNR.

Voice Activity Detector Based on Sample Synchronous Probability Evaluation Using HMM

  • Authors: Tatarinov, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: Radioelektronika 2005 - Conference Proceedings. Brno: VUT v Brně, FEI, Ústav radioelektroniky, 2005. pp. 440-443. ISBN 80-214-2904-6.
  • Year: 2005

Czech Speech Database for Consumer Devices (SPEECON): Description and Experiences from Collection

  • Authors: doc. Ing. Petr Pollák, CSc.,
  • Publication: Speech Processing. Praha: AV ČR, Ústav radiotechniky a elektroniky, 2004. pp. 126-128. ISBN 80-86269-10-8.
  • Year: 2004

Direct Time Domain Fundamental Frequency Estimation of Speech in Noisy Conditions

  • Authors: Bořil, H., doc. Ing. Petr Pollák, CSc.,
  • Publication: EUSIPCO-2004 - Proceedings. Wien: Technische Universität, 2004. pp. 1003-1006. ISBN 3-200-00165-8.
  • Year: 2004

Experiments in Voice Activity Detection Using Hidden Markov Models

  • Authors: Tatarinov, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: Speech Processing. Praha: AV ČR, Ústav radiotechniky a elektroniky, 2004. pp. 102-105. ISBN 80-86269-11-6.
  • Year: 2004

Hidden Markov Models in Voice Activity Detection

  • Authors: Tatarinov, J., doc. Ing. Petr Pollák, CSc.,
  • Publication: Robust2004: Robustness Issues in Conversational Interaction. Brussels: COST Office, 2004.
  • Year: 2004

Orthographic and Phonetic Annotation of Very Large Czech Corpora with Quality Assessment

  • Authors: doc. Ing. Petr Pollák, CSc., Černocký, J.
  • Publication: LREC 2004 - IV. International Conference on Language Resources and Evaluation. Paris: ELRA - European Language Resources Association, 2004. pp. 595-598. ISBN 2-9517408-1-6.
  • Year: 2004

Additive Noise and Channel Distortion-Robust Parameterization Tool - Performance Evaluation on Aurora 2 & 3

Efficient and Reliable Measurement and Simulation of Noisy Speech Background

  • Authors: doc. Ing. Petr Pollák, CSc.,
  • Publication: Proceeding of the 11th European Signal Processing Conference. Bretagne: ENST, 2002.
  • Year: 2002

Tool for Czech Pronunciation Generation Combining Fixed Rules with Pronunciation Lexicon and Lexicon Management Tool

  • Authors: doc. Ing. Petr Pollák, CSc., Hanžl, V.
  • Publication: Proceedings of the Third International Conference on Language Resources and Evaluation. Paris: ELRA - European Language Resources Association, 2002. p. 1264-1269. ISBN 2-9517408-0-8.
  • Year: 2002

Czech Pronunciation Lexicon and Annotation of Very Large Databases

  • Authors: doc. Ing. Petr Pollák, CSc.,
  • Publication: Speech Processing - 11th Czech-German Workshop. Praha: AV ČR, Ústav radiotechniky a elektroniky, 2001. pp. 44-45. ISBN 80-86269-07-8.
  • Year: 2001

Methods for Estimation of Signal-to-Noise Ratio in Speech

SNR of Noisy Speech and Methods for Its Estimation

  • Authors: doc. Ing. Petr Pollák, CSc.,
  • Publication: Polish-Czech-Hungarian Workshop on Circuit Theory, Signal Processing and Telecommunications Network. Budapest: Technical University, 2001. pp. 33-40.
  • Year: 2001

SpeechDat-E: Five Eastern European Speech Databases for Voice-Operated Teleservices Completed

  • Authors: van den Heuvel, H., Boudy, J., Bakcsi, Z., Černocký, J., Galunov, V., Kochanina, J., Majewski, W., doc. Ing. Petr Pollák, CSc., Rusko, M., Sadowski, J., Staroniewicz, P., Tropf, H.
  • Publication: Eurospeech 2001 Scandinavia. Aalborg: Aalborg University, 2001. pp. 2059-2062. ISBN 87-90834-09-7.
  • Year: 2001

SpeechDat(E) - Eastern European Telephone Speech Databases

  • Authors: doc. Ing. Petr Pollák, CSc., Černocký, J., Boudy, J., Choukri, K., van den Heuvel, H., Vicsi, K., Virag, A., Siemund, R., Majewski, W., Staroniewicz, P., Tropf, H., Kochanina, J., Ostrouchov, A., Rusko, M., Trnka, M.
  • Publication: XLDB - Very Large Telephone Speech Databases. Paris: European Language Resources Association (ELRA), 2000. pp. 20-25.
  • Year: 2000

ASR with Noisy Speech Pre-processing and Phoneme Model Re-estimation

CAR2 - Czech Database of Car Speech

  • Department: Department of Circuit Theory
  • Annotation:
    This paper presents a new Czech-language two-channel (stereo) speech database recorded in a car environment. The database was designed for experiments with speech enhancement for communication purposes and for the study and design of robust speech recognition systems. Tools for automated phoneme labelling based on Baum-Welch re-estimation were implemented. A noise analysis of the car background environment was carried out.

Combined Noise Suppression System for Monaural Cochlear Implants

Czech Language Database of Car Speech and Environmental Noise

Czech Telephony Speech Database of 1000 Speakers - SpeechDat(E)

  • Authors: doc. Ing. Petr Pollák, CSc., Hanžl, V., Černocký, J.
  • Publication: Polish-Hungarian-Czech Workshop on Circuit Theory, Signal Processing, and Application. Praha: České vysoké učení technické v Praze, 1999. pp. 77-80. ISBN 80-01-02047-9.
  • Year: 1999

Generating Phonetically Rich Sentences and Words for Czech SpeechDat

  • Authors: Hanžl, V., doc. Ing. Petr Pollák, CSc., Černocký, J.
  • Publication: 9th Czech-German Workshop in Speech Processing. Praha: AV ČR, Ústav radiotechniky a elektroniky, 1999. pp. 15-16.
  • Year: 1999

Influence of Parameter Estimation in Kalman Filtering of Speech Signals

Phoneme Model Based ASR of Words in Car Environment

Real-Time Fixed-Point DSP-Implementation of Spectral Subtraction Algorithm for Speech Enhancement in Noisy Environment

Recording of Czech and Slovak Telephone Databases within SpeechDat-E

  • Authors: Černocký, J., doc. Ing. Petr Pollák, CSc., Rusko, M., Hanžl, V., Trnka, M.
  • Publication: Proceedings TSD'99. Berlin: Springer, 1999. p. 388-391. ISBN 3-540-66494-7.
  • Year: 1999
  • Department: Department of Circuit Theory
  • Annotation:
    Databases of five East European languages (Czech, Slovak, Russian, Polish and Hungarian) are being created within the SpeechDat-E project. This paper describes the overall design of the SpeechDat-E databases and concentrates on the Czech (1000 speakers) and Slovak (1000 speakers) databases. The item structure and recording specifications are presented. A more detailed description is included for the language-specific items. Attention is also paid to the geographic and dialect distribution of speakers. The paper also presents the recruitment strategy.

Czech Car Noisy Speech Database

Database of Car Speech, Analysis of Collected Data, Tools for Automated Labeling

Experimental Study of Speech Recognition in Noisy Environment

Suppression of Acoustic Noise in Speech Using Kalman Filtering

Study of Speech Recognition in Noisy Environment

  • Department: Department of Circuit Theory
  • Annotation:
    Achieving reliable performance of speech recognition in the car for mobile telephony applications has been studied intensively for more than one decade. This paper addresses the effects of mismatched conditions and their minimization with respect to the performance of speaker-independent isolated word recognition in the car-noise environment without consideration of the Lombard effect. The study is primarily intended to examine the dependence of the recognition rate on the SNR of an input signal without and with noise enhancement preprocessing, especially to find conditions under which the modified spectral subtraction can be effectively used for speech recognition in a real non-stationary car-noise environment. If the worst admissible recognition rate is, e.g., 80%, then the use of spectral subtraction methods allows a wider interval of input SNRs: for training on clean speech this interval is (40,6)(1) dB; for training on noisy speech this interval is (40,-2) dB; for training on enhanced speech this interval is (40,-8) dB. The third case gives the widest interval of SNRs in which a recogniser (with the final recognition rate in the interval of (100,80)%) can be used.

The Problems of Robust LPC Parametrization

Extended Spectral Subtraction

Extended Spectral Subtraction

Noise Cancellation Systems

Cepstral Speech/Pause Detectors

Implementation of Spectral Subtraction

Speech Detection in the Real Car Environment

Speech/Pause Detection for Real-Time Implementation of Spectral Subtraction Algorithm

The Study of Speech/Pause Detectors for Speech Enhancement Methods

Real-time Noise Suppression System on TMS320C30

  • Authors: Davídek, V., doc. Ing. Petr Pollák, CSc.,
  • Publication: Speech Processing on 4th Czech-German Workshop. Praha: AV ČR, Ústav radiotechniky a elektroniky, 1994. pp. 10-11. ISBN 80-901658-1-8.
  • Year: 1994

Speech Identification Algorithms in a Noisy Environment

  • Authors: doc. Ing. Petr Pollák, CSc.,
  • Publication: 31st Conference on Acoustics. Praha: České vysoké učení technické v Praze, 1994. pp. 147-151. ISBN 80-01-01146-1.
  • Year: 1994

Speech Recognition Systems

Noise Suppression System for a Car

Noise Suppression System for Speech Degraded in Running Car

Non-musical Tone Spectral Subtraction

One Channel Suppression System for a Car
