Publications

We have published the results of our work in high-impact international scientific journals and at a number of prestigious international congresses and conferences. The most important recent publications are:

Robust Neural Network-Based Estimation of Articulatory Features for Czech

  • DOI: 10.14311/NNW.2014.24.027
  • Link: https://doi.org/10.14311/NNW.2014.24.027
  • Department: Department of Circuit Theory
  • Annotation:
    The article describes neural network-based articulatory feature (AF) estimation for Czech speech. First, the relationship between AFs and the Czech phone inventory is defined, and then the estimation is performed with MLP neural networks. Several speech representations are proposed as inputs to the MLP classifiers in order to obtain a robust AF estimation. The experiments have shown that ANN-based AF estimation works very reliably, especially in a low-noise environment. Moreover, when the number of neurons in the hidden layer is increased and temporal-context DCT-TRAP features are used as the MLP input, the AF classification is also accurate for signals collected in environments with high background noise.

Performance of Czech Speech Recognition with Language Models Created from Public Resources

  • Authors: Procházka, V., doc. Ing. Petr Pollák, CSc., Žďánský, J., Nouza, J.
  • Publication: Radioengineering. 2011, 40(4), 1002-1008. ISSN 1210-2512.
  • Year: 2011
  • Department: Department of Circuit Theory
  • Annotation:
    In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LMs) applicable to Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets: the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus created from the Czech National Corpus. We also tested an LM built from a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared via their perplexity rates and when employed in large vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made of smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as a highly inflective language on the perplexity, OOV, and recognition accuracy rates.

ASR systems in Noisy Environment: Analysis and Solutions for Increasing Noise Robustness

  • Department: Department of Circuit Theory
  • Annotation:
    This paper deals with the analysis of Automatic Speech Recognition (ASR) suitable for use in noisy environments and suggests the optimum configuration under various noisy conditions. The behavior of standard parameterization techniques was analyzed from the viewpoint of robustness against background noise. This was done for Mel-frequency cepstral coefficients (MFCC), perceptual linear predictive (PLP) coefficients, and their modified forms combining the main blocks of PLP and MFCC. The second part is devoted to the analysis and contribution of modified techniques containing frequency-domain noise suppression and voice activity detection (VAD). The above-mentioned techniques were tested on signals from real noisy environments within a Czech digit recognition task and on the AURORA databases. Finally, the contribution of special VAD-selective training and MLLR adaptation of acoustic models was studied for various signal features.

Methods for Speech SNR Estimation: Evaluation Tool and Analysis of VAD Dependency

  • Department: Department of Circuit Theory
  • Annotation:
    The tool can estimate the SNR of a noisy speech signal with or without a reference signal. It can also be used to create a speech and noise mixture with a required SNR.
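
    The mixture-creation step mentioned in the annotation above can be illustrated with a minimal Python sketch. It is not the published tool: the function names `mix_at_snr` and `global_snr_db` are hypothetical, and the sketch assumes a global SNR computed over the whole signal, without any VAD-based selection of speech frames.

```python
import numpy as np


def global_snr_db(speech, noise):
    """Global SNR in dB: ratio of mean speech power to mean noise power."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested global SNR.

    Hypothetical helper, not part of the published tool.
    Returns the mixture and the scaled noise that was added.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]

    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    if p_speech == 0.0 or p_noise == 0.0:
        raise ValueError("speech and noise must both have non-zero power")

    # Solve p_speech / (gain**2 * p_noise) = 10**(snr_db / 10) for gain.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise
```

    Because the scaled noise is returned alongside the mixture, it can serve as the reference signal when checking an SNR estimator against a known target value.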

Speech reduction in Czech

  • Authors: Kolman, A., doc. Ing. Petr Pollák, CSc.
  • Publication: LabPhon 14. The 14th Conference on Laboratory Phonology. Tokyo: National Institute for Japanese Linguistics in Tokyo, 2014. Available from: http://www.ninjal.ac.jp/labphon14/LP14_FINAL_20140708.pdf
  • Year: 2014
  • Department: Department of Circuit Theory
  • Annotation:
    The present study contributes to this research by investigating speech reduction in Czech. Our study is based on the Nijmegen Corpus of Casual Czech, which was recorded in Prague in November 2008 and consists of 39 hours of casual conversations between 26 groups of three friends. We studied speech reduction in this corpus by focusing on a number of frequent words and frequent phoneme sequences. First, we see patterns that have also been observed in other languages. Second, Czech also shows clear effects of morphology, which have not been attested for other languages so far. A third interesting topic concerns syllabic consonants: we find that the probability of a segment being absent is modulated by the complexity of the resulting consonant cluster. Our study of Czech clearly shows that it is worthwhile to extend the study of reduction to typologically different languages.

Impact of Irregular Pronunciation on Phonetic Segmentation of Nijmegen Corpus of Casual Czech

  • Authors: Mizera, P., doc. Ing. Petr Pollák, CSc., Kolman, A., Ernestus, M.
  • Publication: Text, Speech, and Dialogue. 17th International Conference, TSD 2014. Heidelberg: Springer, 2014. pp. 499-507. Lecture Notes in Artificial Intelligence. ISSN 0302-9743. ISBN 978-3-319-10815-5.
  • Year: 2014
  • DOI: 10.1007/978-3-319-10816-2_60
  • Link: https://doi.org/10.1007/978-3-319-10816-2_60
  • Department: Department of Circuit Theory
  • Annotation:
    This paper describes a pilot study of phonetic segmentation applied to the Nijmegen Corpus of Casual Czech (NCCCz). This corpus contains informal speech of a strongly spontaneous nature, which influences the character of the produced speech at various levels. This work is part of wider research related to the analysis of pronunciation reduction in such informal speech. We present an analysis of the accuracy of phonetic segmentation when canonical or reduced pronunciation is used. The achieved segmentation accuracy provides information about the general accuracy of the acoustic modelling that is supposed to be applied in spontaneous speech recognition. As a byproduct of the presented spontaneous speech segmentation, this paper also describes the created lexicon with canonical pronunciations of words in NCCCz, a tool supporting the pronunciation check of lexicon items, and finally a mini-database of selected utterances from NCCCz, manually labelled at the phonetic level and suitable for evaluation purposes.

Small and Large Vocabulary Speech Recognition of MP3 Data under Real-Word Conditions: Experimental Study

  • Authors: doc. Ing. Petr Pollák, CSc., Borský, M.
  • Publication: Communications in Computer and Information Science. 2012, 314, 409-419. ISSN 1865-0929.
  • Year: 2012
  • DOI: 10.1007/978-3-642-35755-8_29
  • Link: https://doi.org/10.1007/978-3-642-35755-8_29
  • Department: Department of Circuit Theory
  • Annotation:
    This paper presents a study of speech recognition accuracy for both small and large vocabulary tasks with respect to different levels of MP3 compression of the processed data. The motivation behind the work was to evaluate the use of an ASR system for off-line automatic transcription of recordings collected by standard present-day MP3 devices under different levels of background noise and channel distortion. Although MP3 may not be an optimal compression algorithm, the performed experiments have proved that it does not distort the speech signal significantly for higher compression rates. The experiments also showed that the accuracy of speech recognition (both small- and large-vocabulary) decreased very slowly for bit rates of 24 kbps and higher. However, a slightly different setup of speech feature computation is necessary for MP3 speech data; in particular, PLP features give significantly better results than MFCC.

Accuracy of MP3 Speech Recognition Under Real-World Conditions. Experimental Study

  • Authors: doc. Ing. Petr Pollák, CSc., Běhunek, M.
  • Publication: Proceedings of SIGMAP 2011 - International Conference on Signal Processing and Multimedia Applications. Sevilla: University of Seville, 2011. pp. 5-10. ISBN 978-989-8425-72-0.
  • Year: 2011
  • Department: Department of Circuit Theory
  • Annotation:
    This paper presents a study of speech recognition accuracy with respect to different levels of MP3 compression. Special attention is paid to the processing of speech signals of different quality, i.e. with different levels of background noise and channel distortion. The work was motivated by the possible use of ASR for offline automatic transcription of audio recordings collected by standard widespread MP3 devices. The experiments have proved that the MP3 format does not distort speech significantly, especially for high or moderate bit rates and high-quality source data. The accuracy of connected-digit ASR decreased very slowly down to a bit rate of 24 kbps. For the best case of PLP parameterization in the close-talk channel, just a 3% decrease in recognition accuracy was observed while the size of the compressed file was approximately 10% of the original size. All results were slightly worse in the presence of additive background noise and channel distortion.

The optimization of PLP feature extraction for LVCSR recognition of MP3 data

  • Authors: Borský, M., doc. Ing. Petr Pollák, CSc.
  • Publication: 19th International Conference on Applied Electronics 2014. Pilsen: University of West Bohemia, 2014. p. 55-58. ISSN 1803-7232. ISBN 978-80-261-0276-2.
  • Year: 2014
  • Department: Department of Circuit Theory
  • Annotation:
    This paper analyses the contribution of an optimized PLP feature extraction setup and the application of feature normalization to improving the performance of an automatic speech recognition system for data compressed by the MP3 algorithm. The experimental study performed on loop-digit recognition and a large vocabulary continuous speech recognition task showed that proper setup can negate the effect of lower compression rates which can achieve results comparable with higher rates. The second finding is that the normalization techniques contribute significantly to overall performance, especially for shorter windows/shifts and lower compression rates. The acoustic models trained on 160 kbit/s, 32 kbit/s and 16 kbit/s data performed at 34.17%, 41.88% and 36.4% WER respectively on the LVCSR task. In comparison, the noncompressed acoustic models performed at 28.56% WER.

Multi-Channel Database of Spontaneous Czech with Synchronization of Channels Recorded by Independent Devices

  • Authors: doc. Ing. Petr Pollák, CSc., Rajnoha, J.
  • Publication: Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10). Paris: ELRA, 2010. ISBN 2-9517408-6-7.
  • Year: 2010
  • Department: Department of Circuit Theory
  • Annotation:
    This paper describes a Czech spontaneous speech database of lectures collected at the Czech Technical University in Prague, together with the procedure of its recording and annotation. Special attention is paid to the description of the time synchronization of signals recorded by two independent devices. This synchronization is based on cross-correlation analysis with simple automated selection of suitable short signal subparts. The database contains 21.7 hours of speech material recorded in 4 channels with 3 principally different microphones. The annotation of the database is composed of basic time segmentation, orthographic transcription, a pronunciation lexicon, session and speaker information, and the documentation. The collection and annotation of this database is complete and its availability via ELRA is currently under preparation.
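
    The cross-correlation idea behind the channel synchronization described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure from the paper: the function name `estimate_delay` is hypothetical, the two signals are assumed to share a sampling rate, and the automated selection of suitable short subparts is left out.

```python
import numpy as np


def estimate_delay(reference, delayed):
    """Estimate by how many samples `delayed` trails `reference`.

    Hypothetical helper: returns the lag that maximizes the
    cross-correlation of the two signals. A positive result means
    `delayed` starts later than `reference`.
    """
    reference = np.asarray(reference, dtype=float)
    delayed = np.asarray(delayed, dtype=float)

    # Full cross-correlation covers every possible relative shift.
    xcorr = np.correlate(delayed, reference, mode="full")

    # Output index 0 corresponds to the lag -(len(reference) - 1).
    return int(np.argmax(xcorr)) - (len(reference) - 1)
```

    Direct correlation as used here scales quadratically with signal length, which is one reason to run it on short subparts rather than on whole recordings.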

Responsible person Ing. Mgr. Radovan Suk