According to Prof. Holub, the broadband technologies used for the last 20 years or so, as well as the more recent use of surround sound, aim to reproduce the voice as realistically as possible, i.e., the listener should hear the caller "as if sitting by his side". Many people use these networks every day, for example in online communication applications and mobile phones.
Standardised methods are used to measure various quality parameters, including the quality of voice transmission. "These methods are both subjective, i.e. listening or conversational tests, and objective, where the tests are performed using algorithms that should give a similar result to the subjective tests," explained Prof. Holub, who together with his colleagues at the CTU Faculty of Electrical Engineering (FEL) has been working on the quality of voice transmission in communication networks since the late 1990s. He noted that these procedures are used, for example, when a mobile operator selects a new technology for deployment in its network.
How can the study be applied in practice?
The study, carried out by prof. Ing. Jan Holub, Ph.D., doc. RNDr. Kateřina Helisová, Ph.D., and postgraduate student Ing. Yann Kowalczuk, resulted in a draft recommendation of the European Telecommunications Standards Institute (ETSI). According to Prof. Holub, the member organizations that supported the FEL team's proposal included a number of companies as well as the NATO Communications and Information Agency (NCIA).
The draft recommendation was discussed by the ETSI STQ committee, adopted by the delegates after a number of comments had been incorporated, and published on 25 October 2023 as ETSI TR 103 950: Gender-related aspects of listening quality and effort in speech communication systems. This opens the way to a better balance of these aspects in the design of future codecs and transmission systems.
How the problem arises
According to Prof. Jan Holub, poorer transmission of female voices was previously known mainly from older narrowband connections, for example those using amplitude modulation (AM), which is still used for safety reasons in air traffic. "It is typical for narrowband that an average-sounding female voice, which has a higher-pitched fundamental tone and all of its energy higher in the spectrum, is frequency-clipped. So the information is technically harder to transmit than in the case of deeper male voices. However, this is 'nicely compensated' by the fact that narrowband transmissions tend to take place in noisy environments, and the noise 'masks' male voices, which lie in a similar part of the spectrum to the hustle and bustle," described Prof. Holub. Paradoxically, female voices are therefore sometimes better understood in real life, even though laboratory measurements show the opposite.
In the case of modern broadband and surround sound technologies, however, this "clipping" no longer occurs, and yet women's voices are often still transmitted less well. "The reasons why the difference arises are quite well known. It's always a trade-off between some new criterion and how much data needs to be transmitted per call," Prof. Holub outlined. "One of the criteria in the design of a modern digital encoder is the frame length. The speech signal is divided into overlapping parts. The shorter the frames, the more of them there are per minute or second. The longer they are, the fewer there are. If each of these sections is encoded through a library of instantaneous spectra into a finite number of bits, then in the final analysis, the longer the section, or the sparser the packetization, the less data is transmitted," the scientist explained. This, he said, also has implications for savings in the transmission network when, for example, part of the transmission path is leased. "Just because the female voice sits higher in the spectrum, a lot of the detail in a given time course is sped up. Hence, the longer the frames chosen, the harder the signal is to encode, as the encoder assumes that the signal inside a frame is quasi-steady. It cannot capture the rapid changes there well," the expert added.
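The frame-length trade-off described above can be illustrated with a little arithmetic. The following is only a minimal sketch, not the model used in the study: it assumes non-overlapping frames, a fixed number of bits per frame, and illustrative fundamental frequencies of 120 Hz and 210 Hz for a male and a female voice. All of these numbers and function names are hypothetical and do not come from the article.

```python
def frames_per_second(frame_ms: float) -> float:
    """Frames per second, assuming frames do not overlap (simplification)."""
    return 1000.0 / frame_ms


def bit_rate_bps(frame_ms: float, bits_per_frame: int) -> float:
    """Bit rate if every frame is coded into the same number of bits."""
    return frames_per_second(frame_ms) * bits_per_frame


def pitch_cycles_per_frame(frame_ms: float, f0_hz: float) -> float:
    """How many fundamental periods of a voice fit into one frame."""
    return f0_hz * frame_ms / 1000.0


if __name__ == "__main__":
    BITS_PER_FRAME = 160   # hypothetical codebook budget per frame
    MALE_F0_HZ = 120.0     # assumed illustrative value
    FEMALE_F0_HZ = 210.0   # assumed illustrative value

    for frame_ms in (10.0, 20.0, 40.0):
        rate = bit_rate_bps(frame_ms, BITS_PER_FRAME)
        male = pitch_cycles_per_frame(frame_ms, MALE_F0_HZ)
        female = pitch_cycles_per_frame(frame_ms, FEMALE_F0_HZ)
        print(f"{frame_ms:4.0f} ms frame: {rate:7.0f} bit/s, "
              f"{male:4.1f} male vs {female:4.1f} female pitch cycles per frame")
```

Under these assumptions, doubling the frame length halves the bit rate, but it also doubles the number of pitch cycles that the encoder has to treat as one quasi-steady segment, and that number is always larger for the higher-pitched voice.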
The first step towards improvement, he says, is to use shorter frames. "Which unfortunately has the direct consequence of increasing the required transmission speed. Or conversely, when designers are forced to fit a given bit rate, one option is to choose a sufficiently long speech frame. This is a known fact," said Prof. Holub.
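The reverse direction of this trade-off, fitting a given bit rate, can be sketched in the same way. This is again only an illustration under simplifying assumptions (non-overlapping frames, a fixed number of bits per frame); the function name and the numbers are hypothetical and are not taken from the study or from any particular codec.

```python
def min_frame_ms(bits_per_frame: int, max_bit_rate_bps: float) -> float:
    """Shortest frame length (in ms) that still fits the bit-rate budget.

    With a fixed number of bits per frame, the bit rate is
    bits_per_frame * (1000 / frame_ms), so the frame must be at least
    1000 * bits_per_frame / max_bit_rate_bps milliseconds long.
    """
    if bits_per_frame <= 0 or max_bit_rate_bps <= 0:
        raise ValueError("bits_per_frame and max_bit_rate_bps must be positive")
    return 1000.0 * bits_per_frame / max_bit_rate_bps


if __name__ == "__main__":
    BITS_PER_FRAME = 160  # hypothetical per-frame budget
    for budget_bps in (16000.0, 8000.0, 4000.0):
        frame_ms = min_frame_ms(BITS_PER_FRAME, budget_bps)
        print(f"budget {budget_bps:6.0f} bit/s -> frames of at least {frame_ms:.0f} ms")
```

In this simplified picture, halving the available bit rate forces the frames to become twice as long, which is exactly the direction that makes rapid changes in the signal harder to capture.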
"Then there are the frequency filters that occur on the route. They aim to reduce noise outside the speech spectrum. These filters are historically designed so that they can suppress the higher frequency components - including part of the female voice spectrum. That's an easily fixable thing, but it's worse with packetization because it just costs something," the researcher said. He stressed that the requirement to reduce the statistically significant difference between the transmission of the average male and female voice is justified.
According to him, the female voice was not deliberately neglected. "The packetization has simply evolved historically from narrowband systems, where the frames were even longer or even more poorly coded, and it hasn't caught up yet," Prof. Holub noted. In practice, the current state of affairs can mean that, for example, messages dictated over a radio link have to be repeated several times and the communication takes longer.