Perceptual Evaluation of Speech Quality (PESQ)
(Revised 02-11-04. All revisions in italics)
A Discussion
Paul Ordas and Brian Fox
Microtronix Systems Ltd.
In recent years a great deal of effort
has been expended on developing methods that determine the Quality of Service
(QoS) of networks through the use of comparative algorithms. These methods
are designed to calculate a quality index that correlates with the
mean opinion score given by human subjects in evaluation sessions. Typically
these methods use a recorded or simulated speech stimulus.
This stimulus is sent through the system under test and the output
signal is compared to the original.
There are a number of methods available,
but this article is restricted to one of the more modern ones, called PESQ
(Perceptual Evaluation of Speech Quality). The PESQ algorithm is designed
to predict subjective opinion scores of a degraded audio sample. PESQ returns
a score from -0.5 to 4.5, with higher scores indicating better quality.
PESQ is designed to analyze specific
parameters of audio, including time warping, variable delays, transcoding,
and noise. It is primarily intended for applications in codec evaluation
and network testing.
The idea of PESQ is very appealing
because it would seem that it could provide a set of automated "golden
ears" to evaluate any type of audio system and give a useful indication
of the "quality" of the system. Make no mistake, PESQ works very well when
used as intended, but some big surprises await those who attempt to replace
traditional telephone evaluation methods with PESQ.
At Microtronix we have been evaluating
PESQ for the purpose of applying it to VoIP Telephone Testing. This has
been requested by a number of people, and we therefore decided to evaluate
it to determine how it would work for that purpose. What we found made
it clear to us that although it is a useful method to incorporate into
a Telephone Testing System it must be used as an adjunct to traditional
methods. This is because PESQ was not designed to evaluate some of the
factors that determine the "quality" of a Telephone. For example, PESQ
does not take into account frequency response and loudness, two very important
factors that affect the perceived quality of a telephone terminal.
In order to demonstrate this we have
placed three files on this web page. The first file, OR272.WAV,
is an original file of speech in the Dutch language (Nederlands) that is
provided with the ITU specification document for PESQ.
The next file, DG001.WAV,
is a degraded version of the original file. It has been degraded by mixing
a low level of white noise with the OR272.WAV
file. This file is not audibly different from the original when heard at
normal listening levels.
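The kind of degradation described for DG001.WAV can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not give the noise level that was used, so the 60 dB signal-to-noise ratio, the 440 Hz test tone standing in for speech, and the helper names `add_white_noise` and `measured_snr_db` are all hypothetical.

```python
import math
import random


def add_white_noise(samples, snr_db, seed=0):
    """Return a copy of `samples` with Gaussian white noise added at `snr_db`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    signal_power = sum(s * s for s in samples) / len(samples)
    # Noise power that yields the requested signal-to-noise ratio.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def measured_snr_db(clean, noisy):
    """Measure the SNR actually achieved, from the clean and noisy signals."""
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum((n - s) ** 2 for s, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


FS = 8000  # sample rate in Hz, matching SAMPLE_FREQ in the PESQ results
# One second of a 440 Hz tone stands in for the speech file (assumption).
clean = [0.5 * math.sin(2 * math.pi * 440 * n / FS) for n in range(FS)]
noisy = add_white_noise(clean, snr_db=60.0)  # 60 dB is an assumed level
print(round(measured_snr_db(clean, noisy), 1))
```

At a ratio this high the added noise sits far below the signal, which is consistent with a degraded file that sounds the same as the original at normal listening levels.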
The third file, DG002.WAV,
is equalized so that it has far less low frequency and high frequency
energy than the original file. The degradation is clearly audible
when you listen to it, yet PESQ reports that the quality of DG001.WAV and
DG002.WAV
is the same!
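A band-limiting equalization of this general kind can be sketched with two simple one-pole filters. The article does not describe the equalizer that produced DG002.WAV, so the filter type, the 800 Hz and 1500 Hz cutoff frequencies, and the function names below are all assumptions chosen only to show low and high frequencies being attenuated relative to the midband.

```python
import math


def band_limit(samples, fs, low_cut_hz, high_cut_hz):
    """Apply a one-pole high-pass followed by a one-pole low-pass filter."""
    dt = 1.0 / fs
    rc_hp = 1.0 / (2 * math.pi * low_cut_hz)
    rc_lp = 1.0 / (2 * math.pi * high_cut_hz)
    a_hp = rc_hp / (rc_hp + dt)  # high-pass coefficient
    a_lp = dt / (rc_lp + dt)     # low-pass coefficient
    out = []
    prev_x = 0.0
    hp = 0.0
    lp = 0.0
    for x in samples:
        hp = a_hp * (hp + x - prev_x)  # removes low-frequency energy
        prev_x = x
        lp = lp + a_lp * (hp - lp)     # removes high-frequency energy
        out.append(lp)
    return out


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tone(freq_hz, fs, seconds=1.0):
    count = int(fs * seconds)
    return [math.sin(2 * math.pi * freq_hz * n / fs) for n in range(count)]


def gain_db(freq_hz, fs=8000, low_cut_hz=800.0, high_cut_hz=1500.0):
    """Gain of the sketch equalizer at one frequency, measured with a test tone."""
    x = tone(freq_hz, fs)
    y = band_limit(x, fs, low_cut_hz, high_cut_hz)
    return 20 * math.log10(rms(y) / rms(x))


for f in (100, 1000, 3500):
    print(f, "Hz:", round(gain_db(f), 1), "dB")
```

Measuring the gain at a few test frequencies in this way is a crude form of the frequency response measurement that traditional telephone test methods make directly.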
Below are the reported results given
by PESQ when comparing these files to OR272.WAV.
| DEGRADED | PESQMOS | SUBJMOS | COND | SAMPLE_FREQ | CRUDE_DELAY |
| dg001.wav | 4.431 | 0.000 | 0 | 8000 | -0.3600 |
| dg002.wav | 4.431 | 0.000 | 0 | 8000 | 0.1360 |
Both degraded files have a PESQ score
of 4.431 but the file degraded by white noise is virtually indistinguishable
from the original, while the file degraded by poor frequency response is
audibly of lower quality.
This discussion should not be
interpreted to imply that there is any flaw with PESQ. PESQ does not
attempt to define what 'quality' is; the purpose of the PESQ algorithm is
to objectively predict the subjective mean opinion scores in a P.800
listening setup. We believe that PESQ does what it was intended to
do, but users of PESQ must understand the scope of ITU-T P.862. The
PESQ scope does not include effects of loudness loss (ITU-T P.862 Table
2), nor frequency response variations of less than 20 dB (ITU-T P.862
10.2.6), and it is not validated for acoustic terminal testing (ITU-T
P.862 Table 3).
Listen to the Files here:
OR272.wav - Original File
DG001.wav - File Degraded with low level White Noise
DG002.wav - File Degraded by narrow band frequency response
Conclusion
PESQ can be used in addition to other
methods when evaluating the performance of a telephone terminal, but PESQ
alone cannot ensure good telephone quality. In order to fully evaluate
a telephone it is important to use methods like those called for in the
TIA/EIA-810-A standard. Frequency response, loudness ratings and other
traditional telephone measurements, used in conjunction with PESQ, can guarantee
that VoIP telephones provide a quality of service that is equal to or better
than that of conventional POTS telephones.