Accurately Measuring VoIP Performance (By Robert J. Merrill)

Author Profile - Robert Merrill is Telchemy’s Creative Writer. Before joining Telchemy in 2007, he worked for eight years (and surfed a succession of corporate mergers) as an Engineering Technical Writer for BellSouth.net, BellSouth Science and Technology, and AT&T Labs. He enjoys writing, learning foreign languages, riding motorcycles, and drinking craft-brewed and imported beer. Robert suggests that riding motorcycles and drinking beer, no matter how finely brewed, should not be done simultaneously, and that tasting should begin only after the cycle is put up for the day. Robert writes and works with Dr. Alan Clark, the Founder and CEO of Telchemy, and his skill at accurately reviewing technologies is well regarded and formidable.
Overview
As almost anyone who has experienced voice over IP at home or work can attest, VoIP, for all its cost and convenience benefits, is hardly problem-free with respect to reliability and quality of service. If Hum, Crackle, and Pop were the lovable elves of the PSTN, their Orc counterparts—Echo, Garble, and Chop—have proven far less endearing to VoIP users. And although the ubiquitous cell phone has helped to lower the bar of expectation for acceptable voice quality, there remains a limit to the amount of noise, distortion, or delay most callers will tolerate before becoming annoyed. Put simply, the overall goal of VoIP quality measurement is to determine what that annoyance threshold is, to know when it’s being crossed, and to figure out the cause and thus (hopefully) prevent it from happening again.
In the VoIP world, annoyances take the form of gaps in speech, echo, “robotic” or hollow-sounding speech, clipped speech, and various kinds of distortion and noise. A number of factors can contribute to these problems, but the big three causes are the codec itself and the levels of packet loss and jitter (i.e., excessive variation in the packet arrival interval, which can lead to packets being discarded).
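To make the jitter definition concrete, here is a minimal Python sketch of the running interarrival jitter estimate described in RFC 3550 (the RTP specification). The function name and the sample timestamps are made up for illustration, and times are in milliseconds rather than RTP timestamp units.

```python
def interarrival_jitter(send_times_ms, recv_times_ms):
    """Running jitter estimate in the style of RFC 3550.

    For each pair of consecutive packets, D is the change in transit time:
    (receive spacing) - (send spacing). The estimate moves 1/16 of the way
    toward |D| on every packet, which smooths out single spikes.
    """
    jitter = 0.0
    for k in range(1, len(send_times_ms)):
        recv_gap = recv_times_ms[k] - recv_times_ms[k - 1]
        send_gap = send_times_ms[k] - send_times_ms[k - 1]
        d = recv_gap - send_gap
        jitter += (abs(d) - jitter) / 16.0
    return jitter


# Hypothetical packets sent every 20 ms; the third arrives 5 ms late.
sent = [0, 20, 40, 60]
received = [50, 70, 95, 110]
jitter_ms = interarrival_jitter(sent, received)
print(f"estimated jitter: {jitter_ms:.3f} ms")
```

A receiver’s jitter buffer absorbs variation up to its configured depth; packets that arrive later than that are the ones that end up discarded.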
All codecs impact call quality to some degree, simply as a byproduct of the digitization, compression, and packetization process—generally speaking, the lower the bitrate used, the greater the level of distortion, and the lower the nominal (maximum possible) MOS for a call transmitted using that codec. (For example, using G.711 the nominal MOS is 4.4 on the ITU scale or 4.1 on the ACR scale; I’ll say more about MOS scaling later.)
The impairment level introduced by the codec is static, and in most cases negligible. Packet loss and jitter, however, can vary widely and have a number of causes—LAN or access link congestion, route flapping, timing drift, and many other issues related to queuing, bandwidth, etc. Furthermore, the way packet loss and discard are distributed during a call helps determine whether or not the degradation is apparent to the end user. A consistent level of low packet loss may not be noticeable; however, loss/discard events typically occur in “bursts” that can degrade voice quality for several seconds at a time. To accurately measure voice quality as perceived by the end user, any performance analysis technology used by probes/analyzers and VoIP endpoint equipment should take into account the bursty nature of packet loss/discard. (As it happens, Telchemy’s VQmon technology does just that—which I’ll explain a bit further on.)
Testing Methods
So how is voice quality measured? The “gold standard” is subjective testing, a common form being the Absolute Category Rating (ACR) test: enlisting a pool of test subjects to listen to the audio sample and rate the quality on an “Opinion Score” scale of 1 to 5, with 5 being “Excellent,” 3 “Fair,” and 1 “Unacceptable.” From these results, a Mean Opinion Score (MOS) for voice quality can be calculated. Being subjective, results can of course vary, but given a large enough pool (generally 16 or more) the scores tend to stabilize sufficiently.
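The arithmetic behind a MOS is simply an average of the panel’s ratings. The sketch below uses made-up ratings from a hypothetical 16-listener panel and also reports a rough 95% confidence interval. The normal approximation used for that interval is my assumption, not part of any test procedure described here; it is only meant to show why a larger pool gives a more stable score.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    mos = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) is the standard error.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical opinion scores (1 to 5) from 16 listeners for one sample.
ratings = [4, 4, 5, 3, 4, 4, 5, 4, 3, 4, 4, 5, 4, 4, 3, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Because the half-width shrinks with the square root of the panel size, quadrupling the number of listeners roughly halves the uncertainty.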
For voice tests, it’s useful to determine separate MOS scores for listening and conversational quality (MOS-LQ, MOS-CQ). Some impairments—such as echo and delay—may not be apparent to someone who is merely listening, but even a brief delay can make carrying on a conversation extremely frustrating. (The intensity of interaction also plays a role in the perception of quality; the same level of delay is likely to be more noticeable during a heavy business negotiation or argument than in a casual tête-à-tête.)
The subjective method is still commonly used for codec testing, but isn’t a workable solution for everyday performance management of VoIP service; few network managers would consider it practical to poll individual callers’ opinions of every call, let alone employ a pool of eavesdropping judges at each end.
PESQ, or Perceptual Evaluation of Speech Quality (P.862), and its predecessor PSQM, or Perceptual Speech Quality Measure (P.861), introduced objective methods of evaluating the speech quality of narrowband codecs by comparing a voice signal that has traveled through a VoIP network with an original version of the same signal and analyzing the level of distortion. While PESQ does return a MOS without the participation of our panel of judges, it has its drawbacks. It is a one-way measurement that doesn’t consider the effects of delay, sidetone level, echo, etc., which would be factors in a real-life conversation; therefore, the resulting MOS represents listening but not conversational quality. Moreover, this method requires access to both the original and the degraded sample, making it suitable for specific testing applications but impractical for day-to-day measurement of VoIP call quality.
Enter non-intrusive (a.k.a. passive) monitoring, a method of calculating quality scores and other performance metrics by analyzing live voice traffic mid-stream or at one or both endpoints (or ideally, using a combination of mid-stream measurement and endpoint reporting). The most accurate results are obtained using a distributed model, in which performance analysis agents (such as Telchemy’s VQmon) are integrated into strategically placed mid-stream probes/analyzers as well as embedded into IP phones and gateways. Endpoints equipped with reporting agents generate and send quality reports at the end of each call or periodically throughout the call, inserting RTCP SR/RR/XR payloads into the returning voice stream. These reports can then be captured by the probe/analyzer and correlated with the metrics it gathers via its own direct monitoring of the packet stream—and/or they can be collected by a central mediation and management application, such as Telchemy’s SQmediator. This approach has a number of advantages:
- MOS scores and performance metrics are calculated at the handset, ensuring the highest possible degree of accuracy.
- Endpoints produce jitter buffer statistics and analog metrics including signal, noise, and echo levels, which are not available using mid-stream analysis alone.
- Calls are monitored and analyzed non-intrusively, making it possible to track the quality of every call placed across the network without a significant increase in network load.
- Analysis is performed on live call streams without using packet capture, avoiding the potential legal implications of capturing and storing actual voice data.
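For a sense of what an endpoint report carries, here is a hedged sketch that decodes the VoIP Metrics report block defined for RTCP XR in RFC 3611. The field layout follows my reading of that RFC; the function name, the dictionary keys, and the sample values are illustrative, and a real receiver would first have to locate the block inside the enclosing RTCP XR packet.

```python
import struct

# Assumed layout of the RFC 3611 VoIP Metrics block (36 bytes, network order).
VOIP_METRICS = struct.Struct(
    "!" + "BBH"    # block type (7), reserved, block length (8)
    + "I"          # SSRC of source
    + "BBBB"       # loss rate, discard rate, burst density, gap density
    + "HH"         # burst duration, gap duration (ms)
    + "HH"         # round trip delay, end system delay (ms)
    + "bbBB"       # signal level, noise level (signed), RERL, Gmin
    + "BBBB"       # R factor, ext. R factor, MOS-LQ, MOS-CQ
    + "BBH"        # RX config, reserved, JB nominal (ms)
    + "HH"         # JB maximum, JB abs max (ms)
)


def _or_none(value, scale=1.0):
    """127 marks a score as unavailable in this block."""
    return None if value == 127 else value / scale


def parse_voip_metrics(block):
    if len(block) < VOIP_METRICS.size:
        raise ValueError("block too short")
    (bt, _r1, length, ssrc, loss, discard, burst_den, gap_den,
     burst_ms, gap_ms, rtt_ms, end_sys_ms, signal, noise, rerl, gmin,
     r, ext_r, mos_lq, mos_cq, _rx, _r2,
     jb_nom, jb_max, jb_abs) = VOIP_METRICS.unpack_from(block)
    if bt != 7 or length != 8:
        raise ValueError("not a VoIP Metrics block")
    return {
        "ssrc": ssrc,
        "loss_rate": loss / 256, "discard_rate": discard / 256,
        "burst_density": burst_den / 256, "gap_density": gap_den / 256,
        "burst_duration_ms": burst_ms, "gap_duration_ms": gap_ms,
        "round_trip_delay_ms": rtt_ms, "end_system_delay_ms": end_sys_ms,
        "signal_level_dbm": signal, "noise_level_dbm": noise,
        "rerl_db": rerl, "gmin": gmin,
        "r_factor": _or_none(r), "ext_r_factor": _or_none(ext_r),
        "mos_lq": _or_none(mos_lq, 10.0), "mos_cq": _or_none(mos_cq, 10.0),
        "jb_nominal_ms": jb_nom, "jb_max_ms": jb_max, "jb_abs_max_ms": jb_abs,
    }


# Hypothetical block, built here only so the parser has something to read.
sample = VOIP_METRICS.pack(7, 0, 8, 0x1234, 5, 2, 64, 1, 120, 4000, 80, 60,
                           -20, -65, 55, 16, 93, 127, 42, 41, 0, 0, 40, 80, 120)
metrics = parse_voip_metrics(sample)
print(metrics["r_factor"], metrics["mos_lq"], metrics["mos_cq"])
```

Note that the rate and density fields are fractions of 256 and the MOS fields are stored as ten times the score, so a little decoding is needed before the numbers look familiar.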

This distributed model offers a scalable architecture for non-intrusive VoIP performance monitoring—but how are quality scores actually calculated, and how accurate are they?
The E Model
The E Model, standardized by the ITU in 1998 as Recommendation G.107, provides a method for calculating a single metric representing voice quality—the R-factor—which can then be converted into estimated listening and conversational quality MOS scores. Nominal R-factor scores range from 0 to 120, with typical scores ranging from 50-94 for narrowband telephony and 50-110 for wideband telephony. The E Model has a number of advantages: it lends itself to non-intrusive monitoring applications, provides accurate and repeatable results, and requires a small fraction of the computation required by process-intensive algorithms such as P.563 (another passive method of calculating call quality).
The E Model is based on the premise that the effects of impairments are additive. The basic equation is:
R = Ro - Is - Id - Ie + A
Where Ro is the base signal-to-noise ratio; Is represents impairments occurring simultaneously with speech (such as loudness, quantization, and improper sidetone level); Id represents delayed impairments such as echo and delay; Ie represents impairments introduced by the VoIP equipment, such as codec distortion, packet loss, and packet discard; and A is the “Advantage Factor,” representing the user’s expectation of quality related to the convenience of the method used—for example, as cell phones offer more convenience, their users are inclined to be more forgiving of quality problems encountered while using them.
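The additive structure is easy to see in code. In this sketch the function name is mine and every input value is illustrative: the first three are chosen so that an unimpaired call lands near an R-factor of 93, and the equipment impairment of 20 and Advantage Factor of 10 are hypothetical numbers picked only to show the effect of each term.

```python
def r_factor(ro, i_s, i_d, i_e, a=0.0):
    """E Model basic equation: R = Ro - Is - Id - Ie + A."""
    return ro - i_s - i_d - i_e + a


# Illustrative inputs, not taken from the article or from any measurement.
clean_call = r_factor(ro=94.77, i_s=1.41, i_d=0.15, i_e=0.0)

# The same call with a hypothetical equipment impairment (codec plus loss).
lossy_landline = r_factor(ro=94.77, i_s=1.41, i_d=0.15, i_e=20.0)

# Identical impairments, but the user is on a mobile and more forgiving.
lossy_mobile = r_factor(ro=94.77, i_s=1.41, i_d=0.15, i_e=20.0, a=10.0)

for name, r in [("clean", clean_call),
                ("lossy landline", lossy_landline),
                ("lossy mobile", lossy_mobile)]:
    print(f"{name}: R = {r:.1f}")
```

Because the terms simply add and subtract, each impairment’s share of the total degradation can be read directly from its own value.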
The R-factor produced by the E Model can be converted to a MOS score, but here’s where it gets somewhat sticky. Although ITU G.107 provides a mapping function for converting R to MOS, recent data from subjective ACR tests suggests that MOS scores should actually be mapped slightly lower. For example, the nominal R-factor for an unimpaired G.711 call is 93, which (as I mentioned earlier) equates to a MOS of 4.4 using the ITU scaling, but 4.2 using ACR scaling.
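For reference, here is a small sketch of the ITU mapping as I understand it from G.107: a cubic in R between 0 and 100, pinned at 1.0 below and 4.5 above. The function name is mine, and the final clamp is my own addition to keep very low R values from dipping fractionally below 1. The ACR and TTC scalings are not shown because their formulas are not given in this article.

```python
def r_to_mos(r):
    """Convert an R-factor to an estimated MOS using the G.107 mapping."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    mos = 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6
    # Clamp (an assumption of this sketch) so the result stays in 1.0 to 4.5.
    return min(4.5, max(1.0, mos))


for r in (50, 70, 80, 93):
    print(f"R = {r:3d}  ->  MOS = {r_to_mos(r):.2f}")
```

With R = 93 this gives roughly 4.4, which matches the ITU-scaled figure quoted for unimpaired G.711.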
As if that weren’t enough, there exists yet another scaling method developed by the Japanese Telecommunication Technology Committee (TTC), based on their own subjective tests, which results in even lower MOS scores than the ITU and ACR scalings. The Japanese, it seems, tend to be somewhat tougher judges than their friends in Europe and the United States when it comes to call quality.
The following diagram shows the relationship between G.107 scaling, ACR scaling, and Japanese TTC scaling when converting R-factor to MOS.

Another potential point of confusion: although wideband codecs can range higher on the R-factor scale than narrowband codecs (typical R-factor of 50-110 compared to 50-94 for narrowband), the same 1-5 MOS scale applies to both. It’s therefore entirely possible for a wideband codec to produce a lower MOS than a narrowband codec, even though the wideband codec sounds better. This should be taken into consideration when determining “acceptable” MOS values for a particular codec, particularly if you are using both wide- and narrowband codecs and comparing the values of each.
Enhancing the E Model: VQmon
The E Model, while highly useful, has some limitations that impact its overall accuracy when determining call quality under real-world conditions. To address these issues, Telchemy introduced VQmon, a performance analysis agent based upon the E Model, with a number of extensions and improvements that I’ll now describe. (Warning: this section may contain mild horn-blowing.)
One of the drawbacks of the E Model is its failure to consider how the time-varying nature of IP impairments impacts call quality as perceived by the end user. Packet loss and discard events tend to be “bursty,” with periods of relatively good quality interrupted by brief periods of degradation, which can last for several seconds and be highly noticeable to users even though the overall rate of packet loss/discard for the call may remain relatively low. When Robot A calls Robot B, they may be perfectly content to appraise the call’s quality by simply counting the lost packets; with our inferior carbon-based brains, we tend to be more sensitive to the actual distribution of loss.
VQmon uses a statistical model to learn the distribution of lost and discarded packets during a call, and includes this information in its quality calculation. This improves the accuracy of the reported MOS and R-factor scores and enables VQmon to report specific metrics for burst periods (where quality is degraded) and gap periods (where quality is relatively good). The 4-state Markov Model applied by VQmon is depicted below.
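VQmon’s own Markov model is not spelled out in this article, so the following is not it. It is a simplified sketch of the burst/gap idea, loosely following the burst definition used for RTCP XR in RFC 3611: losses separated by fewer than Gmin (16 by default) good packets are grouped into one burst, and everything else, including isolated losses, counts as gap. The function names and the two synthetic calls are assumptions for illustration.

```python
def find_bursts(lost, gmin=16):
    """Return (first, last) packet indexes of each burst in a loss sequence."""
    bursts = []
    start = prev = None
    for i, is_lost in enumerate(lost):
        if not is_lost:
            continue
        if start is None:
            start = prev = i
        elif i - prev - 1 < gmin:      # too few good packets: same burst
            prev = i
        else:                          # long good run: close the group
            if prev > start:           # a single isolated loss is not a burst
                bursts.append((start, prev))
            start = prev = i
    if start is not None and prev > start:
        bursts.append((start, prev))
    return bursts


def burst_gap_stats(lost, gmin=16):
    bursts = find_bursts(lost, gmin)
    burst_len = sum(end - begin + 1 for begin, end in bursts)
    burst_lost = sum(sum(lost[begin:end + 1]) for begin, end in bursts)
    gap_len = len(lost) - burst_len
    gap_lost = sum(lost) - burst_lost
    return {
        "overall_loss": sum(lost) / len(lost),
        "burst_density": burst_lost / burst_len if burst_len else 0.0,
        "gap_density": gap_lost / gap_len if gap_len else 0.0,
        "bursts": bursts,
    }


# Two synthetic 1000-packet calls, each losing 20 packets (2% overall).
even = [i % 50 == 0 for i in range(1000)]                    # one loss every 50
bursty = [500 <= i < 540 and i % 2 == 0 for i in range(1000)]  # all in one clump

even_stats = burst_gap_stats(even)
bursty_stats = burst_gap_stats(bursty)
print(even_stats["overall_loss"], even_stats["burst_density"])
print(bursty_stats["overall_loss"], bursty_stats["burst_density"])
```

Both calls report the same overall loss rate, yet only the second contains a burst, with roughly half of its packets missing for the duration. That difference is exactly what a simple packet count hides.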

To minimize the effects of lost or discarded packets, many types of VoIP equipment employ Packet Loss Concealment (PLC) algorithms at the receiving end of the stream. PLC techniques—which include repeating the previous packet received or simply inserting silence—are most effective on small numbers of consecutive lost packets (e.g., 20-30 ms of lost speech) when the packet loss rate is fairly low. VQmon measures the length of loss periods and calculates the effectiveness of the PLC algorithm in use, including the data in its quality score computation.
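The two simple PLC techniques mentioned above can be sketched in a few lines. This is a toy illustration, not any vendor’s or codec’s actual concealment algorithm: the function name is hypothetical, frames are plain lists of samples, a lost frame is represented by None, and the default frame length of 160 samples assumes 20 ms of 8 kHz audio.

```python
def conceal(frames, mode="repeat", frame_len=160):
    """Fill lost frames (None) by repeating the last good frame or with silence."""
    output = []
    last_good = None
    for frame in frames:
        if frame is not None:
            last_good = frame
            output.append(frame)
        elif mode == "repeat" and last_good is not None:
            output.append(list(last_good))   # replay the previous frame
        else:
            output.append([0] * frame_len)   # silence insertion
    return output


# Tiny two-sample frames so the result is easy to read; two frames are lost.
received = [[1, 2], [3, 4], None, None, [5, 6]]
repeated = conceal(received, mode="repeat", frame_len=2)
silenced = conceal(received, mode="silence", frame_len=2)
print(repeated)
print(silenced)
```

Replaying the same frame several times in a row quickly starts to sound robotic, which is one way to see why concealment works best on short losses.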
VQmon also addresses three types of phenomena related to human perception when calculating R-factor and estimated MOS scores. The first involves the natural delay in listener reaction time when quality changes from good to bad (or vice versa); the second relates to the “recency effect,” or the tendency of listeners to judge call quality problems more harshly when they occur closer to the end of the call—much the way a burnt potato chip, when eaten last, may sour one’s recollection of the rest of the bag; and the third relates to the way that listeners perceive multiple different impairments (for example, packet loss and echo on the same call).
VQmon has been extensively tested to ensure that its quality scores correlate closely to those produced by subjective testing. The following diagram shows results from one test, in which listening quality (MOS-LQ) scores generated by VQmon were compared with MOS-LQ scores obtained from subjective Absolute Category Rating (ACR) test data, with each point on the graph representing one of 30 tested codecs. The points cluster tightly around a trend line that closely follows the 45-degree ideal, indicating a very high correlation (a coefficient of 0.9934) between VQmon’s derived scores and the ACR scores.
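For readers who want to see what such a correlation figure measures, this sketch computes a Pearson correlation coefficient between two lists of scores. The five data points are invented purely for illustration and are not Telchemy’s test data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-codec scores: subjective ACR MOS vs. an estimated MOS.
subjective = [2.1, 2.9, 3.4, 3.9, 4.2]
estimated = [2.0, 3.0, 3.5, 3.8, 4.3]
r = pearson(subjective, estimated)
print(f"correlation = {r:.3f}")
```

A value of 1.0 would mean the estimated scores rise and fall in perfect step with the subjective ones. Correlation alone does not show that the two sets agree in absolute terms, which is why the closeness of the trend line to the 45-degree ideal matters as well.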

Additional information on the accuracy of the VQmon algorithm, including comparison with ACR MOS, PESQ, and E Model test results, can be found here.
Finally, although this article has focused primarily on quality scores, it should be mentioned that VQmon reports an extensive set of performance metrics in addition to MOS and R-factor scores, including RTCP XR metrics. These include statistics for packet loss and discard, both overall and during burst and gap periods, and (in reports generated by VQmon-equipped endpoints) PLC and jitter buffer configuration information, delay statistics, and analog metrics including signal, noise, and echo levels. VQmon also generates a list of specific degradation factors for each call (such as packet loss/discard, delay, codec used, etc.) and the percentage of quality degradation attributable to each, which helps to facilitate troubleshooting.
Conclusion
Accuracy and consistency are paramount in call quality score calculation, if performance monitoring is to achieve its goal: pinning down where and when VoIP call quality suffers, why it’s suffering, and what prescription is needed to ease its pain. From an architectural standpoint, the best approach is a distributed performance management model, deploying endpoints equipped with reporting agent software together with strategically placed mid-stream passive probes.
As I’ve described, Telchemy’s VQmon technology offers a number of improvements over the traditional E Model for accurate quality score calculation. And because VQmon has already been widely adopted—with over 36 million units currently licensed by more than 90 equipment vendors, integrated into a range of probes, analyzers, test equipment, and IP phones and gateways—an extensive variety of products featuring the same consistent core technology is available for both enterprise and service provider VoIP performance management applications.
Over 30 test equipment vendors use VQmon for VoIP analysis, ensuring a high degree of consistency across the industry. For more information on VQmon and Telchemy’s line of embedded software and OEM products for performance management of VoIP, IPTV, and IP Videoconferencing, please refer to www.telchemy.com for more case studies and technology reviews.


