
The VoIP MOS Debacle (by J. Scott Haugdahl)

Author Profile - J. Scott Haugdahl is the founder and CTO of his new venture, Bitcricket. He is a former Chief Technology Officer at WildPackets and holds a degree in Computer Science from the University of Minnesota, Institute of Technology. He has extensive experience in the network analysis industry in the areas of speaking, writing, competitive analysis, on-site training and network troubleshooting, and expert systems design and implementation. His industry vertical expertise includes 802.11 wireless, VoIP, performance analysis and Apdex (Application Performance Index), and good old-fashioned protocol analyzer detective work (including forensics). His past entrepreneurial experience includes founding Net3 Group, where he wrote the industry's first analyzer-agnostic expert system. He continues writing his popular blog and can be reached by email at scott (at) bitcricket (dot) com.




In a nutshell, MOS stands for Mean Opinion Score, a rating system for voice transmission quality. Computer generated versions simulate how a group of real listeners would rate the quality of a call. MOS scores range from 1 to 5, where 1 is interpreted as unintelligible and 5 is considered perfect. As you can imagine, MOS is highly subjective in the ears of the beholder. That’s why there’s the big “O” for opinion in the middle of that acronym.

Apparently the same holds for the analyzer vendors applying their algorithms to VoIP streams. In fact, the only consistency about MOS as reported by various packet analysis tools is, you guessed it, inconsistency. The same packet trace run through two analyzers gave me MOS scores of 3.0 and 3.8 for the exact same call. Another call I thought sounded pretty good was rated 2.8. A call I thought was worse was rated 3.4. It's the sort of thing that can drive you nuts.

Factors affecting VoIP quality include the round trip delay between the talker and listener, some distortion in the CODEC (Coder-Decoder or Compressor-Decompressor), packet jitter, and lost packets. The CODEC, the component that translates analog voice to a digital stream and vice-versa, plays a relatively small role in the grand scheme of things.

For example, using the G.711 CODEC will produce better sound than G.723.1 at the expense of roughly 10x more bandwidth (64 Kbps vs. 6.3 Kbps, not including packet header overhead). How much does that bandwidth increase gain you in MOS? Roughly half a point. The point is to understand the quality tradeoffs (which should be reflected in the MOS) when using different CODECs.

Incidentally, the CODEC factor is a no-brainer for an analyzer to compute by simply looking at the payload type in the Real-time Transport Protocol (RTP) packet header, which carries the voice over UDP. If the payload type is dynamic, the analyzer will need to correlate it back to the SIP or H.323 signaling. For instance, 4.2 is about the best possible score using G.711 (a perfect 5 is not obtainable once digital compression and other factors are accounted for).
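To make that concrete, here is a minimal sketch of how an analyzer can pull the CODEC out of the RTP header. The static payload type numbers come from RFC 3551, anything in the dynamic range has to be resolved against the SIP/SDP or H.323 signaling, and the function name and packet bytes are my own illustration rather than any particular tool's code:

```python
# Static payload type assignments from RFC 3551 (a small subset).
STATIC_PAYLOAD_TYPES = {
    0: "G.711 mu-law (PCMU)",
    4: "G.723.1",
    8: "G.711 A-law (PCMA)",
    9: "G.722",
    18: "G.729",
}

def rtp_codec(rtp_payload: bytes) -> str:
    """Return the CODEC name for a raw RTP packet (the UDP payload)."""
    if len(rtp_payload) < 12:            # RTP header is at least 12 bytes
        raise ValueError("too short to be RTP")
    if rtp_payload[0] >> 6 != 2:         # RTP version is always 2
        raise ValueError("not RTP version 2")
    pt = rtp_payload[1] & 0x7F           # payload type: low 7 bits of byte 2
    if 96 <= pt <= 127:
        return f"dynamic payload type {pt} (look up in SIP/SDP signaling)"
    return STATIC_PAYLOAD_TYPES.get(pt, f"payload type {pt}")
```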

It’s important to understand that in a VoIP conversation between two parties, there are two completely independent streams, one in each direction. In fact, they do not even require the same CODEC. As such, it is best to measure and analyze the streams independently, especially the MOS scores. Single stream playback can come in handy for noise analysis. Use combined stream playback to get a feel for the effect of round trip delay on the conversation.

As an aside, being able to play back a VoIP call from a packet capture is a bit controversial. Some enterprises forbid it (and some countries even ban it) for obvious reasons. Law enforcement wants it for different reasons. I like it because you can place your analyzer on different segments in the call path and experience firsthand where the call quality starts to deteriorate. On the other hand, isn't that what we rely on MOS scores for?

Preventing packet jitter is the primary reason we tune our networks for VoIP QoS, giving voice priority as it passes through our routers and switches. Unlike a file transfer, whose instantaneous throughput can bounce all over the place as long as the average works out in the end, VoIP requires consistent delivery, just like a hospital IV drip.

The secret sauce in some VoIP handsets is a jitter buffer that can dynamically adjust to allow for network delay and variance in packet delivery. By delaying the entire conversation, say 40 milliseconds, we can smooth out the delivery of the packets to the handset, say every 20 milliseconds (typical for G.711). Packets delayed more than 40 ms from the previous packet are dropped at the receiver (I’m simplifying here, but you get the picture).
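Here is a rough sketch of that behavior, assuming a fixed 40 ms buffer and 20 ms G.711 packetization. The arrival times and the simple "compare against the scheduled playout deadline" rule are illustrative only; a real handset's adaptive buffer is more sophisticated (and, as noted, secret):

```python
PACKET_INTERVAL_MS = 20   # G.711 packetization interval
JITTER_BUFFER_MS = 40     # fixed playout delay added by the receiver

def playable(seq: int, arrival_ms: float) -> bool:
    """True if packet `seq` (0-based) arrives in time to be played out."""
    scheduled_ms = seq * PACKET_INTERVAL_MS        # when the sender emitted it
    deadline_ms = scheduled_ms + JITTER_BUFFER_MS  # latest useful arrival time
    return arrival_ms <= deadline_ms

# Example: packet 2 is scheduled at 40 ms; arriving at 95 ms it misses the
# 80 ms deadline and is dropped even though it was never "lost" on the wire.
for seq, arrival in enumerate([5, 28, 95, 62]):
    print(seq, "played" if playable(seq, arrival) else "dropped (late)")
```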

Since the jitter buffer adds to the overall delay, it's desirable to keep it under roughly 100 ms. A worst-case scenario is variable packet delivery on a high-latency network with a large jitter buffer to compensate. To avoid interrupting and stepping on each other, users resort to a walkie-talkie style of communication, akin to the delay on a bad cell phone call, especially when the phones are on different provider networks.

This brings us to lost packets.

Analyzers can certainly detect packets lost on the network, but what about packets dropped at the handset? Are they not as good as lost? They are indeed, and you will not see them factored into the MOS score unless the tool vendor attempts to emulate your handset's jitter buffer behavior, which is typically a closely guarded secret.

Furthermore, how the handset artificially bridges the audio gap left by dropped packets (packet loss concealment) also varies by model and, again, affects the perceived quality of the call to the listener. This concealment is likewise not accounted for in analyzer-generated MOS scores.

You may be wondering why I didn’t include echo as one of the factors affecting VoIP quality. I left it out since it is an analog component, just like a volume control on a headset. (Should volume somehow be factored into the score as well?)

A common source of echo is when VoIP bridges into the PSTN, our good old land line system. Other sources include poorly designed headsets that pick up crosstalk, for example the microphone faintly looping back what just emerged from the earphones. If there is echo in the VoIP stream, it is extremely difficult for analyzer algorithms to detect it and reduce the MOS score accordingly; I have yet to see a passive analysis tool do this effectively. It typically requires active analysis that generates voice streams with special end points in addition to the analysis tool.

As you can imagine, a passive packet capture tool reports its most accurate MOS score when capturing as close to the listener as possible, where the jitter and packet loss measurements reflect what the listener actually receives. To measure both sides at the same time, you'll need to deploy analyzers at each end. Mind you, the analysis needn't be done in real time. You could deploy TShark at the end points (in the case of capturing PC soft phone VoIP, for instance) and do the comparative MOS analysis later using your commercial product.

Or you could deploy a low-cost capture device and SPAN off the mirror port of the edge workgroup switch. SPANning a single user port to an analyzer co-located at the edge will give you reasonable millisecond timing. Do not get into RSPAN or resort to funneling an entire VLAN within a switch to the SPAN port, or your timing, and your MOS scores, could be thrown off. For more on the limits of SPAN, please take a look at Tim's article.

Perhaps the most accurate MOS scores are from the handsets themselves. Some models have embedded algorithms that compute MOS. Check with your vendor.

Speaking of handsets, most will also send out periodic RTP Control Protocol (RTCP) packets that include jitter and packet loss information (the loss counts include packets that arrived too late for the jitter buffer). TIP: Does your analyzer use the information provided by RTCP packets to improve the MOS score? Find out by comparing MOS scores for a VoIP stream with and without RTCP packets (apply a filter and recheck the score).
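For reference, the jitter value a handset puts into its RTCP receiver reports is the interarrival jitter estimator from RFC 3550, section 6.4.1. The sketch below shows the calculation; the timestamps are made up and assume an 8000 Hz G.711 clock (160 ticks per 20 ms packet):

```python
def rtcp_jitter(packets):
    """packets: iterable of (arrival_ts, rtp_ts) pairs in RTP timestamp units.
    Returns the running interarrival jitter estimate after the last packet."""
    jitter = 0.0
    prev = None
    for arrival, rtp in packets:
        if prev is not None:
            prev_arrival, prev_rtp = prev
            # D = difference in spacing between arrival times and RTP timestamps
            d = (arrival - prev_arrival) - (rtp - prev_rtp)
            # exponentially smoothed with gain 1/16, per the RFC
            jitter += (abs(d) - jitter) / 16.0
        prev = (arrival, rtp)
    return jitter

# The second packet arrives 40 ticks (5 ms at 8000 Hz) later than its RTP
# timestamp implies, nudging the jitter estimate upward.
print(rtcp_jitter([(0, 0), (200, 160), (320, 320)]))
```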

Even better are end nodes that support the newer RTCP XR eXtended Reports (RFC 3611), which contain additional metrics such as burst loss (consecutive packet loss), round trip delay, and MOS as reported by the handset (if supported). RTCP XR is not yet as widely supported as RTCP, so, again, check with your VoIP vendor.

I also like to see an analyzer report both the listening quality (MOS-LQ) and conversation quality (MOS-CQ). Networks with high latency (such as a VPN over long distance Internet) can maintain a good MOS-LQ score, sufficient for those corporate earnings update conference calls where most users are simply listening. On the other hand, for a highly interactive conversation you’ll want a good MOS-CQ score.

Put all of the above call quality factors into an algorithm and you have the MOS score. Another popular metric is the R factor, which is similar to a MOS score on a different scale (0 to 100, or higher for wide-band applications).
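For a rough sense of how the two scales relate, the conversion below is the widely published narrow-band mapping from an R value to an estimated MOS (from ITU-T G.107, discussed next); note that even a perfect R caps the estimated MOS at about 4.5:

```python
def r_to_mos(r: float) -> float:
    """Map a narrow-band R factor to an estimated MOS (ITU-T G.107)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

print(round(r_to_mos(93.2), 2))   # the default R0 of ~93.2 maps to roughly 4.4
print(round(r_to_mos(70.0), 2))   # ~3.6, a noticeably degraded call
```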

In developing their MOS score algorithms, most vendors start with the infamous ITU G.107 standard entitled “The E-model, a computational model for use in transmission planning” (the E-model actually produces an R value) and add their own proprietary methods thereafter. The E-model is shown in the figure below.


The E-model Reference (Source: ITU G.107)


Unfortunately, many of the metrics the E-model requires that deal with the mouth-to-ear path cannot be obtained by passive packet analysis alone and must be filled in with "typical" values. For example, the model includes factors such as sender and listener loudness ratings (Ds and Dr in the figure), equipment impairment (Ie in the figure, some of which can be factored in based on the CODEC), and even sender and listener room noise (Ps and Pr in the figure).

Yet another variable is the advantage factor (referred to as the “expectation factor A” in the figure), which has different values depending on the handset usage. For instance, a mobile call is not expected to be as robust as a stationary desk phone and the score needs to be adjusted accordingly.
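Pulling a few of these pieces together, here is a back-of-the-envelope sketch of the kind of simplified E-model calculation a passive tool might perform, not any vendor's actual algorithm. The delay term is the common Cole-Rosenbluth approximation, and the default Ie and Bpl numbers are placeholders; the real values for your CODEC come from ITU-T G.113. The result can be fed into the R-to-MOS conversion shown earlier:

```python
def delay_impairment(one_way_delay_ms: float) -> float:
    """Id: impairment from one-way mouth-to-ear delay (simplified)."""
    d = one_way_delay_ms
    id_ = 0.024 * d
    if d > 177.3:                    # interactivity starts to suffer here
        id_ += 0.11 * (d - 177.3)
    return id_

def effective_equipment_impairment(ie: float, loss_pct: float, bpl: float) -> float:
    """Ie_eff per G.107: intrinsic CODEC impairment inflated by packet loss.
    ie = CODEC impairment (G.113), bpl = CODEC's robustness to loss."""
    return ie + (95.0 - ie) * loss_pct / (loss_pct + bpl)

def r_factor(one_way_delay_ms: float, loss_pct: float,
             ie: float = 0.0, bpl: float = 25.0, advantage: float = 0.0) -> float:
    r0 = 93.2                        # default baseline with typical parameters
    return (r0
            - delay_impairment(one_way_delay_ms)
            - effective_equipment_impairment(ie, loss_pct, bpl)
            + advantage)

# Example: ~150 ms one-way delay and 1% loss on a G.711-like CODEC.
print(round(r_factor(150.0, 1.0), 1))
```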

But wait, there’s more!

There's another user factor, not covered by the E-model, that has nothing to do directly with VoIP quality but affects the MOS score. This has been referred to as the "recency" effect (a term I believe was first coined by Telchemy). Simply put, users tend to be forgiving of bad quality early in a call but less forgiving of bad quality later in the call. This somehow needs to be reflected in the MOS score: higher for the former and lower for the latter.

So just what are "good" and "bad" MOS scores? Unfortunately, it can be subjective depending on the vendor. A subjective, subjective score! It's best to rate the calls yourself under different real network conditions (load, time of day, location, etc.) and establish a baseline of MOS scores with your analyzer. As a general rule, a difference of more than 1 point represents a significant difference in quality, regardless of your tool (but confirm this). VoIP streams with scores of 4 or above should have superb voice quality; streams with scores under 3 will range from poor to fair.

The bottom line is that no two vendors use exactly the same algorithm to compute MOS unless they OEM the code from the same source. And believe me when I say from experience that some are more accurate than others. After all, it makes sense for competitive reasons to be as accurate as possible relative to the scores human listeners would give the same call. When selecting a tool, insist on test data to back up vendor claims: typically real call data captured under different scenarios and validated against human listeners by a legitimate third-party tester.

Above all, make yourself heard, and baseline and calibrate your MOS tools and user expectations against your own VoIP infrastructure.

Meanwhile, I strongly encourage VoIP tool vendors to be more forthcoming with details about how they compute their MOS score. Just saying that it's based on the E-model is inadequate. We need to see testing validation and, better yet, collaboration among vendors to provide a more consistent score that reflects reality. Perhaps a third party, not unlike the Wi-Fi Alliance, is willing to step forward and provide a MOS certification process. I'd love to see that. Then, I'd really love my tool.


Continue reading other posts from “The Network Guy” column by J. Scott Haugdahl »

