Before reading

This article suits readers who understand IP network routing and switching and want to study Voice over IP (VoIP) technology.



Voice traffic has strict requirements in terms of delay, jitter and packet loss. Circuit switched telephone exchanges were able to fulfill those requirements and provide reliable and guaranteed voice services.

The need to carry voice traffic over data networks has driven an evolution in voice architecture design, both in signaling and call control and in media transport.

The big question is: How to integrate connection-oriented voice traffic in a connectionless IP network and still provide a reliable service?

VoIP protocols have been developed to answer this question.

Signaling protocols have been enhanced to address the call control requirements over IP networks. In addition, media transportation protocols have been designed to reliably transfer voice packets and effectively save bandwidth.

Recall that the Open Systems Interconnection (OSI) model is a conceptual model that characterizes and standardizes the communication functions of a telecommunication or computing system without regard to its underlying internal structure and technology. VoIP uses this model as well, so voice transmission is layered.




Delay: The maximum one-way delay between any Cisco Unified Communications Manager (UCM) servers for all priority Intra-Cluster Communication Signaling (ICCS) traffic should not exceed 40 ms, or 80 ms round-trip time (RTT).

Jitter: Jitter is the varying delay that packets incur through the network because of processing, queue, buffer, congestion, or path variation delay.

Bandwidth: Provision the correct amount of bandwidth between each server for the expected call volume, type of devices, and number of devices.

QoS: The network infrastructure relies on QoS engineering to provide consistent and predictable end-to-end levels of service for traffic. Neither QoS nor bandwidth alone is a solution. Rather, QoS-enabled bandwidth must be engineered into the network infrastructure.

Gatekeeper: Cisco gatekeepers are used to group gateways into logical zones and perform call routing between them. Gateways are responsible for edge routing decisions between the Public Switched Telephone Network (PSTN) and the H.323 network. Cisco gatekeepers handle the core call routing among devices in the H.323 network and provide centralized dial plan administration. Without a Cisco gatekeeper, explicit IP addresses for each terminating gateway would have to be configured at the originating gateway and matched to a Voice over IP (VoIP) dial-peer. With a Cisco gatekeeper, gateways query the gatekeeper when trying to establish VoIP calls with remote VoIP gateways.

A voice-switching gateway connects various analog and digital voice circuits. This functionality is equivalent to the operation of central office switches and PBXs in traditional telephony.

A VoIP gateway connects the traditional telephony network to the IP network. It converts the signaling and media transmission methods used on one side to the other side.

Cisco Unified Border Element (Cisco UBE) interconnects two IP networks. It terminates the signaling sessions and either passes through or terminates the media channels.

Analog Voice Port Interface

FXS: An FXS interface connects the router or access server to end-user equipment such as telephones, fax machines, or modems.

FXO: An FXO interface is used for trunk, or tie-line, connections to a PSTN CO or to a PBX that does not support E&M signaling (when the local telecommunications authority permits).  A standard RJ-11 modular telephone cable connects the FXO voice interface card to the PSTN or PBX through a telephone wall outlet.

E&M: Trunk circuits connect telephone switches to one another. They do not connect end-user equipment to the network.


Major Stages of Voice Processing in VoIP


For voice transmission over an IP network, the voice waveform must be sampled, quantized, encoded, optionally compressed, and then encapsulated in a VoIP packet.

The first four steps are performed by a digital signal processor (DSP) in the originating gateway and are detailed in the following section. The VoIP packets are then delivered to the destination gateway, and the voice information is retrieved from the packet. Finally, a DSP on the terminating gateway decodes the payload and reconstructs the waveform, reversing the process performed on the originating gateway.

VoIP components

The components are illustrated below:


The components shown are as follows:

Cisco Unified IP Phones: Provide an IP endpoint for voice communication.
Gatekeeper: Provides call admission control (CAC), bandwidth control and management, and address translation.
Gateway: Provides translation between VoIP and non-VoIP networks such as a public switched telephone network (PSTN). Gateways also provide physical access for local analog and digital voice devices such as telephones, fax machines, key sets, and PBXs.
Cisco Unified Border Element (Cisco UBE): Interconnects two VoIP networks. It acts as a proxy between signaling protocols and can be configured to provide proxy services to the media stream.
Multipoint control unit (MCU): Provides real-time connectivity for participants in multiple locations to attend the same videoconference or meeting.
Call agent: Provides call control for Cisco Unified IP Phones, CAC, bandwidth control and management, and address translation.
Application servers: Provide services such as voice-mail, unified messaging, interactive voice response (IVR), presence information, multimedia conferencing, and others.
Videoconference station: Provides access for end-user participation in videoconferencing. The videoconference station contains a video capture device for video input and a microphone for audio input. The user can view video streams and hear the audio that originates at a remote user station.


Sampling is a process that takes readings of the waveform amplitude at regular intervals, by a process called pulse-amplitude modulation (PAM). The output is a series of pulses that approximates the analog waveform. For the signal to be reconstructed with acceptable quality, the sampling rate must be at least twice the highest frequency in the input signal, which gives 8000 samples per second for a 4 kHz voice channel.
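As a minimal sketch of the sampling step, the following Python snippet takes readings of a waveform at regular intervals. The 1 kHz test tone and the function name sample_waveform are assumptions chosen for illustration; the rate of 8000 samples per second is the telephony rate used later in this article.

```python
import math

def sample_waveform(freq_hz, sample_rate_hz, duration_s):
    """Return amplitude readings of a sine tone taken at regular intervals."""
    count = round(sample_rate_hz * duration_s)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(count)]

# Assumed example: a 1 kHz tone sampled 8000 times per second for 1 ms.
samples = sample_waveform(1000, 8000, 0.001)
print(len(samples))                      # number of readings taken in 1 ms
print([round(s, 3) for s in samples])    # the PAM-style amplitude readings
```

Each value in the list is one amplitude reading; the later stages quantize and encode these readings.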


Quantization divides the range of amplitude values that are present in an analog signal sample into a set of discrete steps that are closest in value to the original analog signal. Each step is assigned a unique digital codeword. Quantization matches a PAM signal to a segmented scale. The scale measures the amplitude (height) of the PAM signal and assigns an integer number to define that amplitude.

The figure below shows quantization in action. In the example, the x-axis represents time, and the y-axis represents the voltage value. The output is a series of pulses that approximates the analog waveform.

The voltage range is divided into 16 segments (0 to 7 positive, and 0 to 7 negative). Starting with segment 0, each segment has coarser intervals than the previous segment, so quiet signals are quantized with finer steps than loud ones. This keeps the signal-to-noise ratio (SNR) roughly uniform across the amplitude range. This segmentation also corresponds closely to the logarithmic behavior of the human ear.

The two principal schemes for generating these samples in electronic communication are a-law and mu-law. a-law and mu-law are audio compression schemes, defined by ITU-T G.711, that compress 16-bit linear PCM data down to 8 bits of logarithmic data. The a-law standard is primarily used in Europe and the rest of the world, while mu-law is used in North America and Japan.
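The logarithmic compression can be illustrated with the idealized, continuous mu-law curve, F(x) = sgn(x) * ln(1 + mu * |x|) / ln(1 + mu), with mu = 255. The sketch below implements only that curve for samples normalized to the range -1 to 1. It is not the segmented G.711 encoder, and the function names are illustrative.

```python
import math

MU = 255  # mu-law parameter for the 8-bit scheme

def mu_law_compress(x):
    """Map a normalized sample in [-1, 1] onto the logarithmic mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress: recover the normalized linear sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# Quiet samples are stretched over a large part of the output range,
# while loud samples are squeezed together.
for x in (0.01, 0.1, 0.5, 1.0):
    print(x, round(mu_law_compress(x), 3))
```

Because small inputs occupy a large share of the output range, quantizing the compressed value uniformly gives quiet speech finer steps than loud speech.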

The similarities between mu-law and a-law include the following:
Both are linear approximations of the logarithmic input/output relationship.
Both are implemented using 8-bit codewords (256 levels, one for each quantization interval). Eight-bit codewords allow for a bit rate of 64 kbps. This is calculated by multiplying the sampling rate (twice the input frequency) by the size of the codeword (2 * 4 kHz * 8 bits = 64 kbps).
Both break a dynamic range into a total of 16 segments:

Eight positive and eight negative segments.
Each segment is twice the length of the preceding one.
Uniform quantization is used within each segment.

Both use a similar approach to coding the 8-bit word:

First bit (MSB) identifies polarity.
Bits two, three, and four identify segment.
Final four bits quantize the segment.
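The 8-bit word layout described above can be sketched as a small decoder that splits a codeword into its three fields. The function name is hypothetical, and the sketch ignores the additional bit inversions that a-law and mu-law apply to the byte before transmission.

```python
def split_codeword(codeword):
    """Split an 8-bit codeword into (polarity, segment, interval) fields."""
    if not 0 <= codeword <= 0xFF:
        raise ValueError("codeword must fit in 8 bits")
    polarity = (codeword >> 7) & 0x1   # first bit (MSB) identifies polarity
    segment = (codeword >> 4) & 0x7    # bits two, three, and four: segment
    interval = codeword & 0xF          # final four bits: step in the segment
    return polarity, segment, interval

print(split_codeword(0b10110101))  # (1, 3, 5)
```

Two polarities, eight segments, and 16 intervals per segment give the 2 * 8 * 16 = 256 levels mentioned above.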

The differences between mu-law and a-law include the following:
Different linear approximations lead to different lengths and slopes.
The numerical assignment of the bit positions in the 8-bit codeword to segments and the quantization levels within segments are different.
a-law provides a greater dynamic range than mu-law.
mu-law provides better signal-distortion performance for low-volume signals than a-law.
a-law requires 13 bits for a uniform PCM equivalent, while mu-law requires 14 bits for a uniform PCM equivalent.
An international connection must use a-law, and mu-law to a-law conversion is the responsibility of the mu-law country.


Coding converts an integer base-10 number to a binary number. The output of coding is a binary expression in which each bit is either a 1 (pulse) or a 0 (no pulse). After PAM samples an input analog voice signal, the next step is to encode these samples in preparation for transmission over a telephony network. This process is called pulse-code modulation (PCM).

The PCM process, as shown in Figure 2-5, mathematically converts the value obtained from PAM sampling to another binary value within the range –127 to +127. It is at this stage that companding, the process of first compressing an analog signal at the source and then expanding this signal back to its original size when it reaches its destination, is applied. This entire process is generally referred to as PCM coding. A DSP, which is a specialized chip, quickly performs the PCM process.

In the United States, Canada, and Japan, mu-law is used. The rest of the world uses a-law. Both mu-law and a-law companding produce PCM values in the range of –127 to +127. Both mu-law and a-law represent a positive sign value with a value of 1, and a negative sign value with a value of 0. This representation is a departure from the “normal” computational use where positive is usually represented by 0.

Of the two methods, a-law appears to be the more logical method, because a PCM value of +127 is represented as 11111111; in other words, a positive sign value (the first bit) followed by a binary value of 127 composed of the segment and interval bits. Similarly, –32 is represented as 00100000. Mu-law operates a bit differently by logically inverting the segment and interval bits. Using mu-law companding, the value of +127 becomes 10000000; in other words, a positive sign value (the first bit) followed by the bit inverse of +127.
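The representation described in this paragraph can be reproduced with a few lines of Python. This is a sketch of the simplified description given here (a sign bit followed by seven segment and interval bits, inverted for mu-law), not a bit-exact G.711 encoder, and the function name is hypothetical.

```python
def pcm_codeword(value, law):
    """Encode a PCM value in -127..+127 as the text above describes it."""
    if not -127 <= value <= 127:
        raise ValueError("PCM value must be in the range -127 to +127")
    if law not in ("a-law", "mu-law"):
        raise ValueError("law must be 'a-law' or 'mu-law'")
    sign = 1 if value >= 0 else 0    # positive is 1, negative is 0
    magnitude = abs(value)           # seven segment and interval bits
    if law == "mu-law":
        magnitude ^= 0x7F            # mu-law inverts those seven bits
    return format((sign << 7) | magnitude, "08b")

print(pcm_codeword(127, "a-law"))    # 11111111
print(pcm_codeword(-32, "a-law"))    # 00100000
print(pcm_codeword(127, "mu-law"))   # 10000000
```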

Note: When a mu-law country connects with an a-law country, the mu-law end must convert its signal.

Uncompressed digital speech signals are sampled at a rate of 8000 samples per second, with each sample consisting of 8 bits. This corresponds to 64 kbps per call. Multiple algorithms have been developed to allow voice transmission at lower bandwidth consumption. The most common coder-decoder (codec) algorithms are presented in the table below together with their bandwidth.

VoIP Packetization

After the voice waveform is digitized, the DSP collects the digitized data for an amount of time until there is enough data to fill the payload of a single packet.

The example in the figure below shows how PCM samples are packaged into the payload of a single packet using the G.711 codec. With G.711, either 20 ms or 30 ms worth of voice is transmitted in a single packet. At 8000 samples per second, 10 ms corresponds to 80 samples, so 20 ms worth of voice corresponds to 160 samples. With 20 ms worth of voice per packet, 50 VoIP packets are transmitted in each direction in 1 second (1 sec / 20 ms = 50). Similarly, 30 ms worth of voice corresponds to 240 samples, and approximately 33 VoIP packets are transmitted in each direction in 1 second (1 sec / 30 ms = 33.3).
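The arithmetic in this paragraph can be checked with a short Python sketch. The function and constant names are illustrative; the only inputs are the 8000 samples per second of G.711 and the amount of voice carried per packet.

```python
SAMPLE_RATE = 8000  # samples per second; G.711 uses one byte per sample

def packetization(ms_per_packet):
    """Return (samples per packet, packets per second) for one direction."""
    samples = SAMPLE_RATE * ms_per_packet // 1000
    packets_per_second = 1000 / ms_per_packet
    return samples, packets_per_second

for ms in (20, 30):
    samples, pps = packetization(ms)
    print(f"{ms} ms: {samples} samples, {pps:.1f} packets per second")
```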

Packetization Rate

The length of voice information carried in a single packet determines the payload size, which Table 2-4 refers to as the size of collected G.711 samples for a single packet. Before the payloads are transmitted over the IP network, they must be encapsulated in packets, which adds overhead from the headers of Open Systems Interconnection (OSI) Layer 3 and above. These headers consume bandwidth in addition to the 64 kbps required for raw voice transmission, according to the formula:

Bandwidth_per_call = (Voice_payload + Layer3_overhead + Layer2_overhead) * Packet_rate * 8 bits/byte
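The formula can be evaluated with a short Python sketch. It assumes uncompressed IPv4, UDP, and RTP headers of 20, 8, and 12 bytes. The layer2_bytes parameter defaults to 0, which corresponds to the Layer 3+ figures in this section, and the function name is illustrative.

```python
IP_UDP_RTP_BYTES = 20 + 8 + 12  # assumed IPv4 + UDP + RTP headers, uncompressed

def bandwidth_kbps(payload_bytes, ms_per_packet, layer2_bytes=0):
    """Per-call bandwidth in kbps for one direction."""
    packet_rate = 1000 / ms_per_packet                    # packets per second
    packet_bytes = payload_bytes + IP_UDP_RTP_BYTES + layer2_bytes
    return packet_bytes * packet_rate * 8 / 1000          # 8 bits per byte

for ms in (20, 30, 40, 60, 80):
    payload = 8 * ms  # G.711 produces 8 bytes per millisecond of voice
    print(f"{ms} ms: {bandwidth_kbps(payload, ms):.1f} kbps")
```

Passing a Layer 2 overhead, for example an assumed 18 bytes for an Ethernet header and trailer, raises the 20-ms figure from 80 kbps to 87.2 kbps.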

The bandwidth overhead depends on packet rate as shown below:

Voice length per packet | 20 ms | 30 ms | 40 ms | 60 ms | 80 ms
Packetization rate | 50 pps | 33.3 pps | 25 pps | 16.7 pps | 12.5 pps
Size of collected G.711 samples for a single packet | 160 bytes | 240 bytes | 320 bytes | 480 bytes | 640 bytes
Raw voice bandwidth | 64 kbps | 64 kbps | 64 kbps | 64 kbps | 64 kbps
Layer 3+ uncompressed VoIP bandwidth | 80 kbps | 74.7 kbps | 72 kbps | 69.3 kbps | 68 kbps

Codec operations

G.729 is presented in this example. The DSP samples, quantizes, and encodes the analog waveform at the input. The DSP generates one codeword for each 10 ms worth of voice. The codewords are encapsulated in the payload of VoIP packets. A single VoIP packet carries by default 20 ms worth of audio, encapsulating two G.729 codewords in one payload. Another supported packetization rate is 30 ms, in which the VoIP packets are generated every 30 ms and carry three G.729 codewords in each packet.
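As a rough sketch of the G.729 packetization described above, the snippet below counts the codewords and payload bytes per packet. The codeword size of 10 bytes is an assumption that follows from the 8 kbps rate of G.729, which is not stated in this article; the names are illustrative.

```python
G729_FRAME_MS = 10      # one codeword per 10 ms of voice
G729_FRAME_BYTES = 10   # assumption: 8 kbps * 10 ms = 80 bits = 10 bytes

def g729_payload(ms_per_packet):
    """Return (codewords per packet, payload bytes per packet)."""
    if ms_per_packet % G729_FRAME_MS:
        raise ValueError("packet length must be a multiple of 10 ms")
    codewords = ms_per_packet // G729_FRAME_MS
    return codewords, codewords * G729_FRAME_BYTES

print(g729_payload(20))  # (2, 20): default packetization
print(g729_payload(30))  # (3, 30)
```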


VoIP media transmission

In a VoIP network, the actual voice conversations are transported across the transmission media using RTP and RTCP, or their derivatives, SRTP and cRTP. RTP defines a standardized packet format for delivering audio and video over the Internet. RTCP is a companion protocol to RTP, and provides for the delivery of control information for individual RTP streams.

cRTP and SRTP were developed to enhance the use of RTP.

Datagram protocols, such as UDP, send the media stream as a series of small packets. This is simple and efficient; however, packets can be lost or corrupted in transit. Depending on the protocol and the extent of the loss, the client might be able to recover the data with error correction techniques, might interpolate over the missing data, or might suffer a data dropout. RTP and RTCP were specifically designed to stream media over networks. They are both built on top of UDP.

The following lists the primary protocols involved in voice media transmission:

Real-time Transport Protocol (RTP): Delivers the actual audio and video streams over networks
Real-time Transport Control Protocol (RTCP): Provides out-of-band control information for an RTP flow
Compressed RTP (cRTP): Compresses IP/UDP/RTP headers on low-speed serial links
Secure RTP (SRTP): Provides encryption, message authentication and integrity, and replay protection to RTP

Real-Time Transport Protocol

RTP, described in RFC 3550, defines a standardized packet format for delivering audio and video over an IP network.

RTP typically runs on top of UDP so that it can use the multiplexing and checksum services of that protocol. RTP applications are typically sensitive to delays; so, UDP is a better choice than the more complex TCP. RTP does not have a standard port on which it communicates.

The only convention that it follows is that RTP uses an even-numbered UDP port, and the next higher odd-numbered port is used for RTCP. Although no ports are formally assigned, RTP commonly uses ports 16384 to 32767. Because RTP uses a dynamic port range, it is difficult for it to traverse firewalls.
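The even and odd port convention can be sketched as follows. The function names are hypothetical, and the default range is simply the commonly used range mentioned above, not a reserved assignment.

```python
def rtcp_port_for(rtp_port):
    """Return the RTCP port that accompanies an even-numbered RTP port."""
    if rtp_port % 2:
        raise ValueError("RTP is expected on an even-numbered port")
    return rtp_port + 1

def rtp_port_pairs(first=16384, last=32767):
    """Yield (RTP port, RTCP port) pairs within the given UDP port range."""
    start = first + (first % 2)          # move up to the first even port
    for port in range(start, last, 2):
        yield port, port + 1

print(rtcp_port_for(16384))              # 16385
print(next(rtp_port_pairs()))            # (16384, 16385)
```

A firewall rule that admits voice traffic has to cover the whole range, which is the traversal difficulty described above.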

The functions of RTP include the following:

Payload type identification, which identifies the type of payload carried in the packet, such as codec, or media format. This identifier allows the changing of codecs and data formats while the flow is active, as is the case with fax and modem pass-through.
Sequence numbering, which monitors the sequence of arriving packets and is primarily used to detect packet loss. RTP does not request retransmission if a packet is lost.
Time stamping, which is necessary to place the arriving packets in the correct timing order. The dejitter buffer evaluates this parameter when compensating the variable path delay.
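The three functions listed above map onto fields of the fixed 12-byte RTP header defined in RFC 3550. The following sketch decodes those fields from raw bytes. The sample packet is made up for illustration, and header extensions and CSRC lists are ignored.

```python
import struct

def parse_rtp_header(packet):
    """Decode the fixed 12-byte RTP header from the start of a packet."""
    if len(packet) < 12:
        raise ValueError("an RTP packet is at least 12 bytes long")
    first, second, sequence, timestamp, ssrc = struct.unpack(
        "!BBHII", packet[:12])
    return {
        "version": first >> 6,            # top two bits of the first byte
        "marker": second >> 7,
        "payload_type": second & 0x7F,    # identifies the codec or format
        "sequence": sequence,             # used to detect packet loss
        "timestamp": timestamp,           # used by the dejitter buffer
        "ssrc": ssrc,                     # identifies the stream source
    }

# Hypothetical packet: version 2, payload type 0, 160 bytes of payload.
header = struct.pack("!BBHII", 0x80, 0, 4711, 160, 0x1234ABCD)
fields = parse_rtp_header(header + bytes(160))
print(fields)
```

A receiver that sees the sequence number jump by more than one knows that packets were lost, but, as noted above, RTP itself does not request retransmission.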

RTP supports both unicast and multicast transmission. In addition to the roles of sender and receiver, RTP also defines the roles of translator and mixer to support the multicast requirements.

Voice Gateway call legs

A voice call over a packet or traditional telephony network is segmented into discrete call legs. When a gateway receives a call setup, it performs a routing decision and sends the call setup request to the next device. The incoming part of the call is referred to as the incoming call leg and the outgoing part of the call is referred to as the outgoing call leg.

On Cisco IOS routers, the call legs are associated with dial peers. One dial peer corresponds to one call leg. A call leg is a logical connection between two gateways (routers) or between a gateway and a telephony device. If the gateway receives or forwards the call over an analog or digital voice circuit, the corresponding call leg is referred to as POTS. If the gateway receives or forwards the call over an IP interface, the corresponding call leg is referred to as VoIP.

The call legs are relevant for call routing. Before a gateway makes the call-routing decision, it must apply the settings defined in the incoming call leg. In the case of POTS incoming call legs, these parameters define how the gateway collects the dialed digits and optional applications. In the case of VoIP incoming call legs, these parameters describe the voice transmission methods, such as codec, voice activity detection (VAD), and dual-tone multifrequency (DTMF)-related features. These parameters must be successfully negotiated between the local and preceding gateway before the call can be forwarded to the next gateway in the path.

Voice-switching Gateway

A voice-switching gateway, as depicted in the figure, has traditional telephony interfaces. Multiple call-signaling protocols exist, such as SS7, ISDN, Q Signaling (QSIG), and the analog signaling methods, including supervisory signaling (loop-start, ground-start, immediate-start, wink-start, delay-start), address signaling (pulse, DTMF), and informational signaling. The voice-switching gateway receives and forwards the call setup request over analog or digital voice circuits. The gateway might have to convert the call signaling and the voice format when the call traverses the gateway from one port to another. Both the incoming and the outgoing call legs are POTS call legs.

VoIP gateway

The gateway provides translation between VoIP and non-VoIP networks, such as the PSTN. It converts the signaling and voice signal between traditional telephony circuits and the VoIP transmission in an IP network.

One of the call legs is a POTS call leg, while the other is a VoIP call leg.

The VoIP terminating gateway has the VoIP incoming call leg and the POTS outgoing call leg. Both gateways must first successfully negotiate the VoIP parameters associated with their respective outgoing and incoming call legs before the VoIP terminating gateway can forward the call to the destination PSTN network.

Cisco Unified Border Element

Cisco Unified Border Element, as illustrated in the figure below, forwards an incoming VoIP call as another, outgoing VoIP call. It receives a call setup request, negotiates parameters, and forwards the call setup request to the next gateway. The incoming signaling protocol might differ from the outgoing signaling protocol. When the call is successfully signaled end to end, Cisco UBE might either proxy the media channel, which is referred to as flow-through, or let the media channel flow directly between the endpoints and bypass the gateway, which is referred to as flow-around.

The media proxy function is necessary when the VoIP traffic parameters of the incoming call leg differ from the VoIP parameters of the outgoing call leg. When Cisco UBE proxies the media channel, it changes the IP addresses of the media packets. This feature is very useful for security or connectivity reasons. Both call legs of a Cisco UBE are VoIP call legs.
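The relationship between gateway roles and call-leg types described in the last three sections can be summarized in a small Python sketch. The interface labels "voice-circuit" and "ip" and the function names are hypothetical and serve only to model the rule that voice circuits produce POTS call legs and IP interfaces produce VoIP call legs.

```python
def leg_type(interface):
    """POTS for analog or digital voice circuits, VoIP for IP interfaces."""
    return {"voice-circuit": "POTS", "ip": "VoIP"}[interface]

def gateway_role(incoming_interface, outgoing_interface):
    """Return the (incoming, outgoing) call-leg types and the gateway role."""
    legs = (leg_type(incoming_interface), leg_type(outgoing_interface))
    if legs == ("POTS", "POTS"):
        role = "voice-switching gateway"
    elif legs == ("VoIP", "VoIP"):
        role = "Cisco UBE"
    else:
        role = "VoIP gateway"
    return legs, role

print(gateway_role("voice-circuit", "ip"))   # (('POTS', 'VoIP'), 'VoIP gateway')
print(gateway_role("ip", "ip"))              # (('VoIP', 'VoIP'), 'Cisco UBE')
```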



How Voice Gateways Route calls

A comparison of IP packet routing and call-routing is shown below:

IP routing | Call routing
Static or dynamic | Only static
IP routing table | Dial plan
IP route | Dial peer
Hop-by-hop routing, where each router makes an independent decision | Inbound and outbound call legs, where the gateway negotiates VoIP parameters with preceding and next gateways before a call is forwarded
Destination-based routing | Called number, matched by destination pattern, is one of many selection criteria
Longest-match rule | The longest-match rule applies to a dial peer's destination pattern
Equal paths | Preference can be applied to equal dial peers, or a random selection is made if all criteria are the same
Default route | Possible to have a default route, which often points to a gatekeeper

Dial Peers

Dial peers are essential to implementing dial plans and providing voice services over an IP packet network. Dial peers are used to identify call source and destination endpoints and to define the characteristics that are applied to each call leg in the call connection.

Type of Dial Peer | Description
Plain old telephone service (POTS) | Maps a dial string to a specific voice port on the local gateway. The voice port connects the gateway to the PSTN, PBX, or analog telephone.
VoIP | Points to the IP address or DNS name of the destination VoIP device that terminates the call. This mapping applies to VoIP protocols, such as H.323 and SIP.
Multimedia Mail over IP (MMoIP) | The dial peer is mapped to the email address of the SMTP server. This type of dial peer is used for store-and-forward fax (on-ramp and off-ramp faxing).


POTS dial peer:

In the figure below, an analog telephone is connected to the Cisco Unified Communications gateway. The gateway needs two dial peers. The POTS dial-peer configuration includes at least the telephone number of the analog telephone and the voice port to which it is attached. Based on this information, the gateway forwards calls destined to the defined telephone over the specified port.

To successfully forward calls in both directions, at least these call-routing elements are needed in every voice-processing system:

An appropriate POTS dial peer that specifies the voice port to which the telephone is attached. This applies only to the edge voice-processing systems (usually a phone).
An appropriate VoIP dial peer that specifies the recipient destination address, or at least the address of the next hop.

Dial-peer parameters vary based on the dial-peer type.  A VoIP dial peer can point to either an H.323 or SIP device.

VoIP dial-peer parameters include coder-decoder (codec), quality of service (QoS), voice activity detection (VAD), dual-tone multifrequency (DTMF) relay, and fax rate.

Call legs

Call legs are router-centric. When an inbound call arrives on a gateway, the gateway finds the inbound dial peer and processes its settings. If the settings are acceptable, the gateway finds the outbound dial peer, establishes the outgoing call leg, and the call is switched from the incoming call leg to the outgoing call leg. You need to configure dial peers to enable call routing on a gateway.

Because dial peers collectively define where to forward calls, all dial peers together build a dial plan, which is equivalent to the IP routing table. The dial peers are static in nature.

Hop-by-hop call routing builds on the principle of call legs. Before a call-routing decision is made, the gateway must identify the inbound dial peer and process its parameters. This process might involve VoIP parameter negotiation.

The call-routing decision is the selection of the outbound dial peer. This selection is commonly based on the called number when the destination-pattern command is used. The selection might be based on other information, and those other criteria might have higher precedence than the called number. When the called number is matched to find the outbound dial peer, the longest-match rule applies.

If more than one dial peer equally matches the dial string, all the matching dial peers are used to form a rotary group. The router attempts to establish the outbound call leg using all the dial peers in the rotary group until one is successful. The selection order within the group can be influenced by configuring a preference value.
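The longest-match rule and the rotary group can be illustrated with a simplified Python model. This is a sketch under stated assumptions, not the matching algorithm of Cisco IOS: patterns contain only digits and "." as a single-digit wildcard, a pattern must have the same length as the called number, and a lower preference value is tried first. The DialPeer class, the function names, and the sample dial peers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DialPeer:
    tag: int
    destination_pattern: str   # digits plus "." as a single-digit wildcard
    preference: int = 0        # assumption: lower value is tried first

def explicit_digits(pattern, number):
    """Count explicitly matched digits, or return None if there is no match."""
    if len(pattern) != len(number):
        return None
    count = 0
    for pattern_char, digit in zip(pattern, number):
        if pattern_char == ".":
            continue               # wildcard matches any digit
        if pattern_char != digit:
            return None
        count += 1
    return count

def rotary_group(dial_peers, called_number):
    """Return the best-matching dial peers in the order they are attempted."""
    scored = []
    for peer in dial_peers:
        score = explicit_digits(peer.destination_pattern, called_number)
        if score is not None:
            scored.append((score, peer))
    if not scored:
        return []
    best = max(score for score, _ in scored)     # longest-match rule
    group = [peer for score, peer in scored if score == best]
    return sorted(group, key=lambda peer: peer.preference)

peers = [
    DialPeer(1, "2..."),                  # broad match, loses to longer ones
    DialPeer(2, "20..", preference=1),    # hypothetical backup path
    DialPeer(3, "20..", preference=0),    # hypothetical preferred path
]
print([peer.tag for peer in rotary_group(peers, "2001")])  # [3, 2]
```

Dial peers 2 and 3 match two digits explicitly, so they beat dial peer 1 and form the rotary group; the preference value then decides that dial peer 3 is attempted first.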

A default call route can be configured using special characters when matching the number.

The VoIP gateway is often faced with the task of selecting the best path for a given destination number. Such a requirement arises when the preferred path goes through the IP WAN, and the backup PSTN path should be chosen when the IP WAN is either unavailable or lacks the needed bandwidth resources.

The figure below illustrates a scenario with two locations connected to the IP WAN and PSTN. When the call goes through the PSTN, its numbers (both calling and called) might have to be manipulated so that they are reachable within the PSTN network. Otherwise, the PSTN switches will not recognize the called number, and the call will fail.

The figure below illustrates the call legs that are processed on a gateway that receives a call from a locally attached telephone and originates a VoIP session. These call legs are created when the telephone (1001) attached to gateway R1 dials a telephone number in another location (2001). When a call arrives on R1, the gateway creates an inbound call leg that corresponds to the inbound dial peer, makes a routing decision by finding an outbound dial peer, and creates an outbound call leg by forwarding the call toward the destination. If the routing decision chooses the IP WAN, the outbound call leg will be VoIP; if the routing decision chooses the PSTN, the outbound call leg will be POTS.

The figure below illustrates the call legs that are processed on the gateway that terminates the VoIP session and forwards the call to the locally attached telephone with extension 2001. The inbound call leg is created when the call arrives either through the IP WAN or the PSTN network. The gateway makes the routing decision by selecting the outbound dial peer. The outbound call leg corresponds to a POTS dial peer that points to the voice port 1/0/0, where the recipient’s telephone is attached. The gateway signals an incoming call on that port, and the telephone rings.