antecipate: Jingle media relaying

In an ideal Internet, each device would have a routable IP address, and all devices would be able to communicate end to end without any intermediaries except routers. In reality, most devices reach the Internet through a NAT (Network Address Translation) function in the border router. NAT makes it possible to connect multiple devices to the Internet using only one public IP address. On the other hand, it becomes impossible to initiate connections from the Internet towards those devices. Traversing NAT in both directions becomes an issue for point-to-point communications, and particularly for multimedia communications over RTP.

A device behind NAT does not know much about how it will be seen from the Internet: it only knows its own IP address and the ports on which the application runs. When communication with the Internet is established, the NAT function maps the IP:port combination of the device on the private NAT interface to a temporary public IP:port combination on the public interface connected to the Internet. Furthermore, RTP usually uses a random port, which means that users cannot simply open a fixed port on their NAT device for RTP.
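The mapping described above can be sketched as a small lookup table. This is only a toy model under my own assumptions: the class name `Nat`, the starting port 40000, and the documentation addresses are illustrative, and a real NAT also tracks protocols, timeouts, and remote endpoints.

```python
import itertools


class Nat:
    """Toy model of a NAT mapping table (hypothetical, not a packet processor)."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self._ports = itertools.count(first_port)  # next free public port
        self.outbound = {}  # (private_ip, private_port) -> public_port
        self.inbound = {}   # public_port -> (private_ip, private_port)

    def translate_outbound(self, private_ip, private_port):
        """Return the public (ip, port) used for traffic leaving the device."""
        key = (private_ip, private_port)
        if key not in self.outbound:
            # The mapping is created only when the device sends first.
            public_port = next(self._ports)
            self.outbound[key] = public_port
            self.inbound[public_port] = key
        return (self.public_ip, self.outbound[key])

    def translate_inbound(self, public_port):
        """Return the private (ip, port) for a public port, or None if unmapped."""
        return self.inbound.get(public_port)


nat = Nat("203.0.113.1")
print(nat.translate_inbound(40000))                    # no mapping yet
print(nat.translate_outbound("192.168.1.10", 5004))    # mapping created
print(nat.translate_inbound(40000))                    # now reachable
```

The point of the sketch is that an inbound packet has nowhere to go until the device behind NAT has sent something out, and that the device itself never sees the public port it was given.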

Media consists of one or more streams, which are negotiated by an associated signaling protocol such as SIP or Jingle. The signaling protocol allows devices to negotiate a set of common media by conveying information about the media streams, such as the address where the media will be received, codec types, bandwidth, etc. The problem is that the signaling conveys the private IP address of the device when it is behind NAT. There are two ways to solve this issue.

One is to use a protocol able to dynamically negotiate a communication path for the media, even after the initial session has been set up. ICE (Interactive Connectivity Establishment) is such a protocol: it allows devices to probe for multiple paths of communication by trying different ports and STUN techniques. With ICE support, devices have a good chance of handling point-to-point communication without any intermediary media relay. But ICE is still awaiting full specification, and therefore only experimental support is provided. In addition, depending on the type of NAT, the communication might not be established even when using ICE. In this case the second solution, a media relay proxy with a public Internet address, must be used.

To transparently establish a multimedia session through a media relay proxy, it is best to use a service that associates the media relay proxy with a signaling proxy. The media relay proxy does the actual RTP traffic forwarding between the parties involved in the conversation. Upon request from the signaling proxy, it allocates a socket for each media stream of a session. The signaling proxy then uses the media relay proxy's IP address and the socket's port to replace the original values in the signaling payload. For SIP this is achieved by modifying the SDP payload; for Jingle it requires changing the transport candidate. After this is done, the parties involved in the conversation will contact the media relay proxy, believing they are contacting the other party.
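For the SIP case, the rewrite can be sketched as a few lines of text processing. This is a minimal sketch under stated assumptions: the function name `rewrite_sdp` is hypothetical, the relay address and ports are assumed to have been allocated already by the media relay proxy, only IPv4 `c=` lines are handled, and the `o=` line is left untouched.

```python
def rewrite_sdp(sdp, relay_ip, relay_ports):
    """Point every IPv4 c= line at relay_ip and each m= line at the next relay port.

    relay_ports must hold one port per media stream, in the order of the m= lines.
    """
    ports = iter(relay_ports)
    out = []
    for line in sdp.splitlines():
        if line.startswith("c=IN IP4 "):
            # Connection address: where the other party will send media.
            line = "c=IN IP4 " + relay_ip
        elif line.startswith("m="):
            # m=<media> <port> <proto> <fmt...>: only the port is replaced.
            media, _old_port, rest = line[2:].split(" ", 2)
            line = "m={} {} {}".format(media, next(ports), rest)
        out.append(line)
    return "\r\n".join(out) + "\r\n"


offer = (
    "v=0\r\n"
    "o=alice 1 1 IN IP4 192.168.1.10\r\n"
    "s=-\r\n"
    "c=IN IP4 192.168.1.10\r\n"
    "t=0 0\r\n"
    "m=audio 49170 RTP/AVP 0\r\n"
)
print(rewrite_sdp(offer, "203.0.113.5", [50000]))
```

The codec list and every other line pass through unchanged, which is why the receiving party cannot tell that the address belongs to a relay rather than to its peer.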

This approach is needed because only the media relay proxy, sitting on the media path, is able to determine the addresses from which the media streams originate. This information is unknown when the signaling takes place, and can only be determined when the RTP streams actually start.

After the media relay proxy has allocated the sockets for each stream, it listens for an incoming packet from each of the two parties. Once these are received, the media relay proxy knows where the packets should be forwarded and can start relaying them between the parties. However, if one party has a public Internet address, the media relay proxy is able to send packets to it before receiving a packet from it, since the party's IP:port is already known. Because of this, it becomes possible to chain several media relay proxies between the parties.
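The forwarding rule just described can be sketched without any real sockets. The class `RelaySession`, the party labels "a" and "b", and the `known` parameter are my own illustrative names, not part of any existing media relay; the sketch also simply drops packets that arrive before the peer's address is known, where a real relay might choose to buffer them.

```python
class RelaySession:
    """Forwarding state for one media stream between two parties, "a" and "b"."""

    def __init__(self, known=None):
        # party -> (ip, port). Pre-filled only for a party whose public
        # address is already known from the signaling.
        self.addr = dict(known or {})

    def on_packet(self, side, source, payload):
        """Learn where `side` sends from; return (destination, payload) or None."""
        self.addr[side] = source  # learn or refresh the sender's public mapping
        peer = "b" if side == "a" else "a"
        destination = self.addr.get(peer)
        if destination is None:
            return None  # peer not heard from yet: nowhere to forward
        return (destination, payload)


session = RelaySession()
print(session.on_packet("a", ("198.51.100.7", 40000), b"rtp1"))  # peer unknown
print(session.on_packet("b", ("192.0.2.9", 61000), b"rtp2"))     # forwarded to a
print(session.on_packet("a", ("198.51.100.7", 40000), b"rtp3"))  # forwarded to b
```

Passing `known={"b": (ip, port)}` models the party with a public address: packets from "a" are forwarded immediately, which is the property that lets one relay sit in front of another.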

It is interesting to note that the media relay proxy solution is independent of the actual signaling protocol. Several solutions already exist for SIP, with the added complexity that SIP itself requires a NAT traversal solution when transported over UDP. I will describe how a media relay proxy can be implemented in XMPP using the Jingle signaling. As explained above, the media relay proxy must be on the public Internet. The easiest approach, in my opinion, is to implement the media relay proxy as a component of an XMPP server and install it in a DMZ. Doing so has the advantage of leveraging the trust relationship between the Jingle client and the XMPP server and extending it to the media relay proxy.

The XMPP server would have to be modified to route the incoming Jingle traffic through the component, which will in turn intercept and modify the Jingle transport negotiation payloads:

For raw UDP transport, the component will replace the original transport candidate using the IP:port of a newly created socket.
For an ICE transport, the component will create a new candidate using the IP:port of a newly created socket, and discard any other candidate offered directly by either party. As the proxy is always reachable, this connection will always be established.
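The two rewriting rules above can be sketched with the standard library XML parser. This is a rough sketch under several assumptions: the namespaces and candidate attribute names follow my reading of the Jingle Raw UDP and ICE-UDP transport specifications, the function name `relay_transport` is hypothetical, a single component per transport is assumed, and the attribute values given to the new ICE candidate (id, foundation, priority, type) are placeholders rather than values a real component should necessarily use.

```python
import xml.etree.ElementTree as ET

RAW_UDP = "urn:xmpp:jingle:transports:raw-udp:1"
ICE_UDP = "urn:xmpp:jingle:transports:ice-udp:1"


def relay_transport(transport, relay_ip, relay_port):
    """Rewrite one <transport/> element in place so it only points at the relay."""
    ns = transport.tag.partition("}")[0].lstrip("{")  # "{ns}transport" -> "ns"
    tag = "{%s}candidate" % ns
    if ns == RAW_UDP:
        # Raw UDP: keep the candidate, replace its address with the relay's.
        cand = transport.find(tag)
        if cand is not None:
            cand.set("ip", relay_ip)
            cand.set("port", str(relay_port))
    elif ns == ICE_UDP:
        # ICE: discard the clients' own candidates and offer only the relay.
        for cand in transport.findall(tag):
            transport.remove(cand)
        ET.SubElement(transport, tag, {
            "component": "1", "foundation": "1", "generation": "0",
            "id": "relay1", "ip": relay_ip, "port": str(relay_port),
            "priority": "1", "protocol": "udp", "type": "host",
        })
    return transport


raw = ET.fromstring(
    '<transport xmlns="urn:xmpp:jingle:transports:raw-udp:1">'
    '<candidate component="1" generation="0" id="c1" ip="10.1.1.104" port="13540"/>'
    '</transport>'
)
print(ET.tostring(relay_transport(raw, "203.0.113.5", 50000)).decode())
```

In a real component this function would be applied to each transport element of the intercepted Jingle stanzas, after asking the media relay for a fresh socket per stream.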

This example demonstrates that existing multimedia NAT traversal techniques can easily be adapted for Jingle, with the added advantage that the Jingle signaling itself is NAT and firewall friendly, which is not the case for SIP. This use case can also be extended by supporting both SIP and Jingle on the same media relay proxy component, to provide seamless media connectivity between SIP and Jingle clients.