Complex, did you say complex?
Peter St Andre has written an interesting sum-up of the current thinking on “multimedia features”, following his participation in the latest ITU H.325 workshop. It is an appreciable contribution to the true Jingle requirements. I nonetheless believe this sum-up captures only part of the landscape, and thus the XMPP community may miss some of the derived requirements. I will first address the most important perception expressed by this knowledgeable crowd at the workshop.
The major resistance to wide adoption of multimedia technologies lies in the “complexity” of the products proposed to the end user. Really? Let me list only a few of the habits that make this the case:
- use a technological viewpoint instead of solving the end user's issues,
- leave only the geeks in control,
- think ‘inside’ a particular media or a particular business and ignore the rest of the world,
- make one’s own specialized baby protocol the cure for everything,
- (add your own item) …
Nothing really new here, but this short list already explains why there is a widespread feeling of complexity. And since the proposed technology does not solve real-world issues, end users are not compelled to invest their time in overcoming this complexity. In a context of commoditized service, such as voice calls, the winner will be the one-button phone, not the average SIP soft-phone provided to you at a fee by the large telecom vendors (those who have tried to set up a SIP voice phone understand what I am talking about). One-button simplicity is what the original GTalk client has been good at. Anything more complex than this is bound to fail as a soft phone.
The second hyped concept put forward by the learned assembly is the quest for the “killer app” holy grail. This is stated in Peter’s point 6:
Imitating the PSTN (voice calling) is not enough -- we need true multimedia innovation (whiteboarding, collaborative editing, app sharing, etc.).
I can see several hurdles making this an unrealistic wish. First, the most natural way for a human being to communicate is voice. If you do not provide the one-button phone user experience, it is unlikely that the more complex applications will ever be tried. Once it has been provided, the technology winners will be the two-click web-conference service and the collaborative editor with presence quietly built in. I said technology winners on purpose. There is a widespread hype about the “value” of introducing these technologies in a corporate, remote-worker or individual environment, but no tangible data yet.
Second, if the underlying technological enablers are too complex and too monolithic, then the probability that the “killer app” will ever appear is greatly reduced. Just think of the web-conference example. You “just” need to combine SIP for the voice, part of the H.323 framework for the video, some proprietary protocol for the browser-space sharing, and some other one for the instant messaging chat feature. And this is just the tip of the iceberg.
When setting up the requirements to answer this complexity, I believe one must keep in mind that this technology is about communication. When defining a protocol to support a communication technology, a few functional facts really matter. The resulting protocol bundle:
- must ensure media interoperability and define a minimum set of mandatory capabilities and enablers. Today, for example, no codec is mandatory, which is simply wrong.
- must provide excellent discovery and feature negotiation mechanisms. Being able to negotiate capabilities is not enough; you also need a way to find the negotiating party.
- must ensure, through adequate mechanisms, the end-to-end trust level, auditability, privacy and accountability of the communication.
- must be extensible and aggregative. This is the usual reference to modularity. To reduce complexity, and enable the “killer app”, one needs building blocks that do not overlap. Today, the respective responsibilities of signaling, media definition and media transport in a SIP-based system are completely fuzzy. Part of the signaling is in SIP, part in RTCP. Part of the call control is in SIP, part in the SDP payload.
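To make the first two requirements concrete, here is a minimal sketch of what a mandatory baseline changes in a negotiation. Everything in it is an assumption for illustration: the function names are hypothetical, the codec names are placeholders, and nothing here comes from the Jingle or SIP specifications. I am not suggesting which codec should be mandatory, only showing the effect of having one.

```python
from typing import List, Optional, Sequence

# Hypothetical baseline: some codec should be mandatory, but which one is
# left open here. "PCMU" is only a placeholder for this sketch.
MANDATORY_CODECS = ("PCMU",)


def advertise(optional_codecs: Sequence[str]) -> List[str]:
    """Capabilities an endpoint announces: its own preferences, then the baseline."""
    caps = list(optional_codecs)
    for codec in MANDATORY_CODECS:
        if codec not in caps:
            caps.append(codec)
    return caps


def negotiate(offer: Sequence[str], answer: Sequence[str]) -> Optional[str]:
    """Return the first codec, in the offerer's preference order, that both sides support."""
    supported = set(answer)
    for codec in offer:
        if codec in supported:
            return codec
    return None


alice = advertise(["speex", "G729"])
bob = advertise(["iLBC"])
print(negotiate(alice, bob))                  # PCMU
print(negotiate(["speex", "G729"], ["iLBC"]))  # None
```

With the baseline, two endpoints that share none of their preferred codecs can still talk. Without it, the second call shows the negotiation simply failing, which is exactly the kind of "complexity" the end user ends up facing.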
As an illustration for the XMPP world, let me go back to the DTMF JEP for Jingle. This is the typical example of what I believe should be avoided. DTMF is only a way to use tones as commands for certain phone service actions. Wanting DTMF in Jingle falls directly under “imitating the PSTN”, and is the inevitable result of a “geek only” approach (don’t worry, I am also a geek sometimes). This provides an adaptation of a legacy technology, not a solution to an end user issue! End users do not want to know about DTMF; they only wish to interact with a service. If, for example, the service is a voice mail application, then the end user will want to manage their mailbox. In this case the proper approach is to leverage the XMPP flexible offline message retrieval extension, and not DTMF. That a Jingle-enabled voice mail proxy needs to convert these XMPP extension stanzas into DTMF to interact with a legacy tone-only voice mail implementation must be entirely transparent to the end user.
If the service is an IVR application, then the Jingle-enabled service will convert the VoiceXML instructions into a set of XMPP data extensions that will be presented to the end user for a richer experience. Again, DTMF is not needed in Jingle for this. All the complexity of dealing with legacy systems must be hidden in the proxy or service implementation, without making its way into the protocol.
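As a rough sketch of what "hidden in the proxy" could look like, the snippet below translates a user-level mailbox action into the digit string a legacy tone-only system might expect. The action names, the digit menu and the PIN convention are all hypothetical; real voice mail systems each have their own menus, and this is not taken from any XMPP extension.

```python
# Hypothetical tone menu of one legacy voice mail system; real systems differ.
LEGACY_MENU = {
    "play_next": "1",
    "delete_current": "7",
    "save_current": "9",
}


def to_dtmf(action: str, mailbox_pin: str = "") -> str:
    """Translate a user-level action into the digits the legacy system expects."""
    if action not in LEGACY_MENU:
        raise ValueError(f"unsupported action: {action}")
    if mailbox_pin and not mailbox_pin.isdigit():
        raise ValueError("PIN must contain digits only")
    # Assumed convention: a PIN, when given, is sent first and ended with '#'.
    prefix = f"{mailbox_pin}#" if mailbox_pin else ""
    return prefix + LEGACY_MENU[action]


print(to_dtmf("delete_current", mailbox_pin="1234"))  # 1234#7
```

The client only ever sends "delete_current"; the digits exist solely between the proxy and the legacy system, so nothing about tones has to appear in the protocol the end user's client speaks.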