We use several of our senses in day-to-day life, yet in our digital life we are still limited to one or two at most… but behind the scenes researchers are working on the answer: multimodal interaction. Multimodal interaction gives users the liberty to interact with a system through more than one human sense or channel (touch, speech, gesture, etc.). This is very different from the traditional unimodal interface, where all interaction happens through a single channel. Multimodal interaction has spread its wings to accommodate a wide gamut of users with a wide variety of profiles and ages. It has also addressed the issues of accessibility and universal use. The question that remains is how far we have come and where we go from here.
Introduction & need for multimodal interfaces
Remember December 9, 1968, when Douglas C. Engelbart gave the first public demonstration of a small boxy device used for pointing and clicking? It went on to be called a mouse. Much has changed since then: that little box has gained multiple buttons and a scroll wheel at its center. Lasers, Bluetooth, and other wireless technology now rule the domain, instead of the old mechanical wheel or ball dancing around and getting stuck every now and then. The revolution was driven by the need for different interaction techniques: as technology progressed, human beings had to convey data to the machine, and receive data from it, in multiple ways.
In the last decade or so, human-machine interaction has become more and more intricate. This is due to the large amount of information coming our way via multiple channels. Ever-growing knowledge and the complexity of interaction have increased the need for new ways of communicating and interacting. Researchers have approached the so-called “information deluge” by integrating perceptual capabilities such as speech, vision, and lip movement recognition into the system. This type of interaction has been termed multimodal interaction.
In the multimodal style of interaction, the system processes more than one input mode (such as speech, pen-based input, touch, or gestural information) in parallel. This has given birth to a new paradigm in the world of computing and has crossed the limits of traditional WIMP interfaces ((WIMP: Windows, Icons, Menus, Pointer. An interaction style developed at Xerox PARC and now used in all major UIs)). It has not only divided the information across multiple channels but has also allowed people to interact with the system in a more natural way, in which voice, gesture, and pen-based input play a big part. One of the simplest examples of such a system is a mobile phone, which can record and react to touch, voice, and tactile inputs.
Comparison with WIMP style interaction
Traditional WIMP interfaces rest on the basic premise that information flows in and out of the system through a single channel, or event stream. This event stream can take the form of input (mouse, keyboard, etc.), where the user enters data into the system and expects feedback, or of output (voice, visuals, etc.), where the system responds. Either way, the channel remains a single one and processes one piece of information at a time. For example, a typical WIMP interface today ignores information typed on the keyboard while a mouse button is held down.
This is very different from multimodal interaction, where the system has multiple event streams and channels and can process information arriving in parallel through various input modes, such as those described above. For example, in an IVR (interactive voice response) system, a user can either type or speak to navigate through the menu.
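The IVR case can be sketched in a few lines of Python. This is only an illustrative sketch, not a real IVR implementation: the menu contents and the names `KEYPAD_MENU`, `SPOKEN_MENU`, and `run_ivr` are assumptions made for the example. It shows two input sources (keypad and speech) feeding one shared event queue, so that either modality can select the same menu action.

```python
import queue
import threading

# Hypothetical menu: both modalities map onto the same set of actions.
KEYPAD_MENU = {"1": "billing", "2": "support"}
SPOKEN_MENU = {"billing": "billing", "support": "support"}


def keypad_source(events, keys):
    """Simulated keypad channel: pushes each pressed key as an event."""
    for key in keys:
        events.put(("keypad", key))


def speech_source(events, words):
    """Simulated speech channel: pushes each recognized word as an event."""
    for word in words:
        events.put(("speech", word))


def interpret(mode, value):
    """Translate an event from either modality into a menu action."""
    if mode == "keypad":
        return KEYPAD_MENU.get(value)
    if mode == "speech":
        return SPOKEN_MENU.get(value.lower())
    return None


def run_ivr(keys, words):
    """Run both input channels in parallel and collect the chosen actions."""
    events = queue.Queue()
    producers = [
        threading.Thread(target=keypad_source, args=(events, keys)),
        threading.Thread(target=speech_source, args=(events, words)),
    ]
    for thread in producers:
        thread.start()
    for thread in producers:
        thread.join()

    actions = []
    while not events.empty():
        mode, value = events.get()
        action = interpret(mode, value)
        if action is not None:  # unrecognized input is simply ignored
            actions.append((mode, action))
    return actions


if __name__ == "__main__":
    # Arrival order between the two channels is not guaranteed.
    print(run_ivr(["1"], ["Support"]))
```

The point of the design is that neither channel blocks the other: events from both modalities land in the same queue and are interpreted by the same logic, in contrast to the single event stream of a WIMP interface.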
The image on the left signifies a traditional unimodal system ((http://www.tribox.org/wp-content/uploads/2007/10/telephone-rotary.jpg)) whereas the one on the right ((http://www.stereoscopy.com/news/news-archive-8-2001.html)) denotes a multi-modal interaction.
Another noticeable difference is that while traditional WIMP interfaces reside on a single machine, multimodal systems are generally spread across multiple networks and systems, each of which performs a specific task such as speech processing or gesture recognition.
History & Recent advances in Multi-modal interaction
Richard A. Bolt (Architecture Machine Group, Massachusetts Institute of Technology) first introduced the concept of multimodal interaction in 1980. His research was called “Put that there”, and in it he demonstrated a system that processed speech and touch-pad pointing in parallel. One of the several examples he demonstrated was the command “Create a blue square there”, with the intended location of “there” indicated by a two-dimensional cursor mark on the screen.
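To make the idea of combining speech with pointing concrete, here is a minimal Python sketch. It is not Bolt's implementation: the function name `resolve_deictics`, the word list, and the rule of picking the pointer sample closest in time to the spoken word are all assumptions chosen for illustration.

```python
# Words whose meaning depends on where the user is pointing (assumed list).
DEICTIC_WORDS = {"there", "that", "here"}


def resolve_deictics(words, pointer_samples):
    """Attach a screen position to each deictic word in a spoken command.

    words: list of (time_in_seconds, word) from a speech recognizer.
    pointer_samples: list of (time_in_seconds, x, y) cursor positions.
    Returns a list of (word, (x, y)) pairs.
    """
    resolved = []
    for word_time, word in words:
        if word.lower() in DEICTIC_WORDS and pointer_samples:
            # Assumption: the user points at roughly the moment of speaking,
            # so take the pointer sample nearest in time to the word.
            _, x, y = min(pointer_samples, key=lambda s: abs(s[0] - word_time))
            resolved.append((word, (x, y)))
    return resolved


if __name__ == "__main__":
    command = [(0.0, "create"), (0.3, "a"), (0.5, "blue"),
               (0.8, "square"), (1.2, "there")]
    cursor = [(0.1, 10, 10), (1.1, 200, 150), (2.0, 50, 60)]
    print(resolve_deictics(command, cursor))
```

With the sample data above, “there” is spoken at 1.2 seconds and is paired with the cursor sample taken at 1.1 seconds, so the square would be placed at (200, 150). A real system would also have to cope with recognition errors and with pointing that leads or lags the speech.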
Since then, many advances have been made in the way systems process input. Simple mouse-based and pen-based input has been enriched by incorporating semantic information. Systems have become more complicated internally and now process two or more modes of input. In such systems, keyboard- and mouse-based interaction has given way to lip movement recognition, gesture recognition, and speech recognition.
One example of a multimodal system is a navigation system developed at AT&T Labs. ((http://www.research.att.com/~johnston/))
Uses of multimodal systems
The very nature of multimodal systems has expanded the world of computing to encompass a more diverse spectrum of users and less favorable usage conditions. This type of interaction also lets users choose which modality they want to use when interacting with a system. That choice is very helpful, because users have varied preferences as to how they wish to communicate. Apart from interaction preferences, multimodal interaction also accommodates a wide variety of individual differences, such as age group, skill set, mother tongue, cognitive style, and sensory impairment.
Example: A visually impaired person could use speech to communicate with the system. For faster responses, he or she could also use gestures.
Another very common example of multimodal interaction is using a mobile phone while driving. Taking one's eyes off the road ahead to dial a number on the phone can be very dangerous. Instead, the driver uses voice input to dial, either by saying a name already stored in the phone or by speaking out the whole number. The system also gives auditory feedback, so the driver does not have to look at the screen.
So what do we learn? How soon can we expect such interaction to be a part of our daily lives? Before that, are we even ready to adapt? Is the technology ready to make the interaction transparent? How does usability come into the picture? All these questions are being answered. Research is under way around the world to overcome the challenges involved.
The current state of research has not yet produced robust implementations of the recognition algorithms and techniques involved. Further development of such interaction also requires seamless behavior and synchronization between the input and output channels, so that the user does not end up confused. Another challenge is to make the technology as transparent as turning a knob in our living room. The more the user is “aware” of the interaction, the less productive and efficient the interaction becomes.