Definitions – Multimodal Chatbot

AI Definitions

A multimodal chatbot is a conversational agent that works with
multiple modalities, such as text, voice, facial expressions, and
body movement. It is designed to interact with multiple users at a
time and can perform tasks such as face detection, emotion
classification, tracking crowd movement through mobile phones, and
holding real-time conversations that guide users through nonlinear
stories and interactive games.

Multimodal AI, which these chatbots are built on, is a type of
artificial intelligence (AI) that can process, understand, and/or
generate outputs across more than one type of data. Examples of data
modalities include text, images, audio, and video. Most AI systems
today are unimodal: they are designed and built to work with one
type of data exclusively. In contrast, multimodal architectures can
integrate and process multiple modalities simultaneously and have
the potential to produce more than one type of output.
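
To make the unimodal/multimodal distinction concrete, the following
minimal Python sketch contrasts a text-only handler with one that
accepts any mix of text, image, and audio in a single turn. The
Message type and the respond_* functions are hypothetical
illustrations of the idea, not an existing library API.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Message:
        """One user turn; any combination of modalities may be present."""
        text: Optional[str] = None
        image_bytes: Optional[bytes] = None
        audio_bytes: Optional[bytes] = None

    def respond_unimodal(text: str) -> str:
        # A unimodal chatbot only ever sees text.
        return f"You said: {text}"

    def respond_multimodal(msg: Message) -> str:
        # A multimodal chatbot handles whatever modalities arrive in the same turn.
        parts = []
        if msg.text:
            parts.append(f"text ({len(msg.text)} chars)")
        if msg.image_bytes:
            parts.append(f"image ({len(msg.image_bytes)} bytes)")
        if msg.audio_bytes:
            parts.append(f"audio ({len(msg.audio_bytes)} bytes)")
        return ("Received " + ", ".join(parts)) if parts else "Received an empty turn"

    if __name__ == "__main__":
        print(respond_unimodal("hello"))
        print(respond_multimodal(Message(text="what is in this photo?",
                                         image_bytes=b"\x89PNG...")))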

For instance, a multimodal chatbot could embed GUI elements such as
buttons, menus, and images in the dialogue, improving accuracy and
user satisfaction and reducing the time to task completion; a sketch
of this pattern follows below. Multimodal chatbots can also process
different kinds of input, including images and sounds, which makes
them more versatile and powerful than traditional text-only
chatbots. A great multimodal experience is one that feels seamless,
switching easily between modalities and contexts.
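
As a rough illustration of mixing GUI elements into a dialogue turn,
the sketch below returns a text reply together with quick-reply
buttons, and changes its behavior when an image is attached. All
names here (Reply, handle_turn) are hypothetical and only show the
general shape of such an interaction, not any particular chatbot
framework.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Reply:
        """A multimodal response: written text plus optional GUI widgets."""
        text: str
        buttons: List[str] = field(default_factory=list)  # rendered as quick-reply chips
        image_url: Optional[str] = None                   # e.g. a chart or product photo

    def handle_turn(user_text: str, user_image: Optional[bytes] = None) -> Reply:
        # Hypothetical routing: an attached image changes both the answer
        # and the GUI controls offered back to the user.
        if user_image is not None:
            return Reply(
                text="I see you attached a photo. What should I do with it?",
                buttons=["Describe it", "Extract text", "Find similar products"],
            )
        return Reply(text=f"Sure, tell me more about: {user_text!r}",
                     buttons=["Start over"])

    if __name__ == "__main__":
        print(handle_turn("help me shop", user_image=b"\xff\xd8..."))

Returning structured replies like this lets the client render the
same turn as plain text, speech, or on-screen controls, which is one
way a single dialogue can stay seamless across contexts.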