ICMI '19: 2019 International Conference on Multimodal Interaction


SESSION: Keynote & Invited Talks

A Brief History of Intelligence

Intelligence is the deciding factor in how human beings became the most dominant life form on earth. Throughout history, human beings have developed tools and technologies that help civilizations evolve and grow. Computers, and by extension artificial intelligence (AI), have played important roles in that continuum of technologies. Recently, artificial intelligence has garnered much interest and discussion. Because artificial intelligence provides tools that can enhance human capability, a sound understanding of what the technology can and cannot do is necessary to ensure its appropriate use. While developing artificial intelligence, we have also found that the definition and understanding of our own human intelligence continue to evolve. Debate over the race between human and artificial intelligence has been ever growing. In this talk, I will describe the history of both artificial intelligence and human intelligence (HI). Drawing on the insights of these historical perspectives, I will illustrate how AI and HI co-evolve with each other and project the future of AI and HI.

Challenges of Multimodal Interaction in the Era of Human-Robot Coexistence

With the rapid progress in computing and sensory technologies, we will enter the era of human-robot coexistence in the not-too-distant future, and it is time to address the challenges of multimodal interaction. Should a robot take the form of a humanoid? Is it better for robots to behave as second-class citizens or as equal members of society alongside humans? Should communication between human and robot be symmetric, or is it acceptable for it to be asymmetric? And what about communication between robots in the presence of humans? What does emotional intelligence mean for robots? With physical interaction between human and robot being inevitable, how can safety be guaranteed? What is the ethical and moral model for robots, and how should they follow it?

Connecting Humans with Humans: Multimodal, Multilingual, Multiparty Mediation

Behind much of my research work over four decades has been the simple observation that people like people and love interacting with other people more than they like interacting with machines. Technologies that truly support such social desires are more likely to be adopted broadly. Consider email, texting, chat rooms, social media, video conferencing, the internet, speech translation, even videogames with a social element (e.g., Fortnite): we enjoy the technology whenever it brings us closer to our fellow humans, instead of imposing attention-grabbing clutter. If so, how then can we build better technologies that improve, encourage, and support human-human interaction? In this talk, I will recount my own story along this journey. When I began, building technologies for the human-human experience presented formidable challenges: computer interfaces would need to anticipate and understand the way humans interact, but in 1976, a typical computer had only two instructions for interacting with humans, character-in and character-out, and both supported only human-computer interaction. Over the decades that followed, we began to develop interfaces that can process the various modalities of human communication, and we built systems that used several modalities in service of improving human-human interaction.

In my talk, I will discuss the challenges of interpreting multimodal signals of human-human interaction in the wild. I will show the resulting human-human systems we developed and how we made them effective. Some went on to become services that affect the way we work and communicate today.

Socially-Aware User Interfaces: Can Genuine Sensitivity Be Learnt at all?

Recent years have initiated a paradigm shift from purely task-based human-machine interfaces towards socially-aware interaction. Advances in deep learning have led to anthropomorphic interfaces with robust sensing capabilities that come close to or even exceed human performance. In some cases, these interfaces may convey to humans the illusion of a sentient being that cares for them. At the same time, there is the risk that, at some point, these systems may have to reveal their lack of true comprehension of the situational context and the user’s needs, with serious consequences for user trust. The talk will discuss challenges that arise when designing multimodal interfaces that hide the underlying complexity from the user, but still demonstrate transparent and plausible behavior. It will argue for hybrid AI approaches that look beyond deep learning to encompass a theory of mind to obtain a better understanding of the rationale behind human behaviors.

SESSION: Session 1: Human Behavior

Multi-modal Active Learning From Human Data: A Deep Reinforcement Learning Approach

Human behavior expression and experience are inherently multimodal and characterized by vast individual and contextual heterogeneity. To achieve meaningful human-computer and human-robot interactions, multi-modal models of the user’s states (e.g., engagement) are therefore needed. Most existing works that build classifiers for the user’s states assume that the data used to train the models are fully labeled. Nevertheless, data labeling is costly and tedious, and also prone to subjective interpretations by the human coders. This is even more pronounced when the data are multi-modal (e.g., some users are more expressive with their facial expressions, some with their voice). Thus, building models that can accurately estimate the user’s states during an interaction is challenging. To tackle this, we propose a novel multi-modal active learning (AL) approach that uses deep reinforcement learning (RL) to find an optimal policy for active selection of the user data needed to train the target (modality-specific) models. We investigate different strategies for multi-modal data fusion and show that the proposed model-level fusion coupled with RL outperforms the feature-level and modality-specific models, naïve AL strategies such as random sampling, and standard heuristics such as uncertainty sampling. We show the benefits of this approach on the task of engagement estimation from real-world child-robot interactions during autism therapy. Importantly, we show that the proposed multi-modal AL approach can be used to efficiently personalize the engagement classifiers to the target user using a small amount of actively selected user data.
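As a rough illustration of the active-selection loop described above, the sketch below uses an epsilon-greedy rule over model uncertainty as a stand-in for the learned deep-RL policy, with synthetic fused features; all names, dimensions, and the budget are illustrative and not the authors' implementation.

```python
# Hypothetical sketch of active sample selection for personalizing an
# engagement classifier; shows the general AL loop only, not the paper's
# exact deep-RL policy or fusion architecture.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "target user" data: fused multimodal features + engagement labels.
X_all = rng.normal(size=(300, 16))
y_all = (X_all[:, :4].sum(axis=1) + 0.5 * rng.normal(size=300) > 0).astype(int)
X_stream, y_stream = X_all[:200], y_all[:200]   # streaming interaction data
X_val, y_val = X_all[200:], y_all[200:]         # held-out set (proxy for the RL reward)

# Small seed set containing both classes.
seed = np.concatenate([np.where(y_stream == 0)[0][:5], np.where(y_stream == 1)[0][:5]])
labeled_X, labeled_y = list(X_stream[seed]), list(y_stream[seed])
clf = LogisticRegression().fit(labeled_X, labeled_y)

budget, eps = 40, 0.1                           # labeling budget, exploration rate
pool = [i for i in range(len(y_stream)) if i not in set(seed)]
for i in pool:
    if budget == 0:
        break
    x, y = X_stream[i], y_stream[i]
    p = clf.predict_proba(x.reshape(1, -1))[0, 1]
    uncertain = abs(p - 0.5) < 0.15             # simple state summary: model uncertainty
    query = uncertain or rng.random() < eps     # epsilon-greedy stand-in for a learned policy
    if query:
        labeled_X.append(x); labeled_y.append(y); budget -= 1
        clf = LogisticRegression().fit(labeled_X, labeled_y)

print("personalized accuracy:", clf.score(X_val, y_val))
```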

Comparing Pedestrian Navigation Methods in Virtual Reality and Real Life

Mobile navigation apps are among the most used mobile applications and are often used as a baseline to evaluate new mobile navigation technologies in field studies. As field studies often introduce external factors that are hard to control for, we investigate how pedestrian navigation methods can be evaluated in virtual reality (VR). We present a study comparing navigation methods in real life (RL) and VR to evaluate whether VR environments are a viable alternative to RL environments for testing such methods. In a series of studies, participants navigated a real and a virtual environment using a paper map and a navigation app on a smartphone. We measured the differences in navigation performance, task load, and spatial knowledge acquisition between RL and VR. From these we formulate guidelines for improving pedestrian navigation systems in VR, such as improved legibility for small-screen devices. We furthermore discuss appropriate low-cost, low-space VR locomotion techniques as well as more controllable locomotion techniques.

Video and Text-Based Affect Analysis of Children in Play Therapy

Play therapy is an approach to psychotherapy in which a child engages in play activities. Because of the strong affective component of play, it provides a natural setting for analyzing the feelings and coping strategies of the child. In this paper, we investigate an approach to track the affective state of a child during a play therapy session. We assume a simple, camera-based sensor setup, and describe the challenges of this application scenario. We use fine-tuned off-the-shelf deep convolutional neural networks to process the child’s face during sessions and automatically extract valence and arousal dimensions of affect, as well as basic emotional expressions. We further investigate text-based and body-movement-based affect analysis. We evaluate these modalities separately and in conjunction on play therapy videos from natural sessions, discussing the results of the analysis and how it aligns with professional clinicians’ assessments.

Facial Expression Recognition via Relation-based Conditional Generative Adversarial Network

Recognizing emotions across varying human identities is very difficult. To address this problem, this paper proposes a relation-based conditional generative adversarial network (RcGAN), which recognizes facial expressions by using the difference (or relation) between a neutral face and an expressive face. The proposed method can recognize facial expressions or emotions independently of human identity. Experimental results show that the proposed method achieves higher accuracies than conventional methods, reaching 97.93% and 82.86% on the CK+ and MMI databases, respectively.

Continuous Emotion Recognition in Videos by Fusing Facial Expression, Head Pose and Eye Gaze

Continuous emotion recognition is of great significance in affective computing and human-computer interaction. Most existing methods for video-based continuous emotion recognition utilize facial expression. However, besides facial expression, other clues including head pose and eye gaze are also closely related to human emotion, yet have not been well explored in the continuous emotion recognition task. On the one hand, head pose and eye gaze can result in different degrees of credibility of facial expression features. On the other hand, head pose and eye gaze carry emotional clues themselves, which are complementary to facial expression. Accordingly, in this paper we propose two ways to incorporate these two clues into continuous emotion recognition: an attention mechanism based on head pose and eye gaze clues that guides the utilization of facial features, and an auxiliary line that helps extract more useful emotion information from head pose and eye gaze. Experiments are conducted on the RECOLA dataset, a database for continuous emotion recognition, and the results show that our framework outperforms other state-of-the-art methods thanks to its full use of head pose and eye gaze clues in addition to facial expression.
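A minimal PyTorch sketch of the gating idea, in which an attention weight computed from head pose and eye gaze modulates the facial features while an auxiliary branch carries pose/gaze information directly; the layer sizes and the valence/arousal head are assumptions, not the paper's architecture.

```python
# Illustrative sketch: pose/gaze-derived attention gates facial features;
# dimensions and the regression head are assumptions.
import torch
import torch.nn as nn

class PoseGazeAttentionFusion(nn.Module):
    def __init__(self, face_dim=256, pose_dim=3, gaze_dim=2, hidden=64):
        super().__init__()
        # attention: how much to trust the facial features given pose/gaze
        self.att = nn.Sequential(nn.Linear(pose_dim + gaze_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid())
        # auxiliary branch: pose/gaze carry emotional cues of their own
        self.aux = nn.Sequential(nn.Linear(pose_dim + gaze_dim, hidden), nn.ReLU())
        self.head = nn.Linear(face_dim + hidden, 2)  # valence, arousal

    def forward(self, face_feat, pose, gaze):
        pg = torch.cat([pose, gaze], dim=-1)
        w = self.att(pg)                             # credibility weight in (0, 1)
        fused = torch.cat([w * face_feat, self.aux(pg)], dim=-1)
        return self.head(fused)

model = PoseGazeAttentionFusion()
out = model(torch.randn(8, 256), torch.randn(8, 3), torch.randn(8, 2))
print(out.shape)  # torch.Size([8, 2])
```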

Effect of Feedback on Users’ Immediate Emotions: Analysis of Facial Expressions during a Simulated Target Detection Task

Safety-critical systems (e.g., UAV systems) often incorporate warning modules that alert users to imminent hazards (e.g., system failures). However, these warning systems are often imperfect and trigger false alarms, which can lead to negative emotions and affect subsequent system usage. Although various feedback mechanisms have been studied in the past to counter the possible negative effects of system errors, the effect of such feedback mechanisms and system errors on users’ immediate emotions and task performance is not clear. To investigate the influence of affective feedback on participants’ immediate emotions, we designed a 2 (warning reliability: high/low) × 2 (feedback: present/absent) between-group study in which participants interacted with a simulated UAV system to identify and neutralize enemy vehicles under time constraints. Task performance and participants’ facial expressions were analyzed. Results indicated that giving feedback decreased fear during the task, whereas warnings increased frustration in the high-reliability groups compared to the low-reliability groups. Finally, feedback was found not to affect task performance.

SESSION: Session 2: Artificial Agents

Multimodal Analysis and Estimation of Intimate Self-Disclosure

Self-disclosure to others has a proven benefit for one’s mental health, and disclosure to computers has been shown to be similarly beneficial for emotional and psychological well-being. In this paper, we analyze verbal and nonverbal behavior associated with self-disclosure in two datasets containing structured human-human and human-agent interviews from more than 200 participants. Correlation analysis revealed that linguistic features such as affective and cognitive content in verbal behavior, and nonverbal behavior such as head gestures, are associated with intimate self-disclosure. A multimodal deep neural network was developed to automatically estimate the level of intimate self-disclosure from verbal and nonverbal behavior. Among the modalities, verbal behavior was the best for estimating self-disclosure within-corpus, achieving r = 0.66. However, nonverbal behavior outperformed the language modality in the cross-corpus evaluation. Such automatic models can be deployed in interactive virtual agents or social robots to evaluate rapport and guide their conversational strategy.

A High-Fidelity Open Embodied Avatar with Lip Syncing and Expression Capabilities

Embodied avatars as virtual agents have many applications and provide benefits over disembodied agents, allowing nonverbal social and interactional cues to be leveraged, in a similar manner to how humans interact with each other. We present an open embodied avatar built upon the Unreal Engine that can be controlled via a simple Python programming interface. The avatar has lip syncing (phoneme control), head gesture and facial expression (using either facial action units or cardinal emotion categories) capabilities. We release code and models to illustrate how the avatar can be controlled like a puppet or used to create a simple conversational agent using public application programming interfaces (APIs). GitHub: https://github.com/danmcduff/AvatarSim

To React or not to React: End-to-End Visual Pose Forecasting for Personalized Avatar during Dyadic Conversations

Nonverbal behaviours such as gestures, facial expressions, body posture, and para-linguistic cues have been shown to complement or clarify verbal messages. Hence, to improve telepresence in the form of an avatar, it is important to model these behaviours, especially in dyadic interactions. Creating such personalized avatars not only requires modeling the intrapersonal dynamics between an avatar’s speech and body pose, but also the interpersonal dynamics with the interlocutor present in the conversation. In this paper, we introduce a neural architecture named Dyadic Residual-Attention Model (DRAM), which integrates intrapersonal (monadic) and interpersonal (dyadic) dynamics using selective attention to generate sequences of body pose conditioned on the audio and body pose of the interlocutor and the audio of the human operating the avatar. We evaluate our proposed model on dyadic conversational data consisting of the pose and audio of both participants, confirming the importance of adaptive attention between monadic and dyadic dynamics when predicting avatar pose. We also conduct a user study to analyze judgments of human observers. Our results confirm that the generated body pose is more natural and models intrapersonal and interpersonal dynamics better than non-adaptive monadic/dyadic models.

Multitask Prediction of Exchange-level Annotations for Multimodal Dialogue Systems

This paper presents multimodal computational modeling of three labels that are independently annotated per exchange, with the aim of implementing an adaptation mechanism for dialogue strategy in spoken dialogue systems based on recognizing user sentiment through multimodal signal processing. The three labels are (1) the user’s interest in the current topic, (2) the user’s sentiment, and (3) topic continuance, denoting whether the system should continue the current topic or change it. Predicting these three types of labels, which capture different aspects of the user’s sentiment level and the system’s next action, contributes to adopting a dialogue strategy based on the user’s sentiment. For this purpose, we enhanced shared multimodal dialogue data by annotating impressed sentiment labels and topic continuance labels. With this corpus, we develop a multimodal prediction model for the three labels. A multitask learning technique is applied to the binary classification tasks of the three labels, considering the partial similarities among them. The prediction model was efficiently trained even with a small data set (fewer than 2000 samples) thanks to the multitask learning framework. Experimental results show that the multitask deep neural network (DNN) model trained with multimodal features, including linguistic, facial expression, body and head motion, and acoustic features, outperformed those trained as single-task DNNs by up to 1.6 points.
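A rough sketch of a shared-trunk multitask DNN with three binary heads of the kind described; the feature dimensions, layer sizes, and losses are illustrative assumptions rather than the paper's configuration.

```python
# Illustrative multitask DNN: one shared trunk, three binary heads.
import torch
import torch.nn as nn

class MultitaskExchangeModel(nn.Module):
    def __init__(self, in_dim=512, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.heads = nn.ModuleDict({
            "interest": nn.Linear(hidden, 1),
            "sentiment": nn.Linear(hidden, 1),
            "continuance": nn.Linear(hidden, 1),
        })

    def forward(self, x):
        h = self.trunk(x)
        return {k: head(h).squeeze(-1) for k, head in self.heads.items()}

model = MultitaskExchangeModel()
x = torch.randn(4, 512)                      # fused multimodal exchange features
targets = {k: torch.randint(0, 2, (4,)).float()
           for k in ["interest", "sentiment", "continuance"]}
logits = model(x)
# Shared trunk + summed per-task losses: one way multitask learning helps small data.
loss = sum(nn.functional.binary_cross_entropy_with_logits(logits[k], targets[k])
           for k in targets)
print(float(loss))
```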

Multimodal Learning for Identifying Opportunities for Empathetic Responses

Embodied interactive agents possessing emotional intelligence and empathy can create natural and engaging social interactions. Providing appropriate responses by interactive virtual agents requires the ability to perceive users’ emotional states. In this paper, we study and analyze behavioral cues that indicate an opportunity to provide an empathetic response. Emotional tone in language, in addition to facial expressions, is a strong indicator of dramatic sentiment in conversation that warrants an empathetic response. To automatically recognize such instances, we develop a multimodal deep neural network for identifying opportunities when the agent should express positive or negative empathetic responses. We train and evaluate our model using audio, video, and language from human-agent interactions in a Wizard-of-Oz setting, using the wizard’s empathetic responses and annotations collected on Amazon Mechanical Turk as ground-truth labels. Our model outperforms a text-based baseline, achieving an F1-score of 0.71 on three-class classification. We further investigate the results and evaluate the capability of such a model to be deployed for real-world human-agent interactions.

SESSION: Session 3: Touch and Gesture

Dynamic Adaptive Gesturing Predicts Domain Expertise in Mathematics

Embodied Cognition theorists believe that mathematical thinking is embodied in physical activity, such as gesturing while explaining math solutions. This research asks whether expertise in mathematics can be detected by analyzing students’ rate and type of manual gestures. The results reveal several unique findings, including that math experts reduced their total rate of gesturing by 50% compared with non-experts. They also dynamically increased their rate of gesturing on harder problems. Although experts reduced their rate of gesturing overall, they selectively produced 62% more iconic gestures. Iconic gestures are strategic because they assist with retaining spatial information in working memory, so that inferences can be extracted to support correct problem solving. The present results on representation-level gesture patterns are convergent with recent findings on signal-level handwriting, while also contributing a causal understanding of how and why experts adapt their manual activity during problem solving.

VisualTouch: Enhancing Affective Touch Communication with Multi-modality Stimulation

As one of the most important non-verbal communication channels, touch plays an essential role in interpersonal affective communication. Although some researchers have started exploring the possibility of using wearable devices to convey emotional information, most existing devices still lack the capability to support affective and dynamic touch in interaction. In this paper, we explore the effect of dynamic visual cues on the emotional perception of vibrotactile signals. For this purpose, we developed VisualTouch, a haptic sleeve consisting of a haptic layer and a visual layer. We hypothesized that visual cues would enhance the interpretation of tactile cues when both types of cues are congruent. We first carried out an experiment and selected 4 stimuli producing substantially different responses. Based on that, a second experiment was conducted with 12 participants rating the valence and arousal of 36 stimuli using SAM scales.

TouchPhoto: Enabling Independent Picture Taking and Understanding for Visually-Impaired Users

This paper presents TouchPhoto, which provides visual-audio-tactile assistive features to enable visually-impaired users to take and understand photographs independently. A user can take photographs under auditory guidance and record audio tags to aid later recall of the photographs’ contents. For comprehension, the user can listen to audio tags embedded in a photograph while touching salient features, e.g., human faces, using an electrovibration display. We conducted two user studies with visually-impaired users, one for picture taking and the other for understanding and recall, in a two-month interval. They considered auditory assistance as very useful for taking and understanding photographs and tactile features as helpful but to a limited extent.

Creativity Support and Multimodal Pen-based Interaction

Creativity as a skill is associated with the potential to drive both productivity and psychological wellbeing. Since multimodality can foster cognitive ability, multimodal digital tools should also be well suited to supporting creativity as an essentially cognitive skill. In this paper, we explore this notion by presenting a multimodal pen-based interaction technique and studying how it supports creativity. The multimodal solution uses microcontroller technology to augment a digital pen with RGB LEDs and a Leap Motion sensor to enable bimanual input. We report on a user study with 26 participants demonstrating that the multimodal technique is indeed perceived as supporting creativity significantly more than a baseline condition.

Motion Eavesdropper: Smartwatch-based Handwriting Recognition Using Deep Learning

This paper focuses on the real-life scenario in which people handwrite while wearing small mobile devices on their wrists. We explore the possibility of eavesdropping on privacy-related information based on motion signals. To achieve this, we develop a new deep learning-based motion sensing framework with four major components: a recorder, a signal preprocessor, a feature extractor, and a handwriting recognizer. First, we integrate a series of simple yet effective signal processing techniques to purify the sensory data so that it reflects the kinetic properties of a handwriting motion. Then we take advantage of a Multimodal Convolutional Neural Network (MCNN) to extract abstract features. After that, a bidirectional Long Short-Term Memory (BLSTM) network is exploited to model temporal dynamics. Finally, we incorporate the Connectionist Temporal Classification (CTC) algorithm to realize end-to-end handwriting recognition. We prototype our design using a commercial off-the-shelf smartwatch and carry out extensive experiments. The results reveal that our system can robustly achieve an average accuracy of 64% at the character level and 71.9% at the word level, and a 56.6% accuracy rate for words unseen in the training set under certain conditions, exposing the danger of privacy disclosure in daily life.
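The CNN + bidirectional LSTM + CTC pipeline can be sketched roughly as follows; the channel sizes, the 26-letter alphabet, and the synthetic 6-axis IMU input are assumptions, not the authors' configuration.

```python
# Rough sketch of a CNN + BLSTM + CTC recognizer over motion sequences.
import torch
import torch.nn as nn

class MotionHandwritingNet(nn.Module):
    def __init__(self, n_classes=27):            # 26 letters + CTC blank
        super().__init__()
        self.cnn = nn.Sequential(                 # per-timestep feature extractor
            nn.Conv1d(6, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU())
        self.blstm = nn.LSTM(64, 128, bidirectional=True, batch_first=True)
        self.out = nn.Linear(256, n_classes)

    def forward(self, x):                         # x: (batch, time, 6 IMU channels)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.blstm(h)
        return self.out(h).log_softmax(-1)        # (batch, time, classes)

model = MotionHandwritingNet()
x = torch.randn(2, 200, 6)                        # two synthetic motion sequences
log_probs = model(x).transpose(0, 1)              # CTC expects (time, batch, classes)
targets = torch.randint(1, 27, (2, 5))            # fake 5-character labels
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((2,), 200),
                           target_lengths=torch.full((2,), 5))
print(float(loss))
```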

SESSION: Session 4: Physiological Modeling

Predicting Cognitive Load in an Emergency Simulation Based on Behavioral and Physiological Measures

The reliable estimation of cognitive load is an integral step towards real-time adaptivity of learning or gaming environments. We introduce a novel and robust machine learning method for cognitive load assessment based on behavioral and physiological measures in a combined within- and cross-participant approach. Forty-seven participants completed different scenarios of a commercially available emergency personnel simulation game realizing several levels of difficulty based on cognitive load. Using interaction metrics, pupil dilation, eye-fixation behavior, and heart rate data, we trained individual, participant-specific forests of extremely randomized trees differentiating between low and high cognitive load. We achieved an average classification accuracy of 72%. We then apply these participant-specific classifiers in a novel way, using similarity between participants, normalization, and the relative importance of individual features to achieve the same level of classification accuracy in cross-participant classification. These results indicate that a combination of behavioral and physiological indicators allows for reliable prediction of cognitive load in an emergency simulation game, opening up new avenues for adaptivity and interaction.
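A hedged sketch of per-participant Extra-Trees classifiers with a simple similarity-based reuse step for the cross-participant case; the synthetic features and the transfer heuristic are simplified stand-ins for the procedure described above.

```python
# Toy per-participant Extra-Trees classifiers for low vs. high cognitive load.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(1)
participants = {}
for pid in range(5):                              # synthetic participants
    X = rng.normal(size=(120, 8))                 # interaction, pupil, fixation, HR features
    y = (X[:, 0] + 0.3 * rng.normal(size=120) > 0).astype(int)   # low/high load
    clf = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X[:80], y[:80])
    participants[pid] = dict(clf=clf, X_test=X[80:], y_test=y[80:], mean=X[:80].mean(0))

# Cross-participant use: treat participant 0 as the "new" person, normalize their
# features implicitly via feature-mean similarity and reuse the closest classifier.
new_mean = participants[0]["X_test"].mean(0)
closest = min((p for p in participants if p != 0),
              key=lambda p: np.linalg.norm(participants[p]["mean"] - new_mean))
acc = participants[closest]["clf"].score(participants[0]["X_test"], participants[0]["y_test"])
print(f"reused classifier of participant {closest}, accuracy {acc:.2f}")
```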

Driving Anomaly Detection with Conditional Generative Adversarial Network using Physiological and CAN-Bus Data

New developments in advanced driver assistance systems (ADAS) can help drivers deal with risky driving maneuvers, preventing potential hazard scenarios. A key challenge in these systems is to determine when to intervene. While there are situations where the need for intervention or feedback is clear (e.g., lane departure), it is often difficult to determine scenarios that deviate from normal driving conditions. These scenarios can appear due to errors by the driver, the presence of pedestrians or bicycles, or maneuvers from other vehicles. We formulate this problem as driving anomaly detection, where the goal is to automatically identify cases that require intervention. Towards addressing this challenging but important goal, we propose a multimodal system that considers (1) physiological signals from the driver and (2) vehicle information obtained from the controller area network (CAN) bus sensor. The system relies on conditional generative adversarial networks (GANs), where the models are constrained by the signals previously observed. The difference between the discriminator scores of the predicted and actual signals is used as a metric for detecting driving anomalies. We collected and annotated a novel dataset for driving anomaly detection tasks, which is used to validate our proposed models. We present an analysis of the results and perceptual evaluations which demonstrate the discriminative power of this unsupervised approach for detecting driving anomalies.
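Conceptually, the anomaly score is the gap between discriminator scores for the predicted and the actually observed next window, both conditioned on the past. The toy modules below are untrained placeholders that only illustrate the score computation, not the trained conditional GAN.

```python
# Conceptual sketch of the discriminator-score-difference anomaly metric.
import torch
import torch.nn as nn

past_dim, next_dim = 64, 32
G = nn.Sequential(nn.Linear(past_dim, 128), nn.ReLU(), nn.Linear(128, next_dim))
D = nn.Sequential(nn.Linear(past_dim + next_dim, 128), nn.ReLU(),
                  nn.Linear(128, 1), nn.Sigmoid())

past = torch.randn(1, past_dim)            # previously observed driving signals
actual_next = torch.randn(1, next_dim)     # what was actually measured next
with torch.no_grad():
    predicted_next = G(past)               # what "normal" driving would look like
    s_pred = D(torch.cat([past, predicted_next], dim=-1))
    s_real = D(torch.cat([past, actual_next], dim=-1))
anomaly_score = (s_pred - s_real).abs().item()   # large gap -> candidate anomaly
print(anomaly_score)
```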

Controlling for Confounders in Multimodal Emotion Classification via Adversarial Learning

Various psychological factors affect how individuals express emotions. Yet, when we collect data intended for use in building emotion recognition systems, we often try to do so by creating paradigms that focus solely on eliciting emotional behavior. Algorithms trained with these types of data are unlikely to function outside of controlled environments because our emotions naturally change as a function of these other factors. In this work, we study how the multimodal expressions of emotion change when an individual is under varying levels of stress. We hypothesize that stress produces modulations that can hide the true underlying emotions of individuals and that we can make emotion recognition algorithms more generalizable by controlling for variations in stress. To this end, we use adversarial networks to decorrelate stress modulations from emotion representations. We study how stress alters acoustic and lexical emotional predictions, paying special attention to how modulations due to stress affect the transferability of learned emotion recognition models across domains. Our results show that stress is indeed encoded in trained emotion classifiers and that this encoding varies across levels of emotion and across the lexical and acoustic modalities. Our results also show that emotion recognition models that control for stress during training generalize better to new domains than models that do not. We conclude that it is necessary to consider the effect of extraneous psychological factors when building and testing emotion recognition models.
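One common way to realize this kind of adversarial decorrelation is a gradient-reversal adversary on the confound; the sketch below shows that generic mechanism with illustrative dimensions and is not necessarily the authors' exact setup.

```python
# Minimal sketch: remove a confound (stress) from an emotion representation
# via a gradient-reversal adversary. Architecture and sizes are assumptions.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad                     # flip gradients flowing back to the encoder

encoder = nn.Sequential(nn.Linear(40, 64), nn.ReLU())      # acoustic/lexical features -> z
emotion_head = nn.Linear(64, 4)                             # e.g., 4 emotion classes
stress_head = nn.Linear(64, 2)                              # adversary: stressed / not

x = torch.randn(16, 40)
y_emotion = torch.randint(0, 4, (16,))
y_stress = torch.randint(0, 2, (16,))

z = encoder(x)
loss = (nn.functional.cross_entropy(emotion_head(z), y_emotion)
        + nn.functional.cross_entropy(stress_head(GradReverse.apply(z)), y_stress))
loss.backward()      # encoder learns emotion while being pushed to hide stress
print(float(loss))
```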

Multimodal Classification of EEG During Physical Activity

Brain Computer Interfaces (BCIs) typically utilize electroencephalography (EEG) to enable control of a computer through brain signals. However, EEG is susceptible to a large amount of noise, especially from muscle activity, making it difficult to use in ubiquitous computing environments where mobility and physicality are important features. In this work, we present a novel multimodal approach for classifying the P300 event-related potential (ERP) component by coupling EEG signals with nonscalp electrodes (NSE) that measure ocular and muscle artifacts. We demonstrate the effectiveness of our approach on a new dataset where the P300 signal was evoked with participants on a stationary bike under three conditions of physical activity: rest, low-intensity, and high-intensity exercise. We show that the intensity of physical activity impacts the performance of both our proposed model and existing state-of-the-art models. After incorporating signals from nonscalp electrodes, our proposed model performs significantly better in the physical activity conditions. Our results suggest that incorporating additional modalities related to eye movements and muscle activity may improve the efficacy of mobile EEG-based BCI systems, creating the potential for ubiquitous BCIs.

SESSION: Session 5: Sound and interaction

“Paint that object yellow”: Multimodal Interaction to Enhance Creativity During Design Tasks in VR

Virtual Reality (VR) has always been considered a promising medium to support designers with alternative work environments. Still, graphical user interfaces are prone to induce attention shifts between the user interface and the manipulated target objects, which hampers the creative process. This work proposes a speech-and-gesture-based interaction paradigm for creative tasks in VR. We developed a multimodal toolbox (MTB) for VR-based design applications and compared it to a typical unimodal menu-based toolbox (UTB). The comparison uses a design-oriented use case and measures flow, usability, and presence as relevant characteristics for a VR-based design process. The multimodal approach (1) led to a lower perceived task duration and a higher reported feeling of flow. It (2) provided more intuitive use and a lower mental workload while not being slower than the UTB. Finally, it (3) generated a higher feeling of presence. Overall, our results confirm significant advantages of the proposed multimodal interaction paradigm and the developed MTB for important characteristics of design processes in VR.

VCMNet: Weakly Supervised Learning for Automatic Infant Vocalisation Maturity Analysis

Using neural networks to classify infant vocalisations into important subclasses (such as crying versus speech) is an emergent task in speech technology. One of the biggest roadblocks standing in the way of progress lies in the datasets: The performance of a learning model is affected by the labelling quality and size of the dataset used, and infant vocalisation datasets with good quality labels tend to be small. In this paper, we assess the performance of three models for infant VoCalisation Maturity (VCM) trained with a large dataset annotated automatically using a purpose-built classifier and a small dataset annotated by highly trained human coders. The two datasets are used in three different training strategies, whose performance is compared against a baseline model. The first training strategy investigates adversarial training, while the second exploits multi-task learning as the neural network trains on both datasets simultaneously. In the final strategy, we integrate adversarial training and multi-task learning. All of the training strategies outperform the baseline, with the adversarial training strategy yielding the best results on the development set.

Evidence for Communicative Compensation in Debt Advice with Reduced Multimodality

Research has found that professional advice with empathy displays and signs of listening leads to more successful outcomes. These skills are typically displayed through visual nonverbal signals, whereas reduced multimodal contexts have to use other strategies to compensate for the lack of visual nonverbal information. Debt advice is often a highly emotional scenario, but to date there has been no research comparing fully multimodal with reduced multimodal debt advice. The aim of the current study was to compare explicit emotional content (as expressed verbally) and implicit emotional content (as expressed through paralinguistic cues) in face-to-face (FTF) and telephone debt advice recordings. Twenty-two debt advice recordings were coded as emotional or functional and processed through emotion recognition software. The analysis found that FTF recordings included more explicit emotion than telephone recordings did. However, linear mixed-effects modelling found substantially higher levels of arousal and slightly lower levels of valence in telephone advice. Interaction analyses found that emotional speech in FTF advice was characterised by lower levels of arousal than functional speech, whereas emotional speech in telephone advice had higher levels of arousal than functional speech. We conclude that there are differences in emotional content when comparing full and reduced multimodal debt advice. Furthermore, as telephone advice cannot avail of visual nonverbal signals, it seems to compensate by using nonverbal cues present in the voice.
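For readers unfamiliar with the statistical machinery, a toy linear mixed-effects model of arousal by channel with a random intercept per recording can be fit as follows; the data are synthetic and the model specification is an illustrative guess at the kind of analysis described above.

```python
# Toy mixed-effects model: arousal ~ channel, random intercept per recording.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
rows = []
for rec in range(22):                                # 22 synthetic recordings
    channel = "telephone" if rec % 2 else "ftf"
    offset = rng.normal(scale=0.2)                   # per-recording random intercept
    for _ in range(30):
        arousal = 0.4 + (0.3 if channel == "telephone" else 0.0) \
                  + offset + rng.normal(scale=0.3)
        rows.append(dict(recording=rec, channel=channel, arousal=arousal))
df = pd.DataFrame(rows)

model = smf.mixedlm("arousal ~ channel", df, groups=df["recording"]).fit()
print(model.summary())
```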

Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

Speech-driven visual speech synthesis involves mapping acoustic speech features to the corresponding lip animation controls for a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). The lack of synchronized audio, video, and depth data limits the reliable training of DNNs, especially for speaker-independent models. In this paper, we investigate adapting an automatic speech recognition (ASR) acoustic model (AM) for the visual speech synthesis problem. We train the ASR-AM on ten thousand hours of audio-only transcribed speech. The ASR-AM is then adapted to the visual speech synthesis domain using ninety hours of synchronized audio-visual speech. Using a subjective assessment test, we compared the performance of the AM-initialized DNN to a randomly initialized model. The results show that viewers significantly prefer animations generated from the AM-initialized DNN over the ones generated using the randomly initialized model. We conclude that visual speech synthesis can significantly benefit from the powerful representation of speech in ASR acoustic models.

Smooth Turn-taking by a Robot Using an Online Continuous Model to Generate Turn-taking Cues

Turn-taking in human-robot interaction is a crucial part of spoken dialogue systems, but current models do not allow for the human-like turn-taking speed seen in natural conversation. In this work we propose combining two independent prediction models. A continuous model predicts the upcoming end of the turn in order to generate gaze aversion and fillers as turn-taking cues. This prediction is made while the user is speaking, so turn-taking can occur with little silence between turns, or even with overlap. Once a speech recognition result has been received at a later time, a second model uses the lexical information to decide whether or when the turn should actually be taken. We constructed the continuous model using the speaker’s prosodic features as inputs and evaluated its online performance. We then conducted a subjective experiment in which we implemented our model in an android robot and asked participants to compare it to one without turn-taking cues, which produces a response only once a speech recognition result is received. We found that using both gaze aversion and a filler was preferred when the continuous model correctly predicted the upcoming end of turn, while using only gaze aversion was better when the prediction was wrong.

Towards Automatic Detection of Misinformation in Online Medical Videos

Recent years have witnessed a significant increase in the online sharing of medical information, with videos representing a large fraction of such online sources. Previous studies have, however, shown that more than half of the health-related videos on platforms such as YouTube contain misleading information and biases. Hence, it is crucial to build computational tools that can help evaluate the quality of these videos so that users can obtain accurate information to inform their decisions. In this study, we focus on the automatic detection of misinformation in YouTube videos. We select prostate cancer videos as our entry point to tackle this problem. The contribution of this paper is twofold. First, we introduce a new dataset consisting of 250 videos related to prostate cancer, manually annotated for misinformation. Second, we explore the use of linguistic, acoustic, and user engagement features for the development of classification models to identify misinformation. Using a series of ablation experiments, we show that we can build automatic models with accuracies of up to 74%, corresponding to 76.5% precision and 73.2% recall for misinformative instances.
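A simplified sketch of a classifier combining linguistic (transcript) and user-engagement features; the tiny toy dataset, the specific features, and the model choice are assumptions for illustration only, not the authors' feature set.

```python
# Toy misinformation classifier over text + engagement features.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({
    "transcript": ["this cures prostate cancer instantly",
                   "discuss screening options with your doctor",
                   "miracle supplement, no surgery needed",
                   "evidence-based overview of treatment"],
    "likes_ratio": [0.95, 0.80, 0.97, 0.75],
    "views": [120000, 5000, 300000, 9000],
    "misinfo": [1, 0, 1, 0],
})

features = ColumnTransformer([
    ("text", TfidfVectorizer(), "transcript"),
    ("engagement", StandardScaler(), ["likes_ratio", "views"]),
])
clf = Pipeline([("features", features), ("model", LogisticRegression())])
clf.fit(data[["transcript", "likes_ratio", "views"]], data["misinfo"])
print(clf.predict(data[["transcript", "likes_ratio", "views"]]))
```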

SESSION: Session 6: Multiparty interaction

Modeling Team-level Multimodal Dynamics during Multiparty Collaboration

We adopt a multimodal approach to investigating team interactions in the context of remote collaborative problem solving (CPS). Our goal is to understand the multimodal patterns that emerge and their relation to collaborative outcomes. We measured speech rate, body movement, and galvanic skin response from 101 triads (303 participants) who used video conferencing software to collaboratively solve challenging levels in an educational physics game. We use multi-dimensional recurrence quantification analysis (MdRQA) to quantify patterns of team-level regularity, or repeated patterns of activity, in these three modalities. We found that teams exhibit significant regularity above chance baselines. Regularity was unaffected by task factors, but had a quadratic relationship with session time in that it initially increased and then decreased as the session progressed. Importantly, teams that produced more varied behavioral patterns (irregularity) reported higher emotional valence and performed better on a subset of the problem solving tasks. Regularity did not predict arousal or subjective perceptions of the collaboration. We discuss the implications of our findings for the design of systems that aim to improve collaborative outcomes by monitoring the ongoing collaboration and intervening accordingly.
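The core of the regularity measure can be illustrated with a simplified recurrence-rate computation over the three team-level channels; full MdRQA additionally involves embedding parameters and further measures such as determinism, so this is only the central idea on synthetic data.

```python
# Simplified recurrence-rate computation in the spirit of MdRQA.
import numpy as np

rng = np.random.default_rng(3)
T = 200
team = np.stack([rng.normal(size=T),        # speech rate
                 rng.normal(size=T),        # body movement
                 rng.normal(size=T)], 1)    # galvanic skin response
team = (team - team.mean(0)) / team.std(0)  # z-score each channel

dists = np.linalg.norm(team[:, None, :] - team[None, :, :], axis=-1)
radius = np.percentile(dists, 10)           # common choice: fix a target recurrence density
recurrence = dists < radius
recurrence_rate = recurrence.sum() / recurrence.size
print(f"recurrence rate: {recurrence_rate:.3f}")   # higher = more regular team dynamics
```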

Smile and Laugh Dynamics in Naturalistic Dyadic Interactions: Intensity Levels, Sequences and Roles

Smiles and laughs have been the subject of many studies over the past decades, due to their frequent occurrence in interactions, as well as their social and emotional functions in dyadic conversations. In this paper we push forward previous work by providing a first study of the influence that one interacting partner’s smiles and laughs have on their interlocutor’s, taking these expressions’ intensities into account. Our second contribution is a study of the patterns of laugh and smile sequences during the dialogs, again taking intensity into account. Finally, we discuss the effect of the interlocutor’s role on smiling and laughing. To achieve this, we use a database of naturalistic dyadic conversations which was collected and annotated for the purpose of this study. The details of the collection and annotation are also reported here to enable reproduction.

Task-independent Multimodal Prediction of Group Performance Based on Product Dimensions

This paper proposes an approach to developing models that predict group performance on multiple meeting tasks that have no single correct answer. We adopt “product dimensions” [Hackman et al. 1967] (PD), a set of dimensions proposed for describing the general properties of written passages generated by a group, as a metric of group output. We enhanced the MATRICS group discussion corpus, which includes multiple discussion sessions, by annotating it with the PD performance metric. We extract group-level linguistic features, including vocabulary-level features using a word embedding technique and topic segmentation techniques, and functional features based on dialog acts and parts of speech at the word level. We also extract nonverbal features from speech turns, prosody, and head movement. With a corpus including multiple discussion data and an annotation of group performance, we conduct two types of experiments through regression modeling to predict the PD. The first experiment evaluates task-dependent prediction accuracy, where samples obtained from the same discussion task are included in both training and testing. The second experiment evaluates task-independent prediction accuracy, where the type of discussion task differs between the training and testing samples; in this situation, regression models are developed to infer performance on an unknown discussion task. The experimental results show that a support vector regression model achieved a 0.76 correlation in the task-dependent setting and 0.55 in the task-independent setting.

Emergent Leadership Detection Across Datasets

Automatic detection of emergent leaders in small groups from nonverbal behaviour is a growing research topic in social signal processing but existing methods were evaluated on single datasets – an unrealistic assumption for real-world applications in which systems are required to also work in settings unseen at training time. It therefore remains unclear whether current methods for emergent leadership detection generalise to similar but new settings and to which extent. To overcome this limitation, we are the first to study a cross-dataset evaluation setting for the emergent leadership detection task. We provide evaluations for within- and cross-dataset prediction using two current datasets (PAVIS and MPIIGroupInteraction), as well as an investigation on the robustness of commonly used feature channels and online prediction in the cross-dataset setting. Our evaluations show that using pose and eye contact based features, cross-dataset prediction is possible with an accuracy of 0.68, as such providing another important piece of the puzzle towards real-world emergent leadership detection.

A Multimodal Robot-Driven Meeting Facilitation System for Group Decision-Making Sessions

Group meetings are ubiquitous, with millions of meetings held across the world every day. However, meeting quality, group performance, and outcomes are challenged by a variety of dysfunctional behaviors, unproductive social dynamics, and lack of experience in conducting efficient and productive meetings. Previous studies have shown that meeting facilitators can be advantageous in helping groups reach their goals more effectively, but many groups do not have access to human facilitators due to a lack of resources or other barriers. In this paper, we describe the development of a multimodal robotic meeting facilitator that can improve the quality of small group decision-making meetings. This automated group facilitation system uses multimodal sensor inputs (user gaze, speech, prosody, and proxemics), as well as inputs from a tablet application, to intelligently enforce meeting structure, promote time management, balance group participation, and facilitate group decision-making processes. Results of a between-subject study of 20 user groups (N=40) showed that the robot facilitator is accepted by group members, is effective in enforcing meeting structure, and users found it helpful in balancing group participation. We also report design implications derived from the findings of our study.

SESSION: Poster Session

What's behind a choice? Understanding Modality Choices under Changing Environmental Conditions

Interacting with the physical and digital environment multimodally enhances user flexibility and adaptability to different scenarios. A body of research has focused on comparing the efficiency and effectiveness of different interaction modalities in digital environments. However, little is known about user behavior in an environment that provides the freedom to choose from a range of modalities. That is why we take a closer look at the factors that influence input modality choices. Building on the work by Jameson & Kristensson, our goal is to understand how different factors influence user choices. In this paper, we present a study that explores modality choices in a hands-free interaction environment, wherein participants can freely choose and combine three hands-free modalities (gaze, head movements, speech) to execute point-and-select actions in a 2D interface. On the one hand, our results show that users avoid switching modalities more often than we expected, particularly under conditions that should prompt modality switching. On the other hand, when users make a modality switch, user characteristics and the consequences of the experienced interaction have a higher impact on the choice than changes in environmental conditions. Furthermore, among users who switch between modalities, we identified different types of switching behaviors: users who deliberately try to find and choose an optimal modality (single switchers), users who try to find optimal combinations of modalities (multiple switchers), and switching triggered by error occurrence (error-biased switchers). We believe that these results help to further the understanding of when and how to design for multimodal interaction in real-world systems.

Modeling Emotion Influence Using Attention-based Graph Convolutional Recurrent Network

User emotion modeling is a vital problem in social media analysis. Previous studies have considered the content and topology information of social networks in emotion modeling tasks, but not the influence of the current emotional states of other users. We define emotion influence as the emotional impact from a user’s friends in social networks, which is determined by both the network structure and node attributes (the features of friends). In this paper, we try to model emotion influence to help analyze a user’s emotion. The key challenges of this problem are: 1) how to combine content features and network structure to model emotion influence; and 2) how to selectively focus on the social network information most relevant to emotion influence. To tackle these challenges, we propose an attention-based graph convolutional recurrent network that brings together emotion influence and content data. First, we use an attention-based graph convolutional network to selectively aggregate the features of the user’s friends with specific attention weights. Then an LSTM model is used to learn the user’s own content features and the emotion influence. The proposed model is better able to quantify emotion influence in social networks and to combine these sources to analyze the user’s emotional state. We conduct emotion classification experiments to evaluate the effectiveness of our model on a real-world dataset from Sina Weibo. Results show that our model outperforms several state-of-the-art methods.
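A toy sketch of attention-weighted aggregation of friends' features followed by an LSTM over the user's timeline; the dimensions are illustrative and this is not the paper's exact graph-convolutional formulation.

```python
# Toy attention-over-friends aggregation + LSTM over the user's timeline.
import torch
import torch.nn as nn

class EmotionInfluenceModel(nn.Module):
    def __init__(self, feat_dim=32, hidden=64, n_classes=2):
        super().__init__()
        self.att = nn.Linear(2 * feat_dim, 1)                 # score(user, friend)
        self.lstm = nn.LSTM(2 * feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, user_seq, friend_seq):
        # user_seq: (B, T, F); friend_seq: (B, T, N_friends, F)
        B, T, N, F = friend_seq.shape
        u = user_seq.unsqueeze(2).expand(-1, -1, N, -1)
        scores = self.att(torch.cat([u, friend_seq], -1)).softmax(dim=2)
        influence = (scores * friend_seq).sum(dim=2)          # attention-weighted friends
        h, _ = self.lstm(torch.cat([user_seq, influence], -1))
        return self.cls(h[:, -1])                             # emotion at the last step

model = EmotionInfluenceModel()
logits = model(torch.randn(4, 10, 32), torch.randn(4, 10, 5, 32))
print(logits.shape)  # torch.Size([4, 2])
```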

Evaluation of Ultrasound Haptics as a Supplementary Feedback Cue for Grasping in Virtual Environments

This paper presents an evaluation of ultrasound mid-air haptics as a supplementary feedback cue for grasping and lifting virtual objects in Virtual Reality (VR). We present a user study with 27 participants and evaluate 6 different object sizes ranging from 40 mm to 100 mm. We compare three supplementary feedback cues in VR: mid-air haptics, visual feedback (a glow effect), and no supplementary feedback. We report on precision metrics (time to completion, grasp aperture, and grasp accuracy) and interaction metrics (post-test questionnaire, observations, and feedback) to understand general trends and preferences. The results showed an overall preference for visual cues for bigger objects, while ultrasound mid-air haptics were preferred for small virtual targets.

Understanding the Attention Demand of Touch and Tangible Interaction on a Composite Task

Bimanual input is frequently used in touch and tangible interaction on tabletop surfaces. In a composite task, such as moving a set of objects, attention, decision making, and fine motor control have to be phased with the coordination of the two hands. Attention demand is an important factor in designing interaction techniques that are easy to learn and recall. Thus, determining which interaction modality demands less attention, and which one performs better under these conditions, is important for improving design. In this work, we present the first empirical results on this matter. We report that users are consistent in their assessments of the attention demand for both touch and tangible modalities, even under different hand synchronicity and different population sizes and densities. Our findings indicate that the one-hand condition and small populations demand less attention compared to, respectively, the two-hand conditions and bigger populations. We also show that the tangible modality significantly reduces attention demand when using two-handed synchronous movements or when moving sparse populations, and decreases movement time compared to the touch modality without compromising traveled distance. We use our findings to outline a set of guidelines to assist touch and tangible design.

TouchGazePath: Multimodal Interaction with Touch and Gaze Path for Secure Yet Efficient PIN Entry

We present TouchGazePath, a multimodal method for entering personal identification numbers (PINs). Using a touch-sensitive display showing a virtual keypad, the user initiates input with a touch at any location, glances with their eye gaze at the keys bearing the PIN digits, then terminates input by lifting their finger. TouchGazePath is not susceptible to security attacks such as shoulder surfing, thermal attacks, or smudge attacks. In a user study with 18 participants, TouchGazePath was compared with the traditional Touch-Only method and the multimodal Touch+Gaze method, the latter using eye gaze for targeting and touch for selection. The average time to enter a PIN with TouchGazePath was 3.3 s. This was not as fast as Touch-Only (as expected), but was about twice as fast as Touch+Gaze. TouchGazePath was also more accurate than Touch+Gaze. TouchGazePath received high user ratings as a secure PIN input method and was the preferred PIN input method for 11 of 18 participants.

WiBend: Wi-Fi for Sensing Passive Deformable Surfaces

We present WiBend, a system that recognizes bending gestures as an input modality for interacting with non-instrumented, deformable surfaces using Wi-Fi signals. WiBend takes advantage of off-the-shelf 802.11 (Wi-Fi) devices and Channel State Information (CSI) measurements of packet transmissions while the user interacts between a Wi-Fi transmitter and a receiver. We performed extensive user experiments in an instrumented laboratory to obtain data for training the HMM models and for evaluating the precision of WiBend. During the experiments, participants performed 12 distinct bending gestures with three surface sizes, two bending speeds, and two different directions. The performance evaluation results show that WiBend can distinguish between the 12 bending gestures with an average precision of 84%.
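Conceptually, the recognition step can be sketched as one Gaussian HMM per gesture with classification by maximum log-likelihood; the CSI preprocessing is omitted, the data below are synthetic, and the gesture names and dimensions are invented for illustration.

```python
# Toy HMM-based gesture classification over CSI-like feature sequences.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(4)

def fake_csi(offset, n_seq=20, length=60, dim=10):
    """Synthetic stand-in for per-gesture CSI feature sequences."""
    return [rng.normal(loc=offset, size=(length, dim)) for _ in range(n_seq)]

models = {}
for gesture, offset in [("bend_up_fast", 0.0), ("bend_down_slow", 1.0)]:
    seqs = fake_csi(offset)
    X = np.concatenate(seqs)
    lengths = [len(s) for s in seqs]
    models[gesture] = GaussianHMM(n_components=4, covariance_type="diag",
                                  n_iter=20).fit(X, lengths)

test = rng.normal(loc=1.0, size=(60, 10))          # unseen "bend_down_slow" trial
pred = max(models, key=lambda g: models[g].score(test))
print("predicted gesture:", pred)
```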

ElderReact: A Multimodal Dataset for Recognizing Emotional Response in Aging Adults

Automatic emotion recognition plays a critical role in technologies such as intelligent agents and social robots and is increasingly being deployed in applied settings such as education and healthcare. Most research to date has focused on recognizing the emotional expressions of young and middle-aged adults and, to a lesser extent, children and adolescents. Very few studies have examined automatic emotion recognition in older adults (i.e., elders), who represent a large and growing population worldwide. Given that aging causes many changes in facial shape and appearance and has been found to alter patterns of nonverbal behavior, there is strong reason to believe that automatic emotion recognition systems may need to be developed specifically (or augmented) for the elder population. To promote and support this type of research, we introduce a newly collected multimodal dataset of elders reacting to emotion elicitation stimuli. Specifically, it contains 1323 video clips of 46 unique individuals with human annotations of six discrete emotions: anger, disgust, fear, happiness, sadness, and surprise, as well as valence. We present a detailed analysis of the most indicative features for each emotion. We also establish several baselines using unimodal and multimodal features on this dataset. Finally, we show that models trained on datasets of other age groups do not generalize well to elders.

Unsupervised Deep Fusion Cross-modal Hashing

To handle large-scale data in terms of storage and search time, learning to hash has become popular due to its efficiency and effectiveness in approximate cross-modal nearest-neighbor search. To shorten the semantic gap, most existing unsupervised cross-modal hashing methods try to simultaneously minimize the loss of intra-modal similarity and the loss of inter-modal similarity. However, these models cannot guarantee in theory that the two losses are simultaneously minimized. In this paper, we first theoretically prove, with the aid of variational inference, that cross-modal hashing can be implemented by preserving both intra-modal and inter-modal similarity, and we point out that maximizing intra- and inter-modal similarity is mutually constrained. We therefore propose an unsupervised cross-modal hashing framework named Unsupervised Deep Fusion Cross-modal Hashing (UDFCH), which leverages data fusion to capture the underlying manifold across modalities and avoid the above problem. Moreover, to reduce the quantization loss, we sample hash codes from different Bernoulli distributions through a reparameterization trick. Our UDFCH framework has two stages: the first stage mines the intra-modal structure of each modality, and the second stage determines the modality-aware hash codes by sufficiently considering the correlation and manifold structure among modalities. A series of experiments conducted on three benchmark datasets shows that the proposed UDFCH framework outperforms state-of-the-art methods on different cross-modal retrieval tasks.

DIF : Dataset of Perceived Intoxicated Faces for Drunk Person Identification

Traffic accidents cause over a million deaths every year, of which a large fraction is attributed to drunk driving. An automated intoxicated-driver detection system in vehicles would be useful in reducing accidents and the related financial costs. Existing solutions require special equipment such as electrocardiograms, infrared cameras, or breathalyzers. In this work, we propose a new dataset called DIF (Dataset of perceived Intoxicated Faces), which contains audio-visual data of intoxicated and sober people obtained from online sources. To the best of our knowledge, this is the first work on automatic bimodal non-invasive intoxication detection. Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs) are trained to compute the video and audio baselines, respectively. A 3D CNN is used to exploit the spatio-temporal changes in the video. A simple variation of the traditional 3D convolution block is proposed, based on inducing non-linearity between the spatial and temporal channels. Extensive experiments are performed to validate the approach and baselines.
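The spirit of such a factored block, a spatial convolution followed by a temporal one with a non-linearity in between, can be sketched generically in PyTorch; this is a generic sketch, not the paper's exact block design.

```python
# Generic spatial-then-temporal factored 3D convolution block.
import torch
import torch.nn as nn

class SpatialThenTemporalConv(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.ReLU()                      # non-linearity between spatial and temporal

    def forward(self, x):                         # x: (batch, channels, frames, height, width)
        return self.act(self.temporal(self.act(self.spatial(x))))

block = SpatialThenTemporalConv(3, 16, 32)
print(block(torch.randn(1, 3, 8, 64, 64)).shape)   # torch.Size([1, 32, 8, 64, 64])
```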

Generative Model of Agent’s Behaviors in Human-Agent Interaction

A social interaction implies a social exchange between two or more persons, in which they adapt and adjust their behaviors in response to their interaction partners. With the growing interest in human-agent interactions, it is desirable to make these interactions more natural and human-like. In this context, we aim to enhance the quality of the interaction between a user and an Embodied Conversational Agent (ECA) by endowing the ECA with the capacity to adapt its behavior in real time according to the user’s behavior. The novelty of our approach is to model the agent’s nonverbal behaviors as a function of both the agent’s and the user’s behaviors jointly with the agent’s communicative intentions, creating a dynamic loop between both interactants. Moreover, we capture the variation of behavior over time through an LSTM-based model. Our model, IL-LSTM (Interaction Loop LSTM), predicts the agent’s next behavior taking into account the behavior that both the agent and the user have displayed within a time window. We conducted an evaluation study involving an agent interacting with visitors in a science museum. Results show that participants have a better experience and are more engaged in the interaction when the agent adapts its behavior to theirs, thus creating an interactive loop.

Improved Visual Focus of Attention Estimation and Prosodic Features for Analyzing Group Interactions

Collaborative group tasks require efficient and productive verbal and non-verbal interactions among the participants. Studying such interaction patterns could help groups perform more efficiently, but the detection and measurement of human behavior is challenging since it is inherently multimodal and changes on a millisecond time frame. In this paper, we present a method to study groups performing a collaborative decision-making task using non-verbal behavioral cues. First, we present a novel algorithm to estimate the visual focus of attention (VFOA) of participants using frontal cameras. The algorithm can be used in various group settings, and performs with a state-of-the-art accuracy of 90%. Secondly, we present prosodic features for non-verbal speech analysis. These features are commonly used in speech/music classification tasks, but are rarely used in human group interaction analysis. We validate our algorithms on a multimodal dataset of 14 group meetings with 45 participants, and show that a combination of VFOA-based visual metrics and prosodic-feature-based metrics can predict emergent group leaders with 64% accuracy and dominant contributors with 86% accuracy. We also report our findings on the correlations between the non-verbal behavioral metrics with gender, emotional intelligence, and the Big 5 personality traits.

DeepReviewer: Collaborative Grammar and Innovation Neural Network for Automatic Paper Review

Nowadays, more and more papers are submitted to various periodicals and conferences. Typically, reviewers need to read through a paper and give it a review comment and score based on certain criteria. This review process is labor intensive and time-consuming. Recently, AI technology has been widely used to alleviate human labor. Can machines learn from humans to review papers automatically? In this paper, we propose a collaborative grammar and innovation model, DeepReviewer, for automatic paper review. The model learns the semantic, grammatical, and innovation-related features of an article through three well-designed components simultaneously. These three factors are then integrated by an attention layer to obtain the final review score of the paper. We crawled paper review data from OpenReview and built a real-world dataset. Experimental results demonstrate that our model exceeds many baselines.

CorrFeat: Correlation-based Feature Extraction Algorithm using Skin Conductance and Pupil Diameter for Emotion Recognition

To recognize emotions using less obtrusive wearable sensors, we present a novel emotion recognition method that uses only pupil diameter (PD) and skin conductance (SC). Psychological studies show that these two signals are related to the attention level of humans exposed to visual stimuli. Based on this, we propose a feature extraction algorithm that extracts correlation-based features for participants watching the same video clip. To boost performance given limited data, we implement a learning system without a deep architecture to classify arousal and valence. Our method outperforms not only state-of-the-art approaches, but also widely used traditional and deep learning methods.

Multimodal Behavioral Markers Exploring Suicidal Intent in Social Media Videos

Suicide is one of the leading causes of death in the modern world. In this digital age, individuals increasingly use social media to express themselves, and often use these platforms to express suicidal intent. Various studies have inspected suicidal-intent behavioral markers in controlled environments, but it remains unexplored whether such markers generalize to suicidal intent expressed on social media. In this work, we set out to study multimodal behavioral markers related to suicidal intent expressed in social media videos. We explore verbal, acoustic, and visual behavioral markers in the context of identifying individuals at higher risk of a suicide attempt. Our analysis reveals that frequent silences, slouched shoulders, rapid hand movements, and profanity are predominant multimodal behavioral markers indicative of suicidal intent.

Estimating Uncertainty in Task-Oriented Dialogue

Situated multimodal systems that instruct humans need to handle user uncertainty, as expressed in behaviour, and plan their actions accordingly. A speaker's decision to reformulate or repair previous utterances depends greatly on the listener's signals of uncertainty. In this paper, we estimate uncertainty in a situated guided task from non-verbal cues expressed by the listener, and predict whether the speaker will reformulate their utterance. We use a corpus in which people give instructions on how to assemble furniture, and extract multimodal features from it. While uncertainty is in some cases expressed verbally, most instances are expressed non-verbally, which indicates the importance of multimodal approaches. In this work, we present a model for uncertainty estimation. Our findings indicate that uncertainty estimation from non-verbal cues works well, and can exceed human annotator performance when verbal features cannot be perceived.

Determining Iconic Gesture Forms based on Entity Image Representation

Iconic gestures are used to depict physical objects mentioned in speech, and the gesture form is assumed to be based on the image of a given object in the speaker’s mind. Using this idea, this study proposes a model that learns iconic gesture forms from an image representation obtained from pictures of physical entities. First, we collect a set of pictures of each entity from the web, and create an average image representation from them. Subsequently, the average image representation is fed to a fully connected neural network to decide the gesture form. In the model evaluation experiment, our two-step gesture form selection method can classify seven types of gesture forms with over 62% accuracy. Furthermore, we demonstrate an example of gesture generation in a virtual agent system in which our model is used to create a gesture dictionary that assigns a gesture form for each entry word in the dictionary.
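As a rough sketch of the pipeline described above (and only that; the embedding dimension, hidden size, and function names are assumptions), the entity's web images can be embedded, averaged, and passed to a small fully connected classifier over the seven gesture forms:

    import numpy as np
    import torch
    import torch.nn as nn

    def average_entity_representation(image_embeddings):
        # Mean of precomputed image embeddings collected for one entity.
        return np.mean(np.stack(image_embeddings), axis=0)

    gesture_classifier = nn.Sequential(
        nn.Linear(2048, 256),  # 2048-dim image features are an assumption
        nn.ReLU(),
        nn.Linear(256, 7),     # seven gesture form classes, as in the abstract
    )

    def predict_gesture_form(image_embeddings):
        avg = torch.from_numpy(average_entity_representation(image_embeddings)).float()
        return gesture_classifier(avg.unsqueeze(0)).argmax(dim=1).item()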

Interaction Process Label Recognition in Group Discussion

In quantifying and analyzing the performance of group interaction, interaction process analysis (IPA), as defined by Bales, is considered a useful approach. IPA is a system for labeling the interaction process with a total of 12 interaction categories. Automating IPA can close the gap created by the manpower required for manual coding and can efficiently quantify group performance. In this paper, we present a computational interaction process analysis by developing a model to recognize IPA categories. We extract both verbal and nonverbal features for IPA category recognition with SVM, RF, DNN, and LSTM machine learning algorithms, and analyze the contribution of multimodal and unimodal features on the total data and for each label. We also investigate the effect of context information by training an LSTM on sequences of different lengths and evaluating them. The results show that multimodal features achieve the best performance, with an F1 score of 0.601 for the recognition of the 12 IPA categories on the total data. Multimodal features outperform unimodal features on the total data and for most labels. The investigation of context information shows that, with a suitable sequence length, longer sequences achieve the best F1 score of 0.602 and better recognition performance.

Exploring Transfer Learning between Scripted and Spontaneous Speech for Emotion Recognition

Internet of Things technologies yield large amounts of real-life speech data related to human emotions. Yet, labelled data of human emotion from spontaneous speech are extremely limited due to the difficulties of annotating such large volumes of audio samples. A potential way to address this limitation is to augment emotion models of spontaneous speech with fully annotated data collected using scripted scenarios. We investigate whether and to what extent knowledge related to speech emotional content can be transferred between datasets of scripted and spontaneous speech. We implement transfer learning through: (1) a feed-forward neural network trained on the source data, whose last layers are fine-tuned on the target data; and (2) a progressive neural network retaining a pool of pre-trained models and learning lateral connections between the source and target tasks. We explore the effectiveness of the proposed approach using four publicly available datasets of emotional speech. Our results indicate that transfer learning can effectively leverage corpora of scripted data to improve emotion recognition performance for spontaneous speech.
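For approach (1), a minimal sketch of fine-tuning only the last layers of a source-trained feed-forward network might look as follows; the layer sizes, the four-class output, and the helper name are illustrative assumptions rather than the authors' configuration.

    import torch.nn as nn

    # Feed-forward emotion model, first trained on scripted (source) speech features.
    model = nn.Sequential(
        nn.Linear(384, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 4),   # e.g. four emotion classes
    )

    def prepare_for_finetuning(model, num_trainable_layers=2):
        # Freeze all parameters, then unfreeze only the last Linear layers so that
        # target (spontaneous) data adapts the upper part of the network.
        for p in model.parameters():
            p.requires_grad = False
        linear_layers = [m for m in model if isinstance(m, nn.Linear)]
        for layer in linear_layers[-num_trainable_layers:]:
            for p in layer.parameters():
                p.requires_grad = True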

Engagement Modeling in Dyadic Interaction

In recent years, engagement modeling has gained increasing attention due to the important role it plays in human-agent interaction. The agent should be able to detect, in real time, the engagement level of the user in order to react accordingly. In this context, our goal is to develop a computational model that predicts the user's engagement level in real time. Relying on previous findings, we use facial expressions, head movements, and gaze direction as predictive features. Moreover, engagement is measured not from single cues alone, but from the combination of several cues that arise over a certain time window. Thus, for better engagement prediction, we consider the variation of multimodal behaviors over time. To this end, we rely on an LSTM that can jointly model the temporality and sequentiality of multimodal behaviors.

SESSION: Doctoral Consortium

Detecting Temporal Phases of Anxiety in The Wild: Toward Continuously Adaptive Self-Regulation Technologies

Anxiety disorders are becoming more prevalent; therefore, the demand for mobile anxiety self-regulation technologies is rising. However, existing regulation technologies are not yet able to guide suitable interventions to a user in a timely manner, mainly due to the lack of maturity in the anxiety detection area. Hence, this research aims to (1) identify potential temporal phases of anxiety that could become effective personalization parameters for regulation technologies, (2) detect such phases by collecting and analyzing multimodal indicators of anxiety, and (3) design self-regulation technologies that can guide suitable interventions for the detected anxiety phase. Based on an exploratory study conducted with therapists treating anxiety disorders, potential temporal phases and common indicators of anxiety were identified. The design of the anxiety detection and regulation technologies is currently in progress. The proposed research methodology and expected contributions are further discussed in this paper.

Multimodal Machine Learning for Interactive Mental Health Therapy

Mental health disorders are among the leading causes of disability. Despite the prevalence of mental health disorders, there is a large gap between the needs and resources available for their assessment and treatment. Automatic behaviour analysis for computer-aided mental health assessment can augment clinical resources in the diagnosis and treatment of patients. Intelligent systems like virtual agents and social robots can have a large impact by deploying multimodal machine learning to perceive and interact with patients in interactive scenarios for probing behavioral cues of mental health disorders. In this paper, we propose our plans for developing multimodal machine learning methods for augmenting embodied interactive agents with emotional intelligence, toward probing cues of mental health disorders. We aim to develop a new generation of intelligent agents that can create engaging interactive experiences for assisting with mental health assessments.

Tailoring Motion Recognition Systems to Children’s Motions

Motion-based applications are becoming increasingly popular among children and require accurate motion recognition to ensure meaningful interactive experiences. However, motion recognizers are usually trained on adults’ motions. Children and adults differ in terms of their body proportions and development of their neuromuscular systems, so children and adults will likely perform motions differently. Hence, motion recognizers tailored to adults will likely perform poorly for children. My PhD thesis will focus on identifying features that characterize children’s and adults’ motions. This set of features will provide a model that can be used to understand children’s natural motion qualities and will serve as the first step in tailoring recognizers to children’s motions. This paper describes my past and ongoing work toward this end and outlines the next steps in my PhD work.

Multi-modal Fusion Methods for Robust Emotion Recognition using Body-worn Physiological Sensors in Mobile Environments

High-accuracy physiological emotion recognition typically requires participants to wear or attach obtrusive sensors (e.g., an electroencephalograph). To achieve precise emotion recognition using only wearable body-worn physiological sensors, my doctoral work focuses on researching and developing a robust fusion system across different physiological sensors. Developing such a fusion system poses three problems: (1) how to pre-process signals with different temporal characteristics and noise models, (2) how to train the fusion system with limited labeled data, and (3) how to fuse multiple signals with inaccurate and inexact ground truth. To overcome these challenges, I plan to explore semi-supervised, weakly supervised, and unsupervised machine learning methods to obtain precise emotion recognition in mobile environments. By developing such techniques, we can measure user engagement with larger numbers of participants and apply emotion recognition techniques in a variety of scenarios, such as mobile video watching and online education.

Communicative Signals and Social Contextual Factors in Multimodal Affect Recognition

One research branch in Affective Computing focuses on using multimodal 'emotional' expressions (e.g. facial expressions or non-verbal vocalisations) to automatically detect the emotions and affect experienced by persons. The field is increasingly interested in using contextual factors to better infer emotional expressions rather than relying solely on the emotional expressions themselves. We are interested in expressions that occur in a social context. In our research we plan to investigate how we can (a) utilise communicative signals that are displayed during interactions to recognise social contextual factors that influence emotion expression, and in turn (b) predict/recognise what these emotion expressions most likely communicate given the context. To achieve this, we formulate three main research questions: (I) How do communicative signals such as emotion expressions coordinate behaviours and knowledge between interlocutors in interactive settings? (II) Can we use behavioural cues during interactions to detect social contextual factors relevant for interpreting affect? (III) Can we use social contextual factors and communicative signals to predict what emotion experience is linked to an emotion expression?

Co-located Collaboration Analytics

Collaboration is an important skill of the 21st century. It can take place in an online (or remote) setting or in a co-located (or face-to-face) setting. With the large-scale adoption of sensors, studies on co-located collaboration (CC) have gained momentum. CC takes place in physical spaces where the group members share each other's social and epistemic space. This involves subtle multimodal interactions such as gaze, gestures, speech, and discourse, which are complex in nature. The aim of this PhD is to detect these interactions and then use these insights to build an automated real-time feedback system to facilitate co-located collaboration.

Coalescing Narrative and Dialogue for Grounded Pose Forecasting

This research aims to create a data-driven end-to-end model for multimodal forecasting of the body pose and gestures of virtual avatars. A novel aspect of this research is coalescing both narrative and dialogue for pose forecasting. In a narrative, language is used in a third-person view to describe the avatar's actions. In dialogue, both first- and second-person views need to be integrated to accurately forecast the avatar's pose. A speaker's gestures and poses are linked to other modalities: language and acoustics. We use these correlations to better predict the avatar's pose.

Attention-driven Interaction Systems for Augmented Reality

Augmented reality (AR) glasses enable the embedding of visual content in real-world surroundings. In this PhD project, I will implement user interfaces that adapt to the cognitive state of the user, for example by avoiding distractions or re-directing the user's attention towards missed information. For this purpose, sensory data from the user is captured (brain activity via EEG or fNIRS, eye tracking, physiological measurements) and modeled with machine learning techniques. The cognitive state estimation focuses on attention-related aspects. The main task is to build models for estimating a person's attentional state from the combination and classification of multimodal data streams and context information, as well as to evaluate them. Furthermore, the goal is to develop prototypical user interfaces for AR glasses and to test their usability in different scenarios.

Multimodal Driver Interaction with Gesture, Gaze and Speech

The ever-growing research in computer vision has created new avenues for user interaction. Speech commands and gesture recognition are already being applied alongside various touch-based inputs. It is therefore foreseeable that the use of multimodal input methods for user interaction is the next phase of development. In this paper, I propose a research plan of novel methods for using multimodal inputs for the semantic interpretation of human-computer interaction, specifically applied to a car driver. A fusion methodology has to be designed that adequately makes use of recognized gestures (specifically finger pointing), eye gaze, and head pose for the identification of reference objects, while using the semantics from speech for a natural interactive environment for the driver. The proposed plan includes different techniques based on artificial neural networks for the fusion of the camera-based modalities (gaze, head, and gesture). It then combines features extracted from speech with the fusion algorithm to determine the intent of the driver.

SESSION: Demo and Exhibit Session

The Dyslexperience: Use of Projection Mapping to Simulate Dyslexia*

There is a lack of awareness about dyslexia among people in our society. More often than not, there are many misconceptions surrounding the diagnosis of dyslexia, leading to misjudgements and misunderstanding about dyslexics from the workplace to school. This paper presents a multimodal interactive installation designed to communicate the emotional ordeal faced by dyslexics, allowing those who do not understand to see through the lens of those with dyslexia. The main component of this installation is a projection mapping technique used to enhance typography, simulating the experience of dyslexia. Projection mapping makes it possible to create a natural augmented information presentation method on the tangible surface of a specially designed printed book. The user interface combines a color–tracking sensor and a projection to create a camera–projector system. The described system performs tabletop object detection and automatic projection mapping, using page flipping as user interaction. Such a system can be adapted to fit different contexts and installation spaces, for the purpose of education and awareness. There is also the potential to conduct further research with real dyslexia patients.

A Real-Time Scene Recognition System Based on RGB-D Video Streams

Depth data captured by cameras such as the Microsoft Kinect provides additional geometric information beyond traditional RGB data and is more robust to different environments, such as dim or dark lighting conditions. In this technical demonstration, we build a scene recognition system based on real-time processing of RGB-D video streams. Our system recognizes scenes from video clips, with three types of threads implemented to ensure real-time operation. The system first buffers the frames of both the RGB and depth videos with the capturing threads. When the buffered videos reach a certain length, the frames are packed into clips and forwarded through a pre-trained C3D model to predict scene labels with the scene recognition thread. Finally, the predicted scene labels and captured videos are shown in our user interface by the illustration thread.
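A simplified sketch of the three-thread design is given below; the camera, model, and UI objects, the buffer sizes, and the clip length are placeholders for illustration, not details from the demo.

    import queue
    import threading

    frame_buffer = queue.Queue(maxsize=256)   # filled by the capturing threads
    clip_queue = queue.Queue(maxsize=8)       # consumed by the recognition thread
    CLIP_LEN = 16                             # frames per clip (illustrative)

    def capture_loop(camera):
        while True:
            frame_buffer.put(camera.read())   # one RGB-D frame pair per call

    def packing_loop():
        clip = []
        while True:
            clip.append(frame_buffer.get())
            if len(clip) == CLIP_LEN:
                clip_queue.put(clip)
                clip = []

    def recognition_loop(c3d_model, ui):
        while True:
            clip = clip_queue.get()
            ui.show(clip, c3d_model.predict(clip))  # illustration thread displays label

    # Each loop would run in its own thread, e.g.:
    # threading.Thread(target=packing_loop, daemon=True).start()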

Hang Out with the Language Assistant

AI assistants have found their place in households, but most existing assistants use single-modality interaction. We present a language assistant for kids called Hola (Hang out with the Language Assistant), which is a truly multimodal assistant. Hola is a small mobile-robot-based assistant capable of understanding the objects around it and responding to questions about objects that it can see. Hola is also able to adjust the camera position and its own position to make an extra attempt to understand an object using its robot control mechanism. The technology behind it uses a combination of natural language understanding, object detection, and hand pose detection. In addition, Hola supports reading books to children in the form of storytelling, using OCR. Children can ask about any word that they do not understand, and Hola retrieves the information from the internet and tells them the meaning and other details of the word. After reading the book or a page, the robot asks the child questions based on the words used in the book to confirm the child's understanding.

A Searching and Automatic Video Tagging Tool for Events of Interest during Volleyball Training Sessions

Quick and easy access to performance data during matches and training sessions is important for both players and coaches. While there are many video tagging systems available, these systems require manual effort. This paper proposes a system architecture that automatically supplements video recording by detecting events of interests in volleyball matches and training sessions to provide tailored and interactive multi-modal feedback.

Seeing Is Believing but Feeling Is the Truth: Visualising Mid-Air Haptics in Oil Baths and Lightboxes

Ultrasound is beyond the range of human hearing and tactile perception. In the past few years, several modulation techniques have been invented to overcome this and evoke perceptible tactile sensations of shapes and textures that can be felt, but not seen. Therefore, mid-air haptic technology has found use in several human computer interaction applications and is the focus of multiple research efforts. Visualising the induced acoustic pressure field can help understand and optimise how different modulation techniques translate into tactile sensations. Here, rather than using acoustic simulation tools to do that, we exploit the micro-displacement of a thin layer of oil to visualize the impinging acoustic pressure field outputted from an ultrasonic phased array device. Our demo uses a light source to illuminate the oil displacement and project it onto a screen to produce an interactive lightbox display. Interaction is facilitated via optical hand-tracking technology thus enabling an instantaneous and aesthetically pleasing visualisation of mid-air haptics.

Chemistry Pods: A Multimodal Real Time and Retrospective Tool for the Classroom

Instructors are often multitasking in the classroom. This makes it increasingly difficult for them to pay attention to each individual’s engagement especially during activities where students are working in groups. In this paper, we describe a system that aids instructors in supporting group collaboration by utilizing a centralized, easy-to-navigate dashboard connected to multiple pods dispersed among groups of students in a classroom or laboratory. This allows instructors to check multiple qualities of the discussion such as: the usage of instructor specified keywords, relative participation of each individual, the speech acts students are using and different emotional characteristics of group language.

A Proxemics Measurement Tool Integrated into VAIF and Unity

SESSION: Challenge 1: The 1st Chinese Audio-Textual Spoken Language Understanding Challenge

Transfer Learning Methods for Spoken Language Understanding

In this paper, we present a series of methods to improve the performance of spoken language understanding in the 1st Chinese Audio-Textual Spoken Language Understanding Challenge (CATSLU 2019), which aims to improve robustness to automatic speech recognition (ASR) errors and to address the shortage of labeled data in new domains. We combine word-level and character-level information to improve the performance of the semantic parser. We also use transfer learning methods such as correlation alignment to improve the robustness of the spoken language understanding system. We then merge the rule-based method and the neural network method to improve system output performance. In the video and weather domains, which have little training data, we use both a transfer learning model trained on multi-domain data and the rule-based approach. Our approaches achieve F1 scores of 86.83%, 92.84%, 94.16%, and 93.04% on the test sets of the map, music, video, and weather domains.
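Correlation alignment is mentioned but not detailed in the abstract; a generic CORAL-style loss, sketched here only as an illustration over plain feature matrices, penalizes the difference between source- and target-domain feature covariances:

    import torch

    def coral_loss(source_feats, target_feats):
        # Match second-order statistics (feature covariances) across domains.
        def covariance(x):
            x = x - x.mean(dim=0, keepdim=True)
            return (x.t() @ x) / (x.size(0) - 1)
        d = source_feats.size(1)
        diff = covariance(source_feats) - covariance(target_feats)
        return (diff * diff).sum() / (4.0 * d * d)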

Streamlined Decoder for Chinese Spoken Language Understanding

As a critical component of a spoken dialog system (SDS), spoken language understanding (SLU) attracts a lot of attention, especially methods based on unaligned data. Recently, an approach was proposed that utilizes the hierarchical relationship among act-slot-value triples. However, it ignores the transfer of internal information, which may record intermediate information from the upper level and contribute to the prediction of the lower level. We therefore propose a novel streamlined decoding structure with an attention mechanism, which uses three successively connected RNNs to decode the act, slot, and value respectively. In the first Chinese Audio-Textual Spoken Language Understanding Challenge (CATSLU), our model exceeds the state-of-the-art model on an unaligned multi-turn task-oriented Chinese spoken dialogue dataset provided by the contest.

CATSLU: The 1st Chinese Audio-Textual Spoken Language Understanding Challenge

Spoken language understanding (SLU) is a key component of conversational dialogue systems, converting user utterances into semantic representations. Previous work focuses almost exclusively on parsing semantics from textual inputs (the top hypothesis of speech recognition or even manual transcripts), losing the information hidden in the audio. We herein describe the 1st Chinese Audio-Textual Spoken Language Understanding Challenge (CATSLU), which introduces a new dataset with audio-textual information, multiple domains, and domain knowledge. We introduce two scenarios of audio-textual SLU, in which participants are either encouraged to utilize data from other domains or not. In this paper, we describe the challenge and its results.

Multi-Classification Model for Spoken Language Understanding

Spoken language understanding (SLU) is an important part of a spoken dialogue system (SDS). In this paper, we focus on how to extract a set of act-slot-value tuples from users' utterances in the 1st Chinese Audio-Textual Spoken Language Understanding Challenge (CATSLU). We adopt a pretrained BERT model to encode users' utterances and build multiple classifiers to obtain the required tuples. In our framework, identifying acts and slot values are treated as separate classification tasks, and such multi-task training is expected to help the encoder gain a better understanding of the utterance. Since the system is built on transcriptions produced by automatic speech recognition (ASR), some tricks are applied to correct errors in the tuples. We also found that rebuilding the tuples using the minimum edit distance (MED) between the predicted results and the candidate values was beneficial in our experiments.
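The minimum-edit-distance correction can be pictured as follows; this is a generic sketch under the assumption that each slot has a known list of candidate values from the ontology, not the authors' exact procedure.

    def edit_distance(a, b):
        # Standard dynamic-programming Levenshtein distance between two strings.
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
        return dp[-1]

    def correct_value(predicted_value, candidate_values):
        # Replace an ASR-corrupted value with the closest candidate from the ontology.
        return min(candidate_values, key=lambda c: edit_distance(predicted_value, c))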

Robust Spoken Language Understanding with Acoustic and Domain Knowledge

Spoken language understanding (SLU) converts user utterances into structured semantic forms. Two main issues remain for SLU: robustness to ASR errors and the data sparsity of new and extended domains. In this paper, we propose a robust SLU system that leverages both acoustic and domain knowledge. We extract audio features by training ASR models on a large number of utterances without semantic annotations. To exploit domain knowledge, we design lexicon features from the domain ontology and propose an error elimination algorithm that helps recover predicted values corrupted by ASR errors. The results of the CATSLU challenge show that our systems outperform all other teams across the four domains.

SESSION: Challenge 2: The 1st Mandarin Audio-Visual Speech Recognition Challenge (MAVSR)

Spotting Visual Keywords from Temporal Sliding Windows

Visual keyword spotting (KWS), a newly proposed task deriving from visual speech recognition, has plenty of room for improvement. This paper details our visual keyword spotting system used in the first Mandarin Audio-Visual Speech Recognition Challenge (MAVSR 2019). Under the assumption that the vocabulary of the target dataset is a subset of the vocabulary of the training set, we propose a simple and scalable classification-based strategy that achieves 19.0% mean average precision (mAP) in this challenge. Our method uses sliding windows to bridge the word-level and sentence-level datasets, showing that a strong word-level classifier can be used directly to build sentence embeddings, thereby making it possible to build a KWS system.
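Conceptually, the sliding-window strategy can be sketched as below; the window and stride lengths and the classifier interface are assumptions for illustration only.

    import torch

    def spot_keywords(frame_features, word_classifier, window=30, stride=10):
        # Slide fixed-length windows over a sentence-level video, score each window
        # with a word-level classifier, and keep the maximum score per keyword.
        scores = []
        num_frames = frame_features.size(0)
        for start in range(0, max(num_frames - window, 0) + 1, stride):
            clip = frame_features[start:start + window].unsqueeze(0)
            scores.append(torch.softmax(word_classifier(clip), dim=-1))
        return torch.cat(scores, dim=0).max(dim=0).values  # one score per keyword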

Deep Audio-visual System for Closed-set Word-level Speech Recognition

Audio-visual understanding is usually challenged by the gap in bridging complementary audio and visual information. Motivated by recent audio-visual studies, a closed-set word-level speech recognition scheme is proposed for the Mandarin Audio-Visual Speech Recognition (MAVSR) Challenge in this study. To initialize the audio and visual encoders more effectively, a 3-dimensional convolutional neural network (CNN) and an attention-based bi-directional long short-term memory (Bi-LSTM) network are trained. With two fully connected layers on top of the concatenated encoder outputs for audio-visual joint training, the proposed scheme won first place with a relative word accuracy improvement of 7.9% over the audio-only system. Experiments on the LRW-1000 dataset demonstrate that the proposed joint training scheme with audio-visual incorporation enhances recognition performance on relatively short-duration samples, revealing the multi-modal complementarity.

SESSION: Challenge 3: Seventh Emotion Recognition in the Wild Challenge (EmotiW)

EmotiW 2019: Automatic Emotion, Engagement and Cohesion Prediction Tasks

This paper describes the Seventh Emotion Recognition in the Wild (EmotiW) Challenge. The EmotiW benchmarking platform provides researchers with an opportunity to evaluate their methods on affect labelled data. This year EmotiW 2019 encompasses three sub-challenges: a) Group-level cohesion prediction; b) Audio-Video emotion recognition; and c) Student engagement prediction. We discuss the databases used, the experimental protocols and the baselines.

Bootstrap Model Ensemble and Rank Loss for Engagement Intensity Regression

This paper presents our approach for the engagement intensity regression task of EmotiW 2019. The task is to predict the engagement intensity of a student watching an online MOOC video under various conditions. Building on our winning solution from last year, we mainly explore head and body features with a bootstrap strategy and two novel loss functions. We maintain the framework of multi-instance learning with a long short-term memory (LSTM) network and make three contributions. First, in addition to the gaze and head pose features, we explore facial landmark features in our framework. Second, inspired by the fact that engagement intensity values can be ranked, we design a rank loss as a regularization that enforces a distance margin between the features of distant category pairs and adjacent category pairs. Third, we use classical bootstrap aggregation to perform model ensembling, which randomly samples the training data several times and then averages the model predictions. We evaluate the performance of our method and discuss the influence of each part on the validation dataset. Our method finally wins 3rd place with an MSE of 0.0626 on the testing set. https://github.com/kaiwang960112/EmotiW_2019_engagement_regression
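The rank loss is described only at a high level; a hedged reconstruction of the idea, enforcing that feature distances respect the ordering of the engagement labels by a margin, could look like this (the margin value and pair selection are assumptions):

    import torch

    def rank_loss(features, labels, margin=1.0):
        # For every anchor i, require pairs with more distant labels to also be
        # farther apart in feature space than pairs with closer labels.
        loss, count = torch.tensor(0.0), 0
        n = features.size(0)
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    if torch.abs(labels[i] - labels[j]) > torch.abs(labels[i] - labels[k]):
                        dist_far = torch.norm(features[i] - features[j])
                        dist_near = torch.norm(features[i] - features[k])
                        loss = loss + torch.relu(margin + dist_near - dist_far)
                        count += 1
        return loss / max(count, 1)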

Exploring Regularizations with Face, Body and Image Cues for Group Cohesion Prediction

This paper presents our approach for the group cohesion prediction sub-challenge in EmotiW 2019. The task is to predict group cohesiveness in images. We mainly explore several regularizations with three types of visual cues, namely the face, the body, and the global image. Our main contribution is two-fold. First, we jointly train the group cohesion prediction task and the group emotion recognition task using a multi-task learning strategy with all visual cues. Second, we carefully design two regularizations, a rank loss and an hourglass loss, where the former gives a margin between the distances of distant and near categories and the latter avoids the centralized predictions that arise when using only an MSE loss. With careful evaluation, we finally achieve second place in this sub-challenge with an MSE of 0.43821 on the testing set. https://github.com/DaleAG/Group_Cohesion_Prediction

Exploring Emotion Features and Fusion Strategies for Audio-Video Emotion Recognition

Audio-video based emotion recognition aims to classify a given video into basic emotions. In this paper, we describe our approaches in EmotiW 2019, which mainly explore emotion features and feature fusion strategies for the audio and visual modalities. For emotion features, we explore audio features based on both speech spectrograms and log Mel-spectrograms, and we evaluate several facial features with different CNN models and different emotion pretraining strategies. For fusion strategies, we explore intra-modal and cross-modal fusion methods, such as designing attention mechanisms to highlight important emotion features, and exploring feature concatenation and factorized bilinear pooling (FBP) for cross-modal feature fusion. With careful evaluation, we obtain 65.5% on the AFEW validation set and 62.48% on the test set, ranking third in the challenge.
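Factorized bilinear pooling is referenced without implementation details; the following is a minimal, generic FBP sketch for fusing one audio and one visual feature vector, with dimensions and the pooling factor k chosen arbitrarily for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FactorizedBilinearPooling(nn.Module):
        def __init__(self, audio_dim, visual_dim, out_dim, k=4):
            super().__init__()
            self.k = k
            self.proj_audio = nn.Linear(audio_dim, out_dim * k)
            self.proj_visual = nn.Linear(visual_dim, out_dim * k)

        def forward(self, audio_feat, visual_feat):
            # Element-wise interaction of the two projected modalities.
            joint = self.proj_audio(audio_feat) * self.proj_visual(visual_feat)
            # Sum-pool groups of k elements, then apply power and l2 normalization.
            joint = joint.view(joint.size(0), -1, self.k).sum(dim=-1)
            joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-12)
            return F.normalize(joint, dim=-1)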

Engagement Intensity Prediction with Facial Behavior Features

This paper describes an approach for the engagement prediction task, a sub-challenge of the 7th Emotion Recognition in the Wild Challenge (EmotiW 2019). Our method involves three fundamental steps: feature extraction, regression, and model ensembling. In the first step, an input video is divided into multiple overlapping segments (instances) and features are extracted for each instance. Combinations of long short-term memory (LSTM) and fully connected layers are then deployed to capture the temporal information and regress the engagement intensity from the extracted features. In the last step, we perform fusion to achieve better performance. Finally, our approach achieves a mean square error of 0.0597, which is 4.63% lower than the best result from last year.

Group-level Cohesion Prediction using Deep Learning Models with A Multi-stream Hybrid Network

In this paper, we propose a hybrid deep learning network for predicting group cohesion in images. This is a regression problem whose objective is to predict the Group Cohesion Score (GCS), which lies in the range [0, 3]. To solve this problem, we exploit four types of visual cues, namely the scene, skeleton, UV coordinates, and face images, along with state-of-the-art convolutional neural networks (CNNs). We use not only fusion but also ensemble methods to combine these approaches. Our proposed hybrid network achieves mean square errors (MSEs) of 0.517 and 0.416 on the validation and testing sets, respectively. We finally achieved first place in the Group-level Cohesion Sub-challenge (GC) of EmotiW 2019.

Automatic Group Cohesiveness Detection With Multi-modal Features

Group cohesiveness is a compelling and often studied construct in group dynamics and group performance. The enormous number of web images of groups of people can be used to develop an effective method for detecting group cohesiveness. This paper introduces an automatic group cohesiveness prediction method for the 7th Emotion Recognition in the Wild (EmotiW 2019) Grand Challenge in the category of Group-based Cohesion Prediction. The task is to predict the cohesion level of a group of people in images. To tackle this problem, we propose a hybrid network comprising regression models that are separately trained on face features, skeleton features, and scene features. The predicted regression values corresponding to each feature are fused to obtain the final cohesion intensity. Experimental results demonstrate that the proposed hybrid network is effective and makes promising improvements, achieving a mean squared error (MSE) of 0.444 on the testing set, which outperforms the baseline MSE of 0.5.

Multi-feature and Multi-instance Learning with Anti-overfitting Strategy for Engagement Intensity Prediction

This paper proposes a novel engagement intensity prediction approach, which was applied in the EmotiW 2019 Challenge and achieved good performance. The task is to predict the engagement level of a student watching an educational video under diverse conditions and in various environments. Assuming that engagement intensity is strongly correlated with facial movements, upper-body posture movements, and overall environmental movements within a time interval, we extract these motion features and feed them into a deep regression model consisting of a combination of LSTM, gated recurrent unit (GRU), and fully connected layers. To predict the engagement level precisely and robustly in long videos with varied situations such as darkness and complex backgrounds, a multi-feature engineering method is used to extract synchronized multi-modal features over a period of time, considering both short-term and long-term dependencies. Based on the well-processed features, we propose a strategy that maximizes validation accuracy to select the best models across all model configurations. Furthermore, to avoid the overfitting caused by the extremely small dataset, we apply a single Bi-LSTM layer with only 16 units and split the engagement dataset (train + validation) with 5-fold cross-validation (stratified k-fold) to train a conservative model. By ensembling the above models, our method finally wins second place in the challenge with an MSE of 0.06174 on the testing set.

Bi-modality Fusion for Emotion Recognition in the Wild

Emotion recognition in the wild has been a hot research topic in the field of affective computing. Though some progress has been achieved, emotion recognition in the wild remains an unsolved problem due to the challenges of head movement, face deformation, illumination variation, etc. To deal with these unconstrained challenges, we propose a bi-modality fusion method for video-based emotion recognition in the wild. The proposed framework takes advantage of the visual information from facial expression sequences and the speech information from audio. State-of-the-art CNN-based object recognition models are employed to improve facial expression recognition performance. A bi-directional long short-term memory (Bi-LSTM) network is employed to capture the dynamics of the learned features. Additionally, to take full advantage of the facial expression information, a VGG16 network is trained on the AffectNet dataset to learn a specialized facial expression recognition model. On the other hand, audio-based features, such as low-level descriptors (LLDs) and deep features obtained from spectrogram images, are also developed to improve emotion recognition performance. The best experimental result shows that the overall accuracy of our algorithm on the test dataset of the EmotiW challenge is 62.78%, which outperforms the best result of EmotiW 2018 and ranks 2nd in the EmotiW 2019 challenge.

Multi-Attention Fusion Network for Video-based Emotion Recognition

Humans routinely pay attention to important emotion information from the visual and audio modalities without considering multimodal alignment issues, and recognize emotions by integrating important multimodal information over a certain interval. In this paper, we propose a multi-attention fusion network (MAFN) with the goal of improving emotion recognition performance by modeling human emotion recognition mechanisms. MAFN consists of two types of attention mechanisms: an intra-modality attention mechanism that dynamically extracts representative emotion features from single-modality frame sequences, and an inter-modality attention mechanism that automatically highlights specific modal features based on their importance. In addition, we define a multimodal domain adaptation method that has a positive effect on capturing interactions between modalities. MAFN achieves 58.65% recognition accuracy on the AFEW testing set, a significant improvement over the baseline of 41.07%.