Human verbal and nonverbal expressions carry crucial information not only about intent but also about emotions, individual identity, and the state of health and wellbeing. From a basic science perspective, understanding how such rich information is encoded in these signals can illuminate the underlying production mechanisms, including their variability within and across individuals. From a technology perspective, finding ways to automatically process and decode this complex information continues to be of interest across a variety of applications.
The convergence of sensing, communication and computing technologies is allowing access to data, in diverse forms and modalities, in ways that were unimaginable even a few years ago. These include data that afford the multimodal analysis and interpretation of the generation of human expressions.
The first part of the talk will highlight advances that allow us to perform investigations on the dynamics of vocal production using real-time imaging and audio modeling to offer insights about how we produce speech and song with the vocal instrument. The second part of the talk will focus on the production of vocal expressions in conjunction with other signals from the face and body especially in encoding affect. The talk will draw data from various domains notably in health to illustrate some of the applications.
How can we create technologies to help us reflect on and change our behavior, improving our health and overall wellbeing? In this talk, I will briefly describe the last several years of work our research team has been doing in this area. We have developed wearable technology to help families manage tense situations with their children, mobile phone-based applications for handling stress and depression, as well as logging tools that can help you stay focused or recommend good times to take a break at work. The overarching goal in all of this research is to develop tools that adapt to the user, so that users can maximize their productivity and improve their health.
What are facial expressions for? In social-functional accounts, they are efficient adaptations that are used flexibly to address the problems inherent to successful social living. Facial expressions both broadcast emotions and regulate the emotions of perceivers. Research from my laboratory focuses on the human smile and demonstrates how this very nuanced display varies in its physical form in order to solve three basic social challenges: rewarding others, signaling non-threat, and negotiating social hierarchies. We mathematically modeled the dynamic facial-expression patterns of reward, affiliation, and dominance smiles using a data-driven approach that combined a dynamic facial expression generator with methods of reverse correlation. The resulting models were validated using human-perceiver and Bayesian classifiers. Human smile stimuli were also developed and validated in studies in which distinct effects of the smiles on physiological and hormonal processes were observed. The social-function account is extended to the acoustic form of laughter and is used to address questions about cross-cultural differences in emotional expression.
Humans interact with the world using five major senses: sight, hearing, touch, smell, and taste. Almost all interaction with the environment is naturally multimodal, as audio, tactile, or paralinguistic cues provide confirmation for physical actions and spoken language interaction. Multimodal interaction seeks to fully exploit these parallel channels for perception and action to provide robust, natural interaction. Richard Bolt's "Put That There" (1980) provided an early paradigm that demonstrated the power of multimodality and helped attract researchers from a variety of disciplines to study a new approach for post-WIMP computing that moves beyond desktop graphical user interfaces (GUIs). In this talk, I will look back to the origins of the scientific community of multimodal interaction, and review some of the more salient results that have emerged over the last 20 years, including results in machine perception, system architectures, visualization, and computer-to-human communication. Recently, a number of game-changing technologies such as deep learning, cloud computing, and planetary-scale data collection have emerged to provide robust solutions to historically hard problems. As a result, scientific understanding of multimodal interaction has taken on new relevance as construction of practical systems has become feasible. I will discuss the impact of these new technologies and the opportunities and challenges that they raise. I will conclude with a discussion of the importance of convergence with cognitive science and cognitive systems to provide foundations for intelligent, human-centered interactive systems that learn and fully understand humans and human-to-human social interaction, in order to provide services that surpass the abilities of the most intelligent human servants.
We present dialogue management routines for a system to engage in multiparty agent-infant interaction. The ultimate purpose of this research is to help infants learn a visual sign language by engaging them in naturalistic and socially contingent conversations during an early-life critical period for language development (ages 6 to 12 months) as initiated by an artificial agent. As a first step, we focus on creating and maintaining agent-infant engagement that elicits appropriate and socially contingent responses from the baby. Our system includes two agents, a physical robot and an animated virtual human. The system's multimodal perception includes an eye-tracker (measures attention) and a thermal infrared imaging camera (measures patterns of emotional arousal). A dialogue policy is presented that selects individual actions and planned multiparty sequences based on perceptual inputs about the baby's internal changing states of emotional engagement. The present version of the system was evaluated in interaction with 8 babies. All babies demonstrated spontaneous and sustained engagement with the agents for several minutes, with patterns of conversationally relevant and socially contingent behaviors. We further performed a detailed case-study analysis with annotation of all agent and baby behaviors. Results show that the baby's behaviors were generally relevant to agent conversations and contained direct evidence for socially contingent responses by the baby to specific linguistic samples produced by the avatar. This work demonstrates the potential for language learning from agents in very young babies and has especially broad implications regarding the use of artificial agents with babies who have minimal language exposure in early life.
We address the problem of automatically predicting group performance on a task, using multimodal features derived from the group conversation. These include acoustic features extracted from the speech signal, and linguistic features derived from the conversation transcripts. Because much work on social signal processing has focused on nonverbal features such as voice prosody and gestures, we explicitly investigate whether features of linguistic content are useful for predicting group performance. The conclusion is that the best-performing models utilize both linguistic and acoustic features, and that linguistic features alone can also yield good performance on this task. Because there is a relatively small amount of task data available, we present experimental approaches using domain adaptation and a simple data augmentation method, both of which yield drastic improvements in predictive performance, compared with a target-only model.
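The "simple data augmentation method" referred to above is not specified here; one common approach when task data are scarce is to add noise-perturbed copies of the training examples. The sketch below is an illustration under that assumption, not the paper's actual method:

```python
import random

def augment_with_jitter(features, labels, n_copies=3, sigma=0.05, seed=0):
    """Expand a small feature set by appending Gaussian-jittered copies.

    features: list of numeric feature vectors (e.g., acoustic statistics)
    labels:   parallel list of labels (e.g., group performance scores)
    Returns the augmented (features, labels), originals first.
    """
    rng = random.Random(seed)
    aug_x = list(features)
    aug_y = list(labels)
    for x, y in zip(features, labels):
        for _ in range(n_copies):
            aug_x.append([v + rng.gauss(0.0, sigma) for v in x])
            aug_y.append(y)  # label assumed unchanged by a small perturbation
    return aug_x, aug_y

x, y = augment_with_jitter([[1.0, 2.0], [3.0, 4.0]], ["hi", "lo"])
print(len(x))  # 2 originals + 2*3 jittered copies = 8
```

The jitter scale `sigma` and the copy count are hypothetical tuning knobs; in practice they would be chosen on a validation set.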
We model coordination and coregulation patterns in 33 triads engaged in collaboratively solving a challenging computer programming task for approximately 20 minutes. Our goal is to prospectively model speech rate (words/sec) - an important signal of turn taking and active participation - of one teammate (A or B or C) from time lagged nonverbal signals (speech rate and acoustic-prosodic features) of the other two (i.e., A + B → C; A + C → B; B + C → A) and task-related context features. We trained feed-forward neural networks (FFNNs) and long short-term memory recurrent neural networks (LSTMs) using group-level nested cross-validation. LSTMs outperformed FFNNs and a chance baseline and could predict speech rate up to 6s into the future. A multimodal combination of speech rate, acoustic-prosodic, and task context features outperformed unimodal and bimodal signals. The extent to which the models could predict an individual's speech rate was positively related to that individual's scores on a subsequent posttest, suggesting a link between coordination/coregulation and collaborative learning outcomes. We discuss applications of the models for real-time systems that monitor the collaborative process and intervene to promote positive collaborative outcomes.
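The prospective setup described above, predicting one teammate's future speech rate from the other two teammates' time-lagged signals, can be made concrete by how training pairs are constructed. A minimal sketch (window size and feature set are assumptions; the study also uses acoustic-prosodic and task context features):

```python
def make_lagged_examples(a_rate, b_rate, c_rate, lag=3):
    """Pair A's and B's signals at window t with C's speech rate at
    t + lag, the prospective prediction target (A + B -> C).

    a_rate, b_rate, c_rate: equal-length per-window series (words/sec)
    lag: number of windows ahead to predict (e.g., 3 windows would be
         6 s if each window spans 2 s -- an assumed window size)
    """
    examples = []
    for t in range(len(c_rate) - lag):
        x = [a_rate[t], b_rate[t]]   # time-lagged signals of the other two
        y = c_rate[t + lag]          # future target speech rate
        examples.append((x, y))
    return examples

ex = make_lagged_examples([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [0, 0, 1, 1, 2], lag=3)
print(ex)  # [([1, 5], 1), ([2, 4], 2)]
```

Examples of this shape would then be fed to the FFNN or LSTM models the abstract compares.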
We explored gaze behavior towards the end of utterances and dialogue acts (DAs), i.e., verbal-behavior information indicating the intention of an utterance, during turn-keeping/changing to estimate empathy skill levels in multiparty discussions. To our knowledge, this is the first attempt to explore the relationship between such a combination of cues and empathy skill. First, we collected data on Davis' Interpersonal Reactivity Index (which measures empathy skill level), utterances that include the DA categories of Provision, Self-disclosure, Empathy, Turn-yielding, and Others, and gaze behavior from participants in four-person discussions. The results of the analysis indicate that the gaze behavior accompanying utterances that include these DA categories during turn-keeping/changing differs in accordance with people's empathy skill levels. The most noteworthy result was that speakers with low empathy skill levels tend to avoid making eye contact with the listener when the DA category is Self-disclosure during turn-keeping. However, they tend to maintain eye contact when the DA category is Empathy. A listener who has a high empathy skill level often looks away from the speaker during turn-changing when the DA category of a speaker's utterance is Provision or Empathy. There was no difference in gaze behavior between empathy skill levels when the DA category of the speaker's utterance was Turn-yielding. From these findings, we constructed and evaluated models for estimating empathy skill level using gaze behavior and DA information. The evaluation results indicate that using both gaze behavior and DA during turn-keeping/changing is effective for estimating an individual's empathy skill level in multiparty discussions.
Automated measurement of affective behavior in psychopathology has been limited primarily to screening and diagnosis. While useful, clinicians are more often concerned with whether patients are improving in response to treatment: are symptoms abating, is affect becoming more positive, are unanticipated side effects emerging? When treatment includes neural implants, the need for objective, repeatable biometrics tied to neurophysiology becomes especially pressing. We used automated face analysis to assess treatment response to deep brain stimulation (DBS) in two patients with intractable obsessive-compulsive disorder (OCD). One was assessed intraoperatively following implantation and activation of the DBS device; the other was assessed three months post-implantation. Both were assessed during DBS-on and DBS-off conditions. Positive and negative valence were quantified using a CNN trained on normative data from 160 non-OCD participants; a secondary goal was thus domain transfer of the classifiers. In both contexts, DBS-on resulted in marked positive affect. In response to DBS-off, affect flattened in both contexts and alternated with increased negative affect in the outpatient setting. Mean AUC for domain transfer was 0.87. These findings suggest that parametric variation of DBS is strongly related to affective behavior and may introduce vulnerability to negative affect in the event that DBS is discontinued.
Smell is a powerful tool for conveying and recalling information without requiring visual attention. Previous work identified, however, some challenges caused by users' unfamiliarity with this modality and by the complexity of scent delivery. We are now able to overcome these challenges by introducing a training approach that familiarises users with scent-meaning associations (the urgency of a message, and sender identity) and by using a controllable device for scent delivery. Here we re-validate the effectiveness of smell as a notification modality and present findings on its performance in conveying information. In a user study composed of two sessions, we compared the effectiveness of visual, olfactory, and combined visual-olfactory notifications in a messaging application. We demonstrated that olfactory notifications improve users' confidence and performance in identifying the urgency level of a message, with the same reaction time and disruption levels as visual notifications. We discuss the design implications and opportunities for future work in the domain of multimodal interactions.
Automatic emotion recognition has long been developed by concentrating on modeling human expressive behavior. At the same time, neuroscientific evidence has shown that varied neural responses (i.e., blood oxygen level-dependent (BOLD) signals measured by functional magnetic resonance imaging (fMRI)) are also a function of the type of emotion perceived. While past research has indicated that fusing acoustic features and fMRI improves overall speech emotion recognition performance, obtaining fMRI data is not feasible in real-world applications. In this work, we propose a cross-modality adversarial network that jointly models the bi-directional generative relationship between the acoustic features of speech samples and the fMRI signals of human perceptual responses by leveraging a parallel dataset. We encode the acoustic descriptors of a speech sample using the learned cross-modality adversarial network to generate fMRI-enriched acoustic vectors to be used in the emotion classifier. The generated fMRI-enriched acoustic vectors are evaluated not only on the parallel dataset but also on an additional dataset without fMRI scanning. Our proposed framework significantly outperforms using acoustic features only in a four-class emotion recognition task on both datasets, and the use of a cyclic loss in learning the bi-directional mapping is also demonstrated to be crucial to achieving improved recognition rates.
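The cyclic loss mentioned above follows the usual cycle-consistency idea: mapping a sample into the other modality and back should reconstruct the original. A toy sketch with invertible stand-in generators (illustrative only, not the paper's networks):

```python
def cycle_loss(x, g_ab, g_ba):
    """L1 cycle-consistency: acoustic -> fMRI -> acoustic should
    reconstruct the original acoustic vector (and vice versa)."""
    x_rec = g_ba(g_ab(x))
    return sum(abs(a - b) for a, b in zip(x, x_rec)) / len(x)

# Toy pair of maps standing in for the two learned generators.
g_ab = lambda v: [2.0 * u + 1.0 for u in v]     # acoustic -> "fMRI"
g_ba = lambda v: [(u - 1.0) / 2.0 for u in v]   # "fMRI" -> acoustic

print(cycle_loss([0.5, -1.0, 2.0], g_ab, g_ba))  # 0.0 -- perfect reconstruction
```

In training, this term would be added to the adversarial objectives so that the bi-directional mapping stays mutually consistent.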
Despite their great potential, Massive Open Online Courses (MOOCs) face major challenges such as low retention rates, limited feedback, and lack of personalization. In this paper, we report the results of a longitudinal study of AttentiveReview2, a multimodal intelligent tutoring system optimized for MOOC learning on unmodified mobile devices. AttentiveReview2 continuously monitors learners' physiological signals, facial expressions, and touch interactions during learning and recommends personalized review materials by predicting each learner's perceived difficulty for each learning topic. In a 3-week study involving 28 learners, we found that AttentiveReview2 on average improved learning gains by 21.8% in weekly tests. Follow-up analysis shows that the multimodal signals collected from the learning process can also benefit instructors by providing rich and fine-grained insights into learning progress. Taking advantage of such signals also improves accuracy in predicting emotions and test scores compared with clickstream analysis.
The aim was to study whether odors released by an olfactory display prototype can affect participants' cognitive and emotion-related responses to audio-visual stimuli, and whether the display can benefit from objective measurement of the odors. The results showed that odors and videos had significant effects on participants' responses. For instance, odors increased pleasantness ratings, especially when the odor was authentic and the video was congruent with the odors. The objective measurement of the odors was shown to be useful: the measurement data were classified with 100% accuracy, removing the need to speculate about whether the odor presentation apparatus is working properly.
The task of identifying when to take a conversational turn is an important function of spoken dialogue systems. The turn-taking system should also ideally be able to handle many types of dialogue, from structured conversation to spontaneous and unstructured discourse. Our goal is to determine how much a generalized model trained on many types of dialogue scenarios would improve on a model trained only for a specific scenario. To achieve this goal we created a large corpus of Wizard-of-Oz conversation data which consisted of several different types of dialogue sessions, and then compared a generalized model with scenario-specific models. For our evaluation we go further than simply reporting conventional metrics, which we show are not informative enough to evaluate turn-taking in a real-time system. Instead, we process results using a performance curve of latency and false cut-in rate, and further improve our model's real-time performance using a finite-state turn-taking machine. Our results show that the generalized model greatly outperformed the individual model for attentive listening scenarios but was worse in job interview scenarios. This implies that a model based on a large corpus is better suited to conversation which is more user-initiated and unstructured. We also propose that our method of evaluation leads to more informative performance metrics in a real-time system.
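A finite-state turn-taking machine of the kind mentioned above can be sketched as a small automaton that trades latency against false cut-ins by requiring the take-turn decision to stay stable over several frames. This is an illustrative simplification under assumed thresholds, not the paper's exact machine:

```python
class TurnTakingFSM:
    """Minimal finite-state turn-taking machine: the system takes the
    turn only after the take-turn probability stays above `threshold`
    for `min_frames` consecutive frames. A higher `min_frames` lowers
    the false cut-in rate at the cost of added latency."""

    def __init__(self, threshold=0.7, min_frames=3):
        self.threshold = threshold
        self.min_frames = min_frames
        self.state = "LISTEN"
        self.count = 0

    def step(self, p_take_turn):
        if self.state == "LISTEN":
            # Reset the stability counter whenever confidence drops.
            self.count = self.count + 1 if p_take_turn >= self.threshold else 0
            if self.count >= self.min_frames:
                self.state = "SPEAK"
        return self.state

fsm = TurnTakingFSM()
states = [fsm.step(p) for p in [0.9, 0.5, 0.8, 0.8, 0.8]]
print(states)  # ['LISTEN', 'LISTEN', 'LISTEN', 'LISTEN', 'SPEAK']
```

Sweeping `threshold` and `min_frames` is one way to trace out the latency / false cut-in performance curve the abstract advocates.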
This paper presents a summary and critical reflection on ten major opportunities and challenges for advancing the field of multimodal learning analytics (MLA). It identifies emerging technology trends likely to disrupt learning analytics, challenges involved in forging viable participatory design partnerships, and impending issues associated with the control of data and privacy. Trends in health care analytics provide one attractive model for how new infrastructure can enable the collection of larger-scale and more diverse datasets, and how end-user analytics can be designed to empower individuals and expand market adoption.
Digital home assistants have an increasing influence on our everyday lives. The media now reports how children adopt the resulting imperious language style when talking to real people. In response to this behavior, we considered a digital assistant that rebukes impolite language. We then investigated how adult users react when rebuked by the AI. In a between-group study (N = 20), participants were rebuffed by our fictional speech assistant "Eliza" when they made impolite requests. As a result, we observed more polite behavior: most test subjects accepted the AI's demand and said "please" significantly more often. However, many participants retrospectively denied Eliza the entitlement to politeness and criticized her attitude or refusal of service.
This article tackles the detection of a user's likes and dislikes in a negotiation with a virtual agent, with the aim of helping build a model of the user's preferences. We introduce a linguistic model of likes and dislikes as they are expressed in a negotiation context. The identification of syntactic and semantic features enables the design of formal grammars embedded in a bottom-up, rule-based system. It deals with conversational context by considering simple and collaborative likes and dislikes within adjacency pairs. We present the annotation campaign we conducted by recruiting annotators on CrowdFlower and using a dedicated annotation platform. Finally, we measure agreement between our system and the human reference; the obtained scores show substantial agreement.
Automatic speech recognition can potentially benefit from lip motion patterns, which complement acoustic speech to improve overall recognition performance, particularly in noise. In this paper we propose an audio-visual fusion strategy that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to enhanced representations that increase recognition accuracy in both clean and noisy conditions. We test our strategy on the TCD-TIMIT and LRS2 datasets, designed for large-vocabulary continuous speech recognition, applying three types of noise at different power ratios. We also exploit state-of-the-art sequence-to-sequence architectures, showing that our method can be easily integrated. Results show relative improvements from 7% up to 30% on TCD-TIMIT over the acoustic modality alone, depending on the acoustic noise level. We anticipate that the fusion strategy can easily generalise to many other multimodal tasks involving correlated modalities.
Body posture is a good indicator of, amongst other things, people's state of arousal, focus of attention and level of interest in a conversation. Posture is conventionally measured by observation and hand coding of videos or, more recently, through automated computer vision and motion capture techniques. Here we introduce a novel alternative approach exploiting a new modality: posture classification using bespoke 'smart' trousers with integrated textile pressure sensors. Changes in posture translate to changes in pressure patterns across the surface of our clothing. We describe the construction of the textile pressure sensor and, using simple machine learning techniques on data gathered from 10 participants, demonstrate its ability to discriminate between 19 different basic posture types with high accuracy. This technology has the potential to support anonymous, unintrusive sensing of interest, attention and engagement in a wide variety of settings.
Nearest-neighbor classifiers recognize stroke gestures by computing a (dis)similarity between a candidate gesture and a training set based on points, which may require normalization, resampling, and rotation to a reference before processing. To eliminate this expensive preprocessing, this paper introduces a vector-between-vectors recognition approach in which a gesture is defined by a vector based on geometric algebra, and recognition is performed by computing a novel Local Shape Distance (LSD) between vectors. We mathematically prove that LSD is position-, scale-, and rotation-invariant, thus eliminating the preprocessing. To demonstrate the viability of this approach, we instantiate LSD for n=2 and compare !FTL, a 2D stroke-gesture recognizer, with $1 and $P, two state-of-the-art gesture recognizers, on a gesture set typically used for benchmarking. !FTL achieves a recognition rate similar to $P's, but with a significantly smaller execution time and lower algorithmic complexity.
Combining mid-air gestures with pen input for bi-manual input on tablets has been reported as an alternative and attractive input technique in drawing applications. Previous work has also argued that mid-air gestural input can cause discomfort and arm fatigue over time, which can be addressed in a desktop setting by allowing users to gesture in alternative restful arm positions (e.g., elbow resting on the desk). However, it is unclear if and how gesture preferences and gesture designs differ across alternative arm positions. To investigate these research questions, we report on a user- and choice-based gesture elicitation study in which 10 participants designed gestures for different arm positions. We provide an in-depth qualitative analysis and detailed categorization of gestures, discussing commonalities and differences in the gesture sets based on a "think aloud" protocol, video recordings, and self-reports on user preferences.
During medical interventions, direct interaction with medical image data is a cumbersome task for physicians due to the sterile environment. Even though touchless input via hand, foot, or voice is possible, these modalities are not always available for these tasks. Therefore, we investigated touchless input methods as alternatives to one another, focusing on two common interaction tasks in sterile settings: activation of a system to avoid unintentional input, and manipulation of continuous values. We created a system in which activation could be achieved via voice, hand, or foot gestures, and continuous manipulation via hand and foot gestures. We conducted a comparative user study and found that foot interaction performed best in terms of task completion times and scored highest on the subjectively assessed measures of usability and usefulness. Usability and usefulness scores for hand and voice were only slightly worse, and all participants were able to perform all tasks in a sufficiently short amount of time. This work contributes by proposing methods for interacting with computers in sterile, dynamic environments and by providing evaluation results for direct comparison of alternative modalities for common interaction tasks.
A shared sense of humor can result in positive feelings associated with amusement, laughter, and moments of bonding. If robotic companions could acquire their human counterparts' sense of humor in an unobtrusive manner, they could improve their skills of engagement. To explore this assumption, we have developed a dynamic user modeling approach based on Reinforcement Learning, which allows a robot to analyze a person's reaction while it tells jokes and to continuously adapt its sense of humor. We evaluated our approach in a test scenario with a Reeti robot acting as an entertainer and telling different types of jokes. The adaptation process relies only on the audience's vocal laughs and visual smiles, with no other form of explicit feedback. We report on the results of a user study with 24 participants, comparing our approach to a baseline condition (a non-learning version of the robot), and conclude by discussing the limitations and implications of our approach in detail.
Small group interaction occurs often in workplace and education settings, and its dynamic progression is an essential factor in dictating final group performance outcomes. The personality of each individual within the group is reflected in his/her interpersonal behaviors with other members of the group as they engage in these task-oriented interactions. In this work, we propose an interlocutor-modulated attention BLSTM (IM-aBLSTM) architecture that models an individual's vocal behaviors during small group interactions in order to automatically infer his/her personality traits. The interlocutor-modulated attention mechanism jointly attends to the relevant interpersonal vocal behaviors of the other members of the group during interactions. Specifically, we evaluate our proposed IM-aBLSTM on one of the largest small group interaction databases, the ELEA corpus. Our framework achieves a promising unweighted recall accuracy of 87.9% across ten different binary personality trait prediction tasks, outperforming the best results previously reported on the same database by 10.4% absolute. Finally, by analyzing the interpersonal vocal behaviors in the regions of high attention weights, we observe several distinct intra- and inter-personal vocal behavior patterns that vary as a function of personality traits.
Psychotic disorders are forms of severe mental illness characterized by abnormal social function and a general sense of disconnect with reality. The evaluation of such disorders is often complex, as their multifaceted nature is often difficult to quantify. Multimodal behavior analysis technologies have the potential to help address this need and supply timelier and more objective decision support tools in clinical settings. While written language and nonverbal behaviors have been previously studied, the present analysis takes the novel approach of examining the rarely-studied modality of spoken language of individuals with psychosis as naturally used in social, face-to-face interactions. Our analyses expose a series of language markers associated with psychotic symptom severity, as well as interesting interactions between them. In particular, we examine three facets of spoken language: (1) lexical markers, through a study of the function of words; (2) structural markers, through a study of grammatical fluency; and (3) disfluency markers, through a study of dialogue self-repair. Additionally, we develop predictive models of psychotic symptom severity, which achieve significant predictive power on both positive and negative psychotic symptom scales. These results constitute a significant step toward the design of future multimodal clinical decision support tools for computational phenotyping of mental illness.
Constructing computational models of interactions during Forensic Interviews (FI) with children presents a unique challenge in being able to maximize complete and accurate information disclosure, while minimizing emotional trauma experienced by the child. Leveraging multiple channels of observational signals, dynamical system modeling is employed to track and identify patterns in the influence interviewers' linguistic and paralinguistic behavior has on children's verbal recall productivity. Specifically, linear mixed effects modeling and dynamical mode decomposition allow for robust analysis of acoustic-prosodic features, aligned with lexical features at turn-level utterances. By varying the window length, the model parameters evaluate both interviewer and child behaviors at different temporal resolutions, thus capturing both rapport-building and disclosure phases of FI. Making use of a recently proposed definition of productivity, the dynamic systems modeling provides insight into the characteristics of interaction that are most relevant to effectively eliciting narrative and task-relevant information from a child.
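Dynamic mode decomposition, as used above, fits a linear operator to successive snapshots of a multivariate series and summarizes the dynamics through that operator's eigenvalues and modes. Below is a standard exact-DMD sketch (the generic formulation, not the paper's specific pipeline), applied to toy data generated by a known linear map so the recovered eigenvalues can be checked:

```python
import numpy as np

def dmd_modes(X, r=2):
    """Exact DMD: from snapshot pairs x_{t+1} ~ A x_t, build the
    rank-r projected operator and return its eigenvalues and modes.

    X: (features, time) matrix, e.g., turn-level behavioral features
    r: SVD truncation rank
    """
    X1, X2 = X[:, :-1], X[:, 1:]                    # snapshot pairs
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r]              # rank-r truncation
    A_tilde = U.conj().T @ X2 @ Vh.conj().T @ np.diag(1.0 / s)
    eigvals, W = np.linalg.eig(A_tilde)
    modes = X2 @ Vh.conj().T @ np.diag(1.0 / s) @ W
    return eigvals, modes

# Toy series driven by a known linear map: DMD recovers its eigenvalues.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
X = np.empty((2, 20)); X[:, 0] = [1.0, 1.0]
for t in range(19):
    X[:, t + 1] = A @ X[:, t]
lams, _ = dmd_modes(X)
print(np.sort(lams.real).round(3))  # [0.8 0.9]
```

In the interview setting, varying the window of snapshots fed to such a decomposition is what lets the model examine interviewer-child dynamics at different temporal resolutions.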
In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which modalities should be modeled. We design a multiscale RNN architecture to model modalities at separate timescales in a continuous manner. Our results show that modeling linguistic and acoustic features at separate temporal rates can be beneficial for turn-taking modeling. We also show that our approach can be used to incorporate gaze features into turn-taking models.
Convolutional neural networks (CNNs) are employed to estimate the visual focus of attention (VFoA), also called gaze direction, in multiparty face-to-face meetings on the basis of multimodal nonverbal behaviors, including head pose, eyeball direction, and the presence/absence of utterance. To reveal the potential of CNNs, we focus on aspects of multimodal and multiparty fusion, including individual/group models, early/late fusion, and robustness when using inputs from image-based trackers. In contrast to the individual model, which separately targets each person specific to one's seat, the group model aims to jointly estimate the gaze directions of all participants. Experiments confirmed that the group model outperformed the individual model, especially in predicting listeners' VFoA when the inputs did not include eyeball directions. This result indicates that the group CNN model can implicitly learn underlying conversation structures, e.g., that listeners' gazes converge on the speaker. When the eyeball direction feature is available, both models outperformed the Bayes models used for comparison; in this case, the individual model was superior to the group model, particularly in estimating the speaker's VFoA. Moreover, it was revealed that in group models, two-stage late fusion, which integrates individual features first and multiparty features second, outperformed other structures. Furthermore, our experiment confirmed that image-based tracking can provide a level of performance comparable to that of sensor-based measurements. Overall, the results suggest that CNNs are a promising approach for VFoA estimation.
In this paper we focus on detection of deception and suspicion from electrodermal activity (EDA) measured on the left and right wrists during a dyadic game interaction. We aim to answer three research questions: (i) Is it possible to reliably distinguish deception from truth based on EDA measurements during a dyadic game interaction? (ii) Is it possible to reliably distinguish the state of suspicion from trust based on EDA measurements during a card game? (iii) What is the relative importance of EDA measured on the left and right wrists? To answer these questions, we conducted a study in which 20 participants played the game Cheat in pairs with one EDA sensor placed on each of their wrists. Our experimental results show that EDA measures from the left and right wrists provide more information for suspicion detection than for deception detection, and that person-dependent detection is more reliable than person-independent detection. In particular, classifying the EDA signal with a Support Vector Machine (SVM) yields accuracies of 52% and 57% for person-independent prediction of deception and suspicion respectively, and 63% and 76% for person-dependent prediction of deception and suspicion respectively. Also, we found that: (i) the optimal interval of informative EDA signal is about 1 s for deception detection and around 3.5 s for suspicion detection; (ii) the EDA signal relevant for deception/suspicion detection can be captured around 3.0 s after a stimulus occurrence, regardless of the stimulus type (deception/truthfulness/suspicion/trust); and (iii) features extracted from EDA of both wrists are important for classification of both deception and suspicion. To the best of our knowledge, this is the first work that uses EDA data to automatically detect both deception and suspicion in a dyadic game interaction setting.
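The windowing findings above (a ~3.0 s post-stimulus latency, then a ~1 s window for deception or ~3.5 s for suspicion) translate into a simple feature-extraction step. The sketch below illustrates that step under stated assumptions: the feature set (mean, range, slope) is illustrative, not the paper's exact set, and the toy signal and sampling rate are invented.

```python
# Sketch of extracting window-level EDA features after a stimulus, mirroring
# the finding that the informative signal starts ~3.0 s post-stimulus and
# spans ~1 s (deception) or ~3.5 s (suspicion). The features chosen here
# (mean, range, slope) are illustrative, not the paper's exact set.

def eda_window_features(signal, fs, onset_s, delay_s, span_s):
    """signal: EDA samples; fs: sampling rate (Hz); onset_s: stimulus time (s);
    delay_s: latency before the informative window; span_s: window length."""
    start = int((onset_s + delay_s) * fs)
    end = start + int(span_s * fs)
    win = signal[start:end]
    mean = sum(win) / len(win)
    rng = max(win) - min(win)
    slope = (win[-1] - win[0]) / span_s
    return {"mean": mean, "range": rng, "slope": slope}

# 10 Hz toy signal, stimulus at t=0; deception window: 3.0 s delay, 1 s span
sig = [0.1 * i for i in range(100)]   # linearly rising toy EDA trace
f = eda_window_features(sig, fs=10, onset_s=0.0, delay_s=3.0, span_s=1.0)
```

In a full pipeline, such window features from both wrists would be concatenated and fed to the SVM classifier.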
Emotion evoked by an advertisement plays a key role in influencing brand recall and eventual consumer choices. Automatic ad affect recognition has several useful applications. However, the use of content-based feature representations does not give insights into how affect is modulated by aspects such as the ad scene setting, salient object attributes and their interactions. Neither do such approaches inform us about how humans prioritize visual information for ad understanding. Our work addresses these lacunae by decomposing video content into detected objects, coarse scene structure, object statistics and actively attended objects identified via eye-gaze. We measure the importance of each of these information channels by systematically incorporating related information into ad affect prediction models. Contrary to the popular notion that ad affect hinges on the narrative and the clever use of linguistic and social cues, we find that actively attended objects and the coarse scene structure better encode affective information as compared to individual scene objects or conspicuous background elements.
Laughter is a highly spontaneous behavior that frequently occurs during social interactions. It serves as an expressive-communicative social signal that conveys a broad spectrum of affective displays. Even though many studies have been performed on the automatic recognition of laughter -- or emotion -- from audiovisual signals, very little is known about the automatic recognition of emotion conveyed by laughter. In this contribution, we provide insights on emotional laughter through extensive evaluations carried out on a corpus of dyadic spontaneous interactions, annotated with dimensional labels of emotion (arousal and valence). We evaluate, via automatic recognition experiments and correlation-based analysis, how different categories of laughter, such as unvoiced laughter, voiced laughter, speech laughter, and speech (non-laughter), can be differentiated from audiovisual features, and to what extent they might convey different emotions. Results show that voiced laughter performed best in the automatic recognition of arousal and valence for both audio and visual features. The context of production is further analysed, and results show that acted and spontaneous expressions of laughter produced by the same person can be differentiated from audiovisual signals, and that multilingual induced expressions can be differentiated from those produced during interactions.
The inherent diversity of human behavior limits the capabilities of general large-scale machine learning systems, which usually require ample amounts of data to provide robust descriptors of the outcomes of interest. Motivated by this challenge, personalized and population-specific models comprise a promising line of work for representing human behavior, since they can make decisions for clusters of people with common characteristics, reducing the amount of data needed for training. We propose a multi-task learning (MTL) framework for developing population-specific models of interpersonal conflict between couples using ambulatory sensor and mobile data from real-life interactions. The criteria for population clustering include global indices related to couples' relationship quality and attachment style, person-specific factors of partners' positivity, negativity, and stress levels, as well as fluctuating factors of daily emotional arousal obtained from acoustic and physiological indices. Population-specific information is incorporated through a MTL feed-forward neural network (FF-NN), whose first layers capture the common information across all data samples, while its last layers are specific to the unique characteristics of each population. Our results indicate that the proposed MTL FF-NN trained solely on the sensor-based acoustic, linguistic, and physiological modalities provides unweighted and weighted F1-scores of 0.51 and 0.75, respectively, outperforming the corresponding baselines of a single general FF-NN trained on the entire dataset and separate FF-NNs trained on each population cluster individually. These results demonstrate the feasibility of such ambulatory systems for detecting real-life behaviors and possibly intervening upon them, and highlight the importance of taking into account the inherent diversity of different populations within the general pool of data.
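The shared-trunk/population-head structure described above can be sketched as a forward pass. This is a minimal, hypothetical illustration: the weights, the 2-D toy input, and the population names ("secure", "anxious") are invented, and scalar matrix arithmetic stands in for the paper's trained FF-NN layers.

```python
# Minimal sketch of the multi-task structure: early layers shared across all
# couples, with a population-specific output head. All weights, inputs, and
# population labels are hypothetical stand-ins for the trained MTL FF-NN.

def relu(x):
    return [max(0.0, v) for v in x]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

SHARED_W = [[0.5, -0.2], [0.1, 0.4]]   # layer shared by every population
HEADS = {                              # one output head per population cluster
    "secure":  [[1.0, 0.0]],
    "anxious": [[0.0, 1.0]],
}

def predict_conflict(x, population):
    h = relu(matvec(SHARED_W, x))               # common representation
    return matvec(HEADS[population], h)[0]      # population-specific score

s = predict_conflict([1.0, 1.0], "secure")
a = predict_conflict([1.0, 1.0], "anxious")
```

The shared layer lets every cluster's data contribute to the common representation, while each head only sees its own cluster, which is what reduces the per-population data requirement.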
Cars provide drivers with task-related information (e.g. "Fill gas") mainly using visual and auditory stimuli. However, those stimuli may distract or overwhelm the driver, causing unnecessary stress. Here, we propose olfactory stimulation as a novel feedback modality to support the perception of visual notifications, reducing the visual demand on the driver. Based on previous research, we explore the application of the scents of lavender, peppermint, and lemon to convey three driving-relevant messages (i.e. "Slow down", "Short inter-vehicle distance", "Lane departure"). Our paper is the first to demonstrate the application of olfactory conditioning in the context of driving and to explore how multiple olfactory notifications change driving behaviour. Our findings demonstrate that olfactory notifications are perceived as less distracting, more comfortable, and more helpful than visual notifications. Drivers also make fewer driving mistakes when exposed to olfactory notifications. We discuss how these findings inform the design of future in-car user interfaces.
In this work we analyze the importance of the lexical and acoustic modalities in behavioral expression and perception. We demonstrate that this importance relates to the amount of therapy, and hence communication training, that a person has received, and that it also exhibits some relationship to gender. We then provide an analysis of couple therapy data by splitting the data into clusters based on gender or stage in therapy. Our analysis demonstrates significant differences in the optimal modality weights across clusters and their relationship to therapy stage. Given this finding, we propose communication-skill-aware fusion models to account for these differences in modality importance. The fusion models operate on partitions of the data according to the gender of the speaker or the therapy stage of the couple. We show that while most multimodal fusion methods can improve the mean absolute error of behavioral estimates, the best results are given by a model that considers the degree of communication training among the interlocutors.
Older adults want to live independently while staying socially active. We conducted a contextual inquiry to understand what usability problems they face while interacting with social media on touch screen devices. We found that it is hard for active older adults to understand and learn mobile social media interfaces due to a lack of support for safe interface exploration and insufficient cognitive affordances. We designed TapTag to enhance learnability for older adults on touch screens. TapTag is an assistive gestural interaction model that utilizes multi-step single taps. TapTag breaks the interaction process into two steps: one gesture to explore the user interface (UI), and a second to activate the functionality of the UI element. We prototyped TapTag as an overlay on top of the Facebook app. We conducted a comparative study in which older adults used the Facebook app with and without TapTag. The results showed that Facebook with TapTag provided a better user experience for older adults in terms of learnability, accessibility, and ease of use.
The user experience (UX) of graphical user interfaces (GUIs) often depends on how clearly visual designs communicate or signify "affordances", such as whether an element on the screen can be pushed, dragged, or rotated. Especially for novice users, figuring out the complexity of a new interface can be cumbersome. In the past era of mouse-based interaction, mouseover effects were successfully utilized to trigger a variety of assistance and help users explore interface elements without causing unintended interactions and associated negative experiences. Today's GUIs are increasingly designed for touch and lack a method similar to mouseover to help (novice) users get acquainted with interface elements. To address this issue, we have studied gazeover, a technique for triggering help or guidance when a user's gaze is over an interactive element, which we believe is suitable for today's touch interfaces. We report on a user study comparing the pragmatic and hedonic qualities of gazeover and mouseover, which showed significantly higher ratings in hedonic quality for the gazeover technique. We conclude by discussing limitations and implications of our findings.
In this paper, we extract features of head pose, eye gaze, and facial expressions from video to estimate individual learners' attentional states in a classroom setting. We concentrate on the analysis of different definitions for a student's attention and show that available generic video processing components and a single video camera are sufficient to estimate the attentional state.
There are many mechanisms to sense arousal. Most of them are intrusive, prone to bias, or costly, require skills to set up, or do not provide additional context to the user's measure of arousal. We present arousal detection through the analysis of pupillary response from eye trackers. Using eye trackers, the user's focal attention can be detected with high fidelity during user interaction in an unobtrusive manner. To evaluate this, we displayed twelve images of varying arousal levels, rated by the International Affective Picture System (IAPS), to 41 participants while they reported their arousal levels. We found a moderate correlation between the self-reported arousal and the algorithm's arousal rating, r(47)=0.46, p<.01. The results show that eye trackers can serve as a multi-sensory device for measuring arousal and relating the level of arousal to the user's focal attention. We anticipate that in the future, high-fidelity web cameras can be used to detect arousal in relation to user attention, to improve usability and UX and to understand visual behaviour.
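The reported validation boils down to a Pearson correlation between self-reported arousal and the pupil-based arousal score. As a sketch of that analysis step, the toy ratings below are invented for illustration; the study reports r = 0.46 on its real data.

```python
# Sketch of the validation analysis: Pearson correlation between self-reported
# arousal and a pupil-based arousal score. The toy ratings are invented;
# the study reports r = 0.46 on real participant data.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

self_report = [1, 2, 3, 4, 5, 6]                 # hypothetical ratings
pupil_score = [1.2, 1.9, 3.3, 3.6, 5.4, 5.8]     # hypothetical algorithm output
r = pearson_r(self_report, pupil_score)
```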
We present PathWord (PATH passWORD), a multimodal digit entry method for ad-hoc authentication based on known digit shapes and relative eye movements. PathWord is a touch-free, gaze-based input modality that attempts to reduce shoulder-surfing attacks when unlocking a system using PINs. The system uses a modified web camera to detect the user's eye. This enables suppressing direct touch, making it difficult for passers-by to be aware of the input digits, thus reducing shoulder-surfing and smudge attacks. In addition to showing high accuracy rates (Study 1: 87.1% successful entries) and strong confidentiality through detailed evaluations with 42 participants (Study 2), we demonstrate how PathWord considerably diminishes the potential of stolen passwords (on average 2.38% stolen passwords with PathWord vs. over 90% with a traditional PIN screen). We show use cases of PathWord and discuss its advantages over traditional input modalities. We envision PathWord as a method to foster confidence while unlocking a system through gaze gestures.
Smart watches can enrich everyday interactions by providing both glanceable information and instant access to frequent tasks. However, reading text messages on a 1.5-inch small screen is inherently challenging, especially when a user's attention is divided. We present SmartRSVP, an attentive speed-reading system to facilitate text reading on small-screen wearable devices. SmartRSVP leverages camera-based visual attention tracking and implicit physiological signal sensing to make text reading via Rapid Serial Visual Presentation (RSVP) more enjoyable and practical on smart watches. Through a series of three studies involving 40 participants, we found that 1) SmartRSVP can achieve a significantly higher comprehension rate (57.5% vs. 23.9%) and perceived comfort (3.8 vs. 2.1) than traditional RSVP; 2) Users prefer SmartRSVP over traditional reading interfaces when they walk and read; 3) SmartRSVP can predict users' cognitive workloads and adjust the reading speed accordingly in real-time with 83.3% precision.
Despite the ubiquity and rapid growth of mobile reading activities, researchers and practitioners today either rely on coarse-grained metrics such as click-through rate (CTR) and dwell time, or on expensive equipment such as gaze trackers, to understand users' reading behavior on mobile devices. We present Lepton, an intelligent mobile reading system and a set of dual-channel sensing algorithms to achieve scalable and fine-grained understanding of users' reading behaviors, comprehension, and engagement on unmodified smartphones. Lepton tracks the periodic lateral patterns, i.e., saccades, of users' eye gaze via the front camera, and infers their muscle stiffness during text scrolling via a Mass-Spring-Damper (MSD) based kinematic model from touch events. Through a 25-participant study, we found that both the periodic saccade patterns and muscle stiffness signals captured by Lepton can be used as expressive features to infer users' comprehension and engagement in mobile reading. Overall, our new signals lead to significantly higher performance in predicting users' comprehension (correlation: 0.36 vs. 0.29), concentration (0.36 vs. 0.16), confidence (0.5 vs. 0.47), and engagement (0.34 vs. 0.16) than using traditional dwell-time based features via a user-independent model.
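To make the mass-spring-damper (MSD) idea concrete, the sketch below simulates how a scroll position would relax toward the finger's target under different stiffness values. This is not Lepton's fitting procedure: the parameter values and the forward simulation are illustrative assumptions; the actual system fits such a model to observed touch events to infer stiffness.

```python
# Sketch of the mass-spring-damper (MSD) view of scrolling: the scroll
# position x relaxes toward the finger's target with stiffness k and
# damping c. Parameter values are illustrative; Lepton fits this kind of
# model to real touch events rather than simulating forward.

def simulate_msd(target, k, c, m=1.0, dt=0.01, steps=500):
    """Semi-implicit Euler integration of m*x'' = -k*(x - target) - c*x',
    starting from rest at x = 0; returns the final position."""
    x, v = 0.0, 0.0
    for _ in range(steps):
        a = (-k * (x - target) - c * v) / m
        v += a * dt
        x += v * dt
    return x

soft = simulate_msd(target=1.0, k=5.0, c=4.0)     # low stiffness
stiff = simulate_msd(target=1.0, k=50.0, c=14.0)  # high stiffness
```

A stiffer system settles on the target faster, which is why the fitted stiffness is an expressive signal about the user's muscle state during scrolling.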
Motivated by the desire to give vehicles better information about their drivers, we explore human intent inference in the setting of a human driver riding in a moving vehicle. Specifically, we consider scenarios in which the driver intends to go to or learn about a specific point of interest along the vehicle's route, and an autonomous system is tasked with inferring this point of interest using gaze cues. Because the scene under observation is highly dynamic --- both the background and objects in the scene move independently relative to the driver --- such scenarios are significantly different from the static scenes considered by most literature in the eye tracking community. In this paper, we provide a formulation for this new problem of determining a point of interest in a dynamic scenario. We design an experimental framework to systematically evaluate initial solutions to this novel problem, and we propose our own solution called dynamic interest point detection (DIPD). We experimentally demonstrate the success of DIPD when compared to baseline nearest-neighbor or filtering approaches.
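The nearest-neighbor baseline mentioned above has a simple shape: at each frame, assign the gaze point to the closest tracked object, then pick the object with the most votes. The sketch below uses invented 2-D coordinates and object names; it is a stand-in for the baseline the paper compares DIPD against, not for DIPD itself.

```python
# Sketch of the nearest-neighbor baseline for point-of-interest inference in
# a dynamic scene: per frame, vote for the object closest to the gaze point,
# then return the most-voted object. Coordinates and names are toy values.

def nearest_object(gaze, objects):
    """objects: {name: (x, y)} positions for one frame."""
    return min(objects, key=lambda n: (objects[n][0] - gaze[0]) ** 2
                                      + (objects[n][1] - gaze[1]) ** 2)

def infer_poi(gaze_track, object_tracks):
    votes = {}
    for gaze, objects in zip(gaze_track, object_tracks):
        name = nearest_object(gaze, objects)
        votes[name] = votes.get(name, 0) + 1
    return max(votes, key=votes.get)

frames = [
    {"sign": (0.0, 0.0), "car": (5.0, 5.0)},
    {"sign": (1.0, 0.0), "car": (5.0, 4.0)},   # objects move between frames
    {"sign": (2.0, 0.0), "car": (5.0, 3.0)},
]
gaze = [(0.2, 0.1), (1.1, 0.2), (4.8, 3.1)]
poi = infer_poi(gaze, frames)
```

Because both the objects and the gaze move independently, this per-frame matching is noisy, which motivates the filtering-aware DIPD approach the paper proposes.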
In this paper, we introduce a novel gaze-only interaction technique called EyeLinks, which was designed i) to support various types of discrete clickables (e.g. textual links, buttons, images, tabs, etc.); ii) to be easy to learn and use; and iii) to mitigate the inaccuracy of affordable eye trackers. Our technique uses a two-step fixation approach: first, we assign numeric identifiers to clickables in the region users are gazing at; second, users select the desired clickable by performing a fixation on the corresponding confirm button, displayed in a sidebar. This two-step selection enables users to freely explore Web pages, avoids the Midas touch problem, and improves accuracy.
We evaluated our approach by comparing it against the mouse and another gaze-only technique (Actigaze). The results showed no statistically significant difference between EyeLinks and Actigaze, but users considered EyeLinks easier to learn and use than Actigaze and it was also the most preferred. Of the three, the mouse was the most accurate and efficient technique.
Data visualization has been receiving growing attention recently, with ubiquitous smart devices designed to render information in a variety of ways. However, while evaluations of visual tools for their interpretability and intuitiveness have been commonplace, not much research has been devoted to other forms of data rendering, e.g., sonification. This work is the first to automatically estimate the cognitive load induced by different acoustic parameters considered for sonification in prior studies (Ferguson et al., 2017, 2018). We examine cognitive load via (a) perceptual data-sound mapping accuracies of users for the different acoustic parameters, (b) cognitive workload impressions explicitly reported by users, and (c) their implicit EEG responses compiled during the mapping task. Our main findings are that (i) low cognitive load-inducing (i.e., more intuitive) acoustic parameters correspond to higher mapping accuracies, (ii) EEG spectral power analysis reveals higher α-band power for low cognitive load parameters, implying a congruent relationship between explicit and implicit user responses, and (iii) cognitive load classification with EEG features achieves a peak F1-score of 0.64, confirming that reliable workload estimation is achievable with user EEG data compiled using wearable sensors.
The rising prevalence of mental illnesses is increasing the demand for new digital tools to support mental wellbeing. Numerous collaborations spanning the fields of psychology, machine learning and health are building such tools. Machine-learning models that estimate the effects of mental health interventions currently rely on either user self-reports or measurements of user physiology. In this paper, we present a multimodal approach that combines self-reports from questionnaires and skin conductance physiology in a web-based trauma-recovery regime. We evaluate our models on the EASE multimodal dataset and create PTSD symptom severity change estimators at both the total and cluster level. We demonstrate that modeling the PTSD symptom severity change at the total level with self-reports can be statistically significantly improved by the combination of physiology and self-reports, or by skin conductance measurements alone. Our experiments show that PTSD symptom cluster severity changes are modeled significantly better using our novel multimodal approach than using self-reports or skin conductance alone when extracting skin conductance features from trigger modules for the avoidance, negative alterations in cognition & mood, and alterations in arousal & reactivity symptoms, while it performs statistically similarly for the intrusion symptom.
Quantitative analysis of gazes between a speaker and listeners was conducted from the viewpoint of mutual activities in floor apportionment, with the assumption that mutual gaze plays an important role in coordinating speech interaction. We conducted correlation analyses of the speaker's and listener's gazes in a three-party conversation, comparing native language (L1) and second language (L2) interaction in two types (free-flowing and goal-oriented). The analyses showed significant correlations between gazes from the current to the next speaker and those from the next to the current speaker during utterances preceding a speaker change in L1 conversation, suggesting that the participants were coordinating their speech turns with mutual gazes. In L2 conversation, however, such a correlation was found only in the goal-oriented type, suggesting that linguistic proficiency may affect the floor-apportionment function of mutual gazes, possibly because of the cognitive load of understanding/producing utterances.
The recent availability of lightweight, wearable cameras allows for collecting video data from a "first-person" perspective, capturing the visual world of the wearer in everyday interactive contexts. In this paper, we investigate how to exploit egocentric vision to infer multimodal behaviors from people wearing head-mounted cameras. More specifically, we estimate head (camera) motion from egocentric video, which can be further used to infer non-verbal behaviors such as head turns and nodding in multimodal interactions. We propose several approaches based on Convolutional Neural Networks (CNNs) that combine raw images and optical flow fields to learn to distinguish regions with optical flow caused by global ego-motion from those caused by other motion in a scene. Our results suggest that CNNs do not directly learn useful visual features with end-to-end training from raw images alone; instead, a better approach is to first extract optical flow explicitly and then train CNNs to integrate optical flow and visual information.
Group meetings can suffer from serious problems that undermine performance, including bias, "groupthink", fear of speaking, and unfocused discussion. To better understand these issues, propose interventions, and thus improve team performance, we need to study human dynamics in group meetings. However, this process currently heavily depends on manual coding and video cameras. Manual coding is tedious, inaccurate, and subjective, while active video cameras can affect the natural behavior of meeting participants. Here, we present a smart meeting room that combines microphones and unobtrusive ceiling-mounted Time-of-Flight (ToF) sensors to understand group dynamics in team meetings. We automatically process the multimodal sensor outputs with signal, image, and natural language processing algorithms to estimate participant head pose, visual focus of attention (VFOA), non-verbal speech patterns, and discussion content. We derive metrics from these automatic estimates and correlate them with user-reported rankings of emergent group leaders and major contributors to produce accurate predictors. We validate our algorithms and report results on a new dataset of lunar survival tasks of 36 individuals across 10 groups collected in the multimodal-sensor-enabled smart room.
Motivational Interviewing (MI) is a widely disseminated and effective therapeutic approach for behavioral disorder treatment. Over the past decade, MI research has identified client language as a central mediator between therapist skills and subsequent behavior change. Specifically, in-session client language referred to as change talk (CT; personal arguments for change) or sustain talk (ST; personal arguments against changing the status quo) has been directly related to post-session behavior change. Despite the prevalent use of MI and extensive studies of its underlying mechanisms, most existing work focuses on the linguistic aspect of MI, especially client change talk and sustain talk and how they mediate the outcome of MI. In this study, we perform statistical analyses on acoustic behavior descriptors to test their discriminatory power. We then utilize multimodality by combining acoustic features with linguistic features to improve the accuracy of client change talk prediction. Lastly, we probe our trained model to understand which features inform it about client utterance class and to gain insights into the nature of MISC codes.
We present a deep learning framework for real-time speech-driven 3D facial animation from speech audio. Our deep neural network directly maps an input sequence of speech spectrograms to a series of micro facial action unit intensities to drive a 3D blendshape face model. In particular, our deep model is able to learn the latent representations of time-varying contextual information and affective states within the speech. Hence, our model not only activates appropriate facial action units at inference to depict different utterance-generating actions, in the form of lip movements, but also, without any assumption, automatically estimates the emotional intensity of the speaker and reproduces her ever-changing affective states by adjusting the strength of the related facial unit activations. For example, in a happy speech, the mouth opens wider than normal, while other facial units are relaxed; or both eyebrows raise higher in a surprised state. Experiments on diverse audiovisual corpora of different actors across a wide range of facial actions and emotional states show promising results for our approach. Being speaker-independent, our generalized model is readily applicable to various tasks in human-machine interaction and animation.
This paper presents a novel approach in continuous emotion prediction that characterizes dimensional emotion labels jointly with continuous and discretized representations. Continuous emotion labels can capture subtle emotion variations, but their inherent noise often has negative effects on model training. Recent approaches found a performance gain when converting the continuous labels into a discrete set (e.g., using k-means clustering), despite a label quantization error. To find the optimal trade-off between the continuous and discretized emotion representations, we investigate two joint modeling approaches: ensemble and end-to-end. The ensemble model combines the predictions from two models that are trained separately, one with discretized prediction and the other with continuous prediction. On the other hand, the end-to-end model is trained to simultaneously optimize both discretized and continuous prediction tasks in addition to the final combination between them. Our experimental results using the state-of-the-art deep BLSTM network on the RECOLA dataset demonstrate that (i) the joint representation outperforms both individual representation baselines and the state-of-the-art speech based results on RECOLA, validating the assumption that combining continuous and discretized emotion representations yields better performance in emotion prediction; and (ii) the joint representation can help to accelerate convergence, particularly for valence prediction. Our work provides insights into joint discrete and continuous emotion representation and its efficacy for describing dynamically changing affective behavior in valence and activation prediction.
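The ensemble idea above (average a continuous prediction with the centroid of a discretized prediction) can be sketched with a tiny 1-D k-means. The label values, number of clusters, and the fixed 50/50 combination weight below are illustrative assumptions; the paper learns the combination and uses BLSTM predictors on RECOLA.

```python
# Sketch of the ensemble approach: discretize continuous emotion labels with
# a tiny 1-D k-means, then average a continuous prediction with the centroid
# of the predicted cluster. Labels, k, and the 50/50 weight are illustrative;
# the paper learns the combination with BLSTM predictors.

def kmeans_1d(values, centroids, iters=20):
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for v in values:
            i = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            buckets[i].append(v)
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids

def ensemble_prediction(continuous_pred, discrete_centroid, w=0.5):
    # trade-off between the noisy continuous value and the quantized one
    return w * continuous_pred + (1 - w) * discrete_centroid

labels = [-0.8, -0.7, -0.1, 0.0, 0.1, 0.7, 0.9]      # toy valence labels
cents = kmeans_1d(labels, centroids=[-1.0, 0.0, 1.0])
pred = ensemble_prediction(0.6, cents[2])
```

The quantization removes label noise at the cost of resolution; the weighted combination is one simple way to trade the two representations off.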
Within the affective computing and social signal processing communities, increasing efforts are being made to collect data with genuine (emotional) content. When it comes to negative emotions and even aggression, ethical and privacy-related issues prevent the usage of many emotion elicitation methods, and most often actors are employed to act out different scenarios. Moreover, for most databases, emotional arousal is not explicitly checked, and the footage is annotated by external raters based on observable behavior. In an attempt to gather data a step closer to real life, previous work proposed an elicitation method for collecting a database of negative affect and aggression that involved unscripted role-plays between aggression regulation training actors (actors) and naive participants (students), where only short role descriptions and goals are given to the participants. In this paper we present a validation study for the database of negative affect and aggression, investigating whether the actors' behavior (e.g. becoming more aggressive) had a real impact on the students' emotional arousal. We found significant changes in the students' heart rate variability (HRV) parameters corresponding to changes in the aggression level and emotional states of the actors, and therefore conclude that this method can be considered a good candidate for emotion elicitation.
Autonomous systems are designed to carry out activities in remote, hazardous environments without the need for operators to micro-manage them. It is, however, essential that operators maintain situation awareness in order to monitor vehicle status and handle unforeseen circumstances that may affect their intended behaviour, such as a change in the environment. We present MIRIAM, a multimodal interface that combines visual indicators of status with a conversational agent component. This multimodal interface offers a fluid and natural way for operators to gain information on vehicle status and faults, mission progress and to set reminders. We describe the system and an evaluation study providing evidence that such an interactive multimodal interface can assist in maintaining situation awareness for operators of autonomous systems, irrespective of cognitive styles.
Existing assistive technologies often capture and utilize a single remaining ability to assist people with tetraplegia, which makes complex interactions inefficient. In this work, we developed a multimodal assistive system (MAS) that utilizes multiple remaining abilities (speech, tongue, and head motion), sequentially or simultaneously, to facilitate complex computer interactions such as scrolling, drag and drop, and typing long sentences.
MAS inputs can be used to drive a wheelchair using only tongue motion, and to provide mouse functionalities (e.g., clicks, navigation) by combining tongue and head motions. To enable a seamless interface, MAS processes both head and tongue motions in the headset, with an average accuracy of 88.5%.
In a pilot study, four able-bodied participants performed a modified center-out tapping task, navigating the cursor via head tracking, clicking via a tongue command, and entering text through speech recognition. The average throughput in the final round was 1.28 bits/s, with a cursor navigation path efficiency of 68.62%.
Affect recognition aims to detect a person's affective state based on observables, with the goal of, e.g., improving human-computer interaction. Long-term stress is known to have severe implications for wellbeing, which calls for continuous and automated stress monitoring systems. However, the affective computing community lacks commonly used standard datasets for wearable stress detection that a) provide multimodal high-quality data, and b) include multiple affective states. Therefore, we introduce WESAD, a new publicly available dataset for wearable stress and affect detection. This multimodal dataset features physiological and motion data, recorded from both a wrist- and a chest-worn device, of 15 subjects during a lab study. The following sensor modalities are included: blood volume pulse, electrocardiogram, electrodermal activity, electromyogram, respiration, body temperature, and three-axis acceleration. Moreover, the dataset bridges the gap between previous lab studies on stress and emotions by containing three different affective states (neutral, stress, amusement). In addition, self-reports of the subjects, obtained using several established questionnaires, are contained in the dataset. Furthermore, a benchmark is created on the dataset using well-known features and standard machine learning methods. Considering the three-class classification problem (baseline vs. stress vs. amusement), we achieved classification accuracies of up to 80%. In the binary case (stress vs. non-stress), accuracies of up to 93% were reached. Finally, we provide a detailed analysis and comparison of the two device locations (chest vs. wrist) as well as the different sensor modalities.
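The benchmark pipeline described above has a common shape: slice each sensor channel into windows, compute simple statistics, and classify. The sketch below shows that shape under clearly stated assumptions: the single-threshold rule and toy samples are invented, whereas the actual benchmark uses richer feature sets and standard classifiers.

```python
# Sketch of the benchmark pipeline's shape: window-level statistics per
# sensor channel, then classification. The toy threshold rule on mean EDA
# and the sample values are invented; the real benchmark uses richer
# features and standard machine learning classifiers.

def window_stats(samples):
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return {"mean": mean, "std": var ** 0.5}

def classify(eda_window, threshold=0.5):
    """Toy rule: elevated mean electrodermal activity -> 'stress'."""
    return "stress" if window_stats(eda_window)["mean"] > threshold else "non-stress"

calm = classify([0.1, 0.2, 0.15, 0.1])    # hypothetical baseline window
tense = classify([0.9, 1.1, 1.0, 0.8])    # hypothetical stress window
```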
When an automatic wheelchair or a self-carrying robot moves along with human agents, prediction of the next possible actions of the participating agents plays an important role in realizing successful cooperation among them. In this paper, we mounted a robot on a wheelchair body so that it provides embodied projective signals to the human agents, indicating the next possible action to be performed by the wheelchair. In experiments, we analyzed how human participants, particularly in multiparty interaction, respond to such a system. We designed two settings for the robot's projective behavior. The first design allows the robot to face towards the human agents (Face-to-Face model); the other allows it to face forward as the human agents do, then turn around to the human agents when it indicates where the wheelchair will move (Body-Torque model). The analysis examined reactions by the human agents to the wheelchair, to their accompanying human agent, and to others who pass by them in the experimental setting. The results show that the Body-Torque model seems more effective than the Face-to-Face model in enhancing cooperative behavior among the human participants when they are moving forward together.
Automatic analysis of advertisements (ads) poses an interesting problem for learning multimodal representations. A promising direction of research is the development of deep neural network autoencoders to obtain inter-modal and intra-modal representations. In this work, we propose a system to obtain segment-level unimodal and joint representations. These features are concatenated, and then averaged across the duration of an ad to obtain a single multimodal representation. The autoencoders are trained using segments generated by time-aligning frames between the audio and video modalities with forward and backward context. In order to assess the multimodal representations, we consider the tasks of classifying an ad as funny or exciting in a publicly available dataset of 2,720 ads. For this purpose we train the segment-level autoencoders on a larger, unlabeled dataset of 9,740 ads, agnostic of the test set. Our experiments show that: 1) the multimodal representations outperform joint and unimodal representations, 2) the different representations we learn are complementary to each other, and 3) the segment-level multimodal representations perform better than classical autoencoders and cross-modal representations -- within the context of the two classification tasks. We obtain an improvement of about 5% in classification accuracy compared to a competitive baseline.
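The concatenate-then-average step described above can be sketched as follows (the dimensionalities and values are invented for illustration, not actual autoencoder outputs):

```python
def fuse_segments(audio, video, joint):
    """Concatenate per-segment unimodal and joint feature vectors, then
    average across segments to obtain one ad-level representation."""
    concat = [a + v + j for a, v, j in zip(audio, video, joint)]
    dim = len(concat[0])
    n = len(concat)
    return [sum(seg[i] for seg in concat) / n for i in range(dim)]

# two hypothetical segments with 2-D audio, 2-D video and 2-D joint features
rep = fuse_segments(
    audio=[[1.0, 0.0], [3.0, 0.0]],
    video=[[0.5, 0.5], [0.5, 0.5]],
    joint=[[0.0, 1.0], [0.0, 3.0]],
)
# rep is a single 6-D vector for the whole ad
```

The single fixed-length vector is what makes downstream binary classifiers (funny/exciting) straightforward to train regardless of ad duration.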
Correctly interpreting an interlocutor's emotional expression is paramount to a successful interaction. But what happens when one of the interlocutors is a machine? The facilitation of human-machine communication and cooperation is of growing importance as smartphones, autonomous cars, or social robots increasingly pervade human social spaces. Previous research has shown that emotionally expressive virtual characters generally elicit higher cooperation and trust than 'neutral' ones. Since emotional expressions are multi-modal, and given that virtual characters can be designed to our liking in all their components, would a mismatch in the emotion expressed in the face and voice influence people's cooperation with a virtual character? We developed a game where people had to cooperate with a virtual character in order to survive on the moon. The character's face and voice were designed to either smile or not, resulting in 4 conditions: smiling voice and face, neutral voice and face, smiling voice only (neutral face), smiling face only (neutral voice). The experiment was set up in a museum over the course of several weeks; we report preliminary results from over 500 visitors, showing that people tend to trust the virtual character more in the mismatched condition with a smiling face and neutral voice. This might be because the two channels express different aspects of an emotion, as previously suggested.
Immersive virtual environments (IVEs) present rich possibilities for the experimental study of non-verbal communication. Here, we test the 'digital chameleon' effect, which suggests that a virtual speaker (agent) is more persuasive if it mimics its addressee's head movements. Using a specially constructed IVE, we recreate a full-body analogue of the 'digital chameleon' experiment. The agent's behaviour is manipulated in three conditions: 1) Mimic (Chameleon), in which it copies the participant's nodding; 2) Playback (Nodding Dog), in which it replays nods from a previous participant that are therefore unconnected with the content; and 3) Original (Human), in which it uses the prerecorded actor's movements. The results do not support the original finding of differences in ratings of agent persuasiveness between conditions. However, motion capture data reveals systematic differences a) in the real-time movements of speakers and listeners and b) between the Original, Mimic and Playback conditions. We conclude that the automatic mimicry model is too simplistic and that this paradigm must address the reciprocal dynamics of non-verbal interaction to achieve its full potential.
This paper presents an approach for generating photorealistic video sequences of dynamically varying facial expressions in human-agent interactions. To this end, we study human-human interactions to model the relationship and influence of one individual's facial expressions on the reaction of the other. We introduce a two-level optimization of generative adversarial models, wherein the first stage generates a dynamically varying sequence of the agent's face sketch conditioned on facial expression features derived from the interacting human partner. This serves as an intermediate representation, which is used to condition a second-stage generative model to synthesize high-quality video of the agent face. Our approach uses a novel L1 regularization term computed from layer features of the discriminator, which are integrated with the generator objective in the GAN model. Session constraints are also imposed on video frame generation to ensure appearance consistency between consecutive frames. We demonstrated that our model is effective at generating visually compelling facial expressions. Moreover, we quantitatively showed that agent facial expressions in the generated video clips reflect valid emotional reactions to the behavior of the human partner.
This paper presents a novel approach for automatic prediction of risk of ADHD in schoolchildren based on touch interaction data. We performed a study with 129 fourth-grade students solving math problems on a multiple-choice interface to obtain a large dataset of touch trajectories. Using Support Vector Machines, we analyzed the predictive power of such data for ADHD scales. For regression of overall ADHD scores, we achieve a mean squared error of 0.0962 on a four-point scale (R² = 0.5667). Classification accuracy for increased ADHD risk (upper vs. lower third of collected scores) is 91.1%.
Creating tactile representations of visual information, especially moving images, is difficult due to a lack of available tactile computing technology and a lack of tools for authoring tactile information. To address these limitations, we developed a software framework that enables educators and other subject experts to create graphical representations that combine audio descriptions with kinetic motion. These audio-kinetic graphics can be played back using off-the-shelf computer hardware. We report on a study in which 10 educators developed content using our framework, and in which 18 people with vision impairments viewed these graphics on our output device. Our findings provide insights on how to translate knowledge of visual information to non-visual formats.
Modern smartphones are built with capacitive-sensing touchscreens, which can detect anything that is conductive or has a dielectric differential with air. The human finger is an example of such a dielectric, and works wonderfully with such touchscreens. However, touch interactions are disrupted by raindrops, water smear, and wet fingers because capacitive touchscreens cannot distinguish finger touches from other conductive materials. When users' screens get wet, the screen's usability is significantly reduced. RainCheck addresses this hazard by filtering out potential touch points caused by water to differentiate fingertips from raindrops and water smear, adapting in real-time to restore successful interaction to the user. Specifically, RainCheck uses the low-level raw sensor data from touchscreen drivers and employs precise selection techniques to resolve water-fingertip ambiguity. Our study shows that RainCheck improves gesture accuracy by 75.7% and touch accuracy by 47.9%, and reduces target selection time by 80.0%, making it a successful remedy to interference caused by rain and other water.
Emotion recognition is a core research area at the intersection of artificial intelligence and human communication analysis. It is a significant technical challenge since humans display their emotions through complex idiosyncratic combinations of the language, visual and acoustic modalities. In contrast to traditional multimodal fusion techniques, we approach emotion recognition from both direct person-independent and relative person-dependent perspectives. The direct person-independent perspective follows the conventional emotion recognition approach which directly infers absolute emotion labels from observed multimodal features. The relative person-dependent perspective approaches emotion recognition in a relative manner by comparing partial video segments to determine if there was an increase or decrease in emotional intensity. Our proposed model integrates these direct and relative prediction perspectives by dividing the emotion recognition task into three easier subtasks. The first subtask involves a multimodal local ranking of relative emotion intensities between two short segments of a video. The second subtask uses local rankings to infer global relative emotion ranks with a Bayesian ranking algorithm. The third subtask incorporates both direct predictions from observed multimodal behaviors and relative emotion ranks from local-global rankings for final emotion prediction. Our approach displays excellent performance on an audio-visual emotion recognition benchmark and improves over other algorithms for multimodal fusion.
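The second subtask, turning local pairwise comparisons into a global ordering, can be illustrated with a much simpler win-count aggregation (the abstract describes a Bayesian ranking algorithm; this stand-in only shows the data flow from local comparisons to global ranks):

```python
from collections import defaultdict

def global_ranks(num_segments, pairwise):
    """Aggregate local pairwise outcomes (i, j, winner) into a global
    ordering by win counts -- a simplified stand-in for the Bayesian
    ranking step described above."""
    wins = defaultdict(int)
    for i, j, winner in pairwise:
        wins[winner] += 1
    # most-winning segment first = highest relative emotion intensity
    return sorted(range(num_segments), key=lambda s: -wins[s])

# hypothetical comparisons over 3 video segments
order = global_ranks(3, [(0, 1, 1), (1, 2, 2), (0, 2, 2)])
```

A Bayesian ranker additionally models uncertainty in each local comparison, but the input/output contract is the same.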
Robots, virtual assistants, and other intelligent agents need to effectively interpret verbal references to environmental objects in order to successfully interact and collaborate with humans in complex tasks. However, object disambiguation can be a challenging task due to ambiguities in natural language. To reduce uncertainty when describing an object, humans often use a combination of unique object features and locative prepositions --prepositional phrases that describe where an object is located relative to other features (i.e., reference objects) in a scene. We present a new system for object disambiguation in cluttered environments based on probabilistic models of unique object features and spatial relationships. Our work extends prior models of spatial relationship semantics by collecting and encoding empirical data from a series of crowdsourced studies to better understand how and when people use locative prepositions, how reference objects are chosen, and how to model prepositional geometry in 3D space (e.g., capturing distinctions between "next to" and "beside"). Our approach also introduces new techniques for responding to compound locative phrases of arbitrary complexity and proposes a new metric for disambiguation confidence. An experimental validation revealed our method can improve object disambiguation accuracy and performance over past approaches.
Tactile information in the palm is a necessary component in manipulating and perceiving large or heavy objects. Noting this, we investigate human sensitivity to tactile haptic feedback in the palm for improved user interface design. To provide distributed tactile patterns, we propose an ungrounded haptic interface that can stimulate multiple locations in the palm independently. Two experiments were conducted to evaluate human sensitivity to distributed tactile patterns. The first experiment tested participants' sensitivity to tactile patterns by sub-section of the palm, and a significant effect of the sub-section on sensitivity was observed. In the second experiment, participants identified pressure distribution patterns in the palm collected from real-life objects, with a percent correct of 71.4% and an information transfer (IT) of 1.58 bits.
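Information transfer (IT) in an identification experiment is conventionally estimated as the mutual information of the stimulus-response confusion matrix; a minimal sketch (the 2x2 matrix is a toy example, not the study's data):

```python
import math

def information_transfer(confusion):
    """Mutual information (bits) between stimulus and response,
    estimated from a confusion matrix of raw counts."""
    n = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]          # per-stimulus totals
    col_sums = [sum(col) for col in zip(*confusion)]    # per-response totals
    it = 0.0
    for i, row in enumerate(confusion):
        for j, nij in enumerate(row):
            if nij:
                it += (nij / n) * math.log2(nij * n / (row_sums[i] * col_sums[j]))
    return it

# a perfectly identified 2-pattern set carries exactly 1 bit
assert information_transfer([[5, 0], [0, 5]]) == 1.0
```

With more patterns and imperfect identification, the estimate falls between 0 and log2 of the number of patterns, which is how values such as 1.58 bits arise.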
Social skills training, performed by human trainers, is a well-established method for obtaining appropriate skills in social interaction. Previous work automated the process of social skills training by developing a dialogue system that teaches social skills through interaction with a computer agent. Even though previous work that simulated social skills training considered speaking skills, human social skills trainers take into account other skills, such as listening. In this paper, we propose assessment of user listening skills during conversation with computer agents, toward automated social skills training. We recorded data of 27 Japanese graduate students interacting with a female agent. The agent spoke to the participants about a recent memorable story and how to make a telephone call, and the participants listened. Two expert external raters assessed the participants' listening skills. We manually extracted features relating to eye fixation and behavioral cues of the participants, and confirmed that a simple linear regression with selected features can predict a user's listening skills with a correlation coefficient above 0.45.
While many organizations provide a website in multiple languages, few provide a sign-language version for deaf users, many of whom have lower written-language literacy. Rather than providing difficult-to-update videos of humans, a more practical solution would be for the organization to specify a script (representing the sequence of words) from which to generate a sign-language animation. The challenge is to select accurate speed and timing for the signs. In this work, focused on American Sign Language (ASL), motion-capture data recorded from humans is used to train machine learning models that calculate realistic timing for ASL animation movement, with an initial focus on inserting prosodic breaks (pauses), adjusting the durations of these pauses, and adjusting differential signing rates for ASL animations based on the sentence syntax and other features. The methodology includes processing and cleaning data from an ASL corpus with motion-capture recordings, selecting features, and building machine learning models to predict where to insert pauses, the length of pauses, and signing speed. The resulting models were evaluated using a cross-validation approach to train and test multiple models on various partitions of the dataset, comparing various learning algorithms and subsets of features. In addition, a user-based evaluation was conducted in which native ASL signers evaluated animations generated from these models. This paper summarizes the motivations for this work, the proposed solution, and the potential contribution of this work. It describes both completed work and some additional future research plans.
Group meetings are often inefficient, unorganized and poorly documented. Factors including "group-think," fear of speaking, unfocused discussion, and bias can affect the performance of a group meeting. In order to actively or passively facilitate group meetings, automatically analyzing group interaction patterns is critical. Existing research on group dynamics analysis still heavily depends on video cameras in the lines of sight of participants or wearable sensors, both of which could affect the natural behavior of participants. In this thesis, we present a smart meeting room that combines microphones and unobtrusive ceiling-mounted Time-of-Flight (ToF) sensors to understand group dynamics in team meetings. Since the ToF sensors are ceiling-mounted and out of the lines of sight of the participants, we posit that their presence would not disrupt the natural interaction patterns of individuals. We collect a new multi-modal dataset of group interactions where participants have to complete a task by reaching a group consensus, and then fill out a post-task questionnaire. We use this dataset for the development of our algorithms and analysis of group meetings. In this paper, we combine the ceiling-mounted ToF sensors and lapel microphones to: (1) estimate the seated body orientation of participants, (2) estimate the head pose and visual focus of attention (VFOA) of meeting participants, (3) estimate the arm pose and body posture of participants, and (4) analyze the multimodal data for passive understanding of group meetings, with a focus on perceived leadership and contribution.
Augmented reality eyewear devices (e.g. glasses, headsets) are poised to become ubiquitous in a similar way to smartphones, by providing quicker and more convenient access to information. There is theoretically no limit to their application areas, and many use cases are already being explored, such as medical, education, industry, entertainment, and military. Some interactions with these eyewear devices are becoming standard, such as mid-air hand gestures and voice commands. Paradoxically, in many use cases where such eyewear devices are currently deployed, users cannot perform the available interactions without constraints: e.g. when their hands are already occupied, when they are in a noisy environment, or conversely when silence is required and voice commands cannot be used properly, or in a social context where both mid-air hand gestures and voice commands could be seen as odd or undesirable. Thus, this thesis project aims to extend the interactivity of augmented reality eyewear devices: 1) by providing more discreet interactions, such as head gestures based on cognitive image schema theory, metaphorical extension, and natural user interfaces based on smartwatch finger-touch gestures; and 2) by using the user's context to provide the most convenient interface and feedback in the right space and time. The underlying objective of this project is to facilitate the acceptance and usage of augmented reality eyewear devices.
There are various real-world applications, such as video ads, airport screenings, courtroom trials, and job interviews, where deception detection can play a crucial role. Hence, there is immense demand for deception detection in videos. Videos contain rich information, including acoustic, visual, temporal, and/or linguistic information, which provides great opportunities for advanced deception detection. However, videos are inherently complex; moreover, they lack deception labels in many real-world applications, which poses tremendous challenges to traditional deception detection. In this manuscript, I present my Ph.D. research on the problem of deception detection in videos. In particular, I provide a principled way to capture rich information in a coherent model and propose an end-to-end framework, DEV, to detect DEceptive Videos automatically. Preliminary results on real-world videos demonstrate the effectiveness of the proposed framework.
Analysis of student engagement in an e-learning environment would facilitate effective task accomplishment and learning. Generally, engagement/disengagement can be estimated from facial expressions, body movements and gaze patterns. The focus of this Ph.D. work is to explore automatic student engagement assessment while watching Massive Open Online Course (MOOC) videos in real-world environments. Most work in this area so far has focused on engagement assessment in lab-controlled environments. Several challenges are involved in moving from lab-controlled environments to real-world scenarios, such as face tracking, illumination, occlusion, and context. The early work in this Ph.D. project explores student engagement while watching MOOCs. The unavailability of any publicly available dataset in the domain of user engagement motivated us to collect a dataset in this direction. The dataset contains 195 videos captured from 78 subjects, amounting to about 16.5 hours of recording. The dataset was independently annotated by different labelers, and the final label is derived from statistical analysis of the individual labels given by the different annotators. Various traditional machine learning algorithms and deep learning based networks are used to establish baselines on the dataset. Engagement prediction and localization are modeled as a Multi-Instance Learning (MIL) problem. In this work, the importance of the Hierarchical Attention Network (HAN) is studied. This architecture is motivated by the hierarchical nature of the problem, where a video is made up of segments and segments are made up of frames.
Social robots need non-verbal behavior to make an interaction pleasant and efficient. Most models for generating non-verbal behavior are rule-based, and hence can produce only a limited set of motions and are tuned to a particular scenario. In contrast, data-driven systems are flexible and easily adjustable. Hence we aim to learn a data-driven model for generating non-verbal behavior (in the form of a 3D motion sequence) for humanoid robots.
Our approach is based on a popular and powerful deep generative model: the Variational Autoencoder (VAE). The input to our model will be multi-modal, and we will iteratively increase its complexity: first, it will use only the speech signal, then also the text transcription, and finally the non-verbal behavior of the conversation partner. We will evaluate our system on virtual avatars as well as on two humanoid robots with different embodiments: NAO and Furhat. Our model can be easily adapted to a novel domain by providing application-specific training data.
I introduce a novel multi-modal multi-sensor interaction method between humans and heterogeneous multi-robot systems. I have also developed a novel algorithm to control heterogeneous multi-robot systems. The proposed algorithm allows the human operator to provide intentional cues and information to a multi-robot system using a multimodal multi-sensor touchscreen interface. My proposed method can effectively convey complex human intention to multiple robots as well as represent robots' intentions over the spatiotemporal domain. The proposed method is scalable and robust to dynamic change in the deployment configuration. I describe the implementation of the control algorithm used to control multiple quad-rotor unmanned aerial vehicles in simulated and real environments. I will also present my initial work on human interaction with the robots running my algorithm using mobile phone touch screens and other potential multimodal interactions.
Multi-modal sentiment detection from natural video/audio streams has recently received much attention. I propose to use this multi-modal information to develop a technique, Sentiment Coloring, that utilizes the detected sentiments to generate effective responses. In particular, I aim to produce suggested responses colored with sentiment appropriate to that present in the interlocutor's speech. To achieve this, contextual information pertaining to sentiment, extracted from the past as well as the current speech of both speakers in a dialog, will be utilized. Sentiment here includes the three polarities (positive, neutral, and negative) as well as other expressions of stance and attitude. Utilizing only non-verbal cues, namely prosody and gaze, I will implement and compare two algorithmic approaches to sentiment detection: a simple machine learning algorithm (neural networks) that will act as the baseline, and a deep learning approach, an end-to-end bidirectional LSTM RNN, which is the state of the art in emotion classification. I will build responsive spoken dialog systems with this Sentiment Coloring technique and evaluate them with human subjects to measure the benefits of the technique in various interactive environments.
This work seeks to explore the potential of textile sensing systems as a new modality for capturing social behaviour. The focus lies on evaluating the performance of embedded pressure sensors as reliable detectors of social cues, such as postural states. We designed chair covers and trousers that were evaluated in two studies. The results show that these relatively simple sensors can distinguish postures as well as different behavioural cues.
We converse with other people using both sound and visuals, as our perception of speech is bimodal. Since both modalities essentially echo the same speech structure, we manage to integrate them and often understand the message better than with our eyes closed. In this work we would like to learn more about the visual nature of speech, known as lip-reading, and to make use of it towards better automatic speech recognition systems. Recent developments in machine learning, together with the release of suitable audio-visual datasets aimed at large-vocabulary continuous speech recognition, have led to renewed interest in lip-reading, and allow us to address the recurring question of how to better integrate visual and acoustic speech.
Automatic analysis of teacher-student interactions is an interesting research problem in social computing. Such interactions happen in both online and classroom settings. While teaching effectiveness is the goal in both settings, the mechanisms for achieving it may differ between settings. To characterize these interactions, multimodal behavioral signals and language use need to be measured, and a model to predict effectiveness needs to be learnt. These would help characterize the teaching skill of the teacher and the level of engagement of the students. Also, there could be multiple styles of teaching which can be effective.
This paper outlines PhD research aimed at modeling empathy in embodied conversational systems. Our goal is to determine the requirements for implementing an empathic interactive agent and to develop evaluation methods aligned with empathy research from various fields. The thesis comprises three scientific contributions: (i) developing a computational model of empathy, (ii) implementing the model in embodied conversational agents, and (iii) enhancing the understanding of empathy in interaction by generating data and building evaluation tools. The paper gives results for contribution (i) and preliminary results for contribution (ii). Moreover, we present the future plan for contributions (ii) and (iii).
This work introduces EVA, a multimodal argumentative dialogue system that is capable of discussing controversial topics with the user. The interaction is structured as an argument game in which the user and the system select respective moves in order to convince their opponent. EVA's response is presented as a natural language utterance by a virtual agent that supports the respective content with characteristic gestures and facial expressions.
Tracking learners' engagement is useful for monitoring their learning quality. With an increasing number of online video courses, a system that can automatically track learners' engagement is expected to significantly help improve learners' study outcomes. In this demo, we show such a system, which predicts a user's engagement changes in real time. Our system utilizes the webcams now ubiquitous in computers, a face tracking function that runs inside the web browser to avoid sending learners' videos to the cloud, and a Python Flask web service. Our demo provides a solution that uses mature technologies to provide real-time engagement monitoring with privacy protection.
This work describes our approach to controlling lighter-than-air agents using multimodal control via a wearable device. Tactile and gesture interfaces on a smart watch are used to control the motion and altitude of these semi-autonomous agents. The tactile interface consists of the touch screen and rotatable bezel. The gesture interface detects when the user puts his/her hand in the stop position. The touch interface controls the direction of the agents, the rotatable bezel controls the altitude set-point, and the gesture interface stops the agents. Our interactive demonstration will allow users to control a lighter-than-air (LTA) system via the multimodal wearable interface as described above.
Autonomous systems in remote locations have a high degree of autonomy, and there is a need to explain what they are doing and why, in order to increase transparency and maintain trust. This is particularly important in hazardous, high-risk scenarios. Here, we describe a multimodal interface, MIRIAM, that enables remote vehicle behaviour to be queried by the user, along with mission and vehicle status. These explanations, as part of the multimodal interface, help improve the operator's mental model of what the vehicle can and can't do, increase transparency and assist with operator training.
The multimodal recognition of eating condition - whether a person is eating or not, and if so, which food type - is a new research domain in the area of speech and video processing that has many promising applications for future multimodal interfaces, such as adapting speech recognition or lip reading systems to different eating conditions. We herein describe the ICMI 2018 Eating Analysis and Tracking (EAT) Challenge and address - for the first time in research competitions under well-defined conditions - new classification tasks in the area of user data analysis, namely audio-visual classification of user eating conditions. We define three Sub-Challenges based on classification tasks in which participants are encouraged to use speech and/or video recordings of the audio-visual iHEARu-EAT database. In this paper, we describe the dataset, the Sub-Challenges, their conditions, and the baseline feature extraction and performance measures as provided to the participants.
Automatic recognition of the eating conditions of humans could be a useful technology in health monitoring. Audio-visual information can be used to automate this process, and feature engineering approaches can reduce its dimensionality. Reduced dimensionality (particularly through feature subset selection) can assist in designing a system for eating condition recognition with lower power, cost, memory and computation requirements than a system designed using the full dimensionality of the data. This paper presents Active Feature Transformation (AFT) and Active Feature Selection (AFS) methods, and applies them to all three tasks of the ICMI 2018 EAT Challenge for recognition of user eating conditions using audio and visual features. The AFT method is used to transform the Mel-frequency Cepstral Coefficient and ComParE features for the classification task, while the AFS method helps in selecting a feature subset. Transformation by Principal Component Analysis (PCA) is also used for comparison. Using the AFS method, we find feature subsets of audio features (422 for Food Type, 104 for Likability and 68 for Difficulty, out of 988 features) which provide better results than the full feature set. Our results show that AFS outperforms PCA and AFT in terms of accuracy for the recognition of user eating conditions using audio features. The AFT of visual features (facial landmarks) provides less accurate results than the AFS and AFT sets of audio features. However, the weighted score fusion of all the feature sets improves the results.
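Weighted score fusion of the kind mentioned in the last sentence can be sketched as a weighted sum of per-class scores followed by an arg-max (the weights and posterior scores below are invented for illustration, not the paper's values):

```python
def weighted_score_fusion(score_lists, weights):
    """Fuse per-class scores from several feature sets with fixed weights,
    then predict the arg-max class."""
    num_classes = len(score_lists[0])
    fused = [sum(w * scores[c] for scores, w in zip(score_lists, weights))
             for c in range(num_classes)]
    return fused.index(max(fused)), fused

# hypothetical posteriors from three feature sets over three classes
pred, fused = weighted_score_fusion(
    [[0.6, 0.3, 0.1],   # e.g. audio AFS scores
     [0.5, 0.4, 0.1],   # e.g. audio AFT scores
     [0.2, 0.5, 0.3]],  # e.g. visual AFT scores
    weights=[0.5, 0.3, 0.2],
)
```

Because fusion operates on scores rather than features, each subsystem can keep its own reduced feature set.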
In this paper, we investigate subjects' food likability based on audio-related features, as a contribution to EAT, the ICMI 2018 Eating Analysis and Tracking challenge. Specifically, we conduct a 4-level Dual-Tree Complex Wavelet Transform decomposition of each audio signal and obtain five sub-audio signals with frequencies ranging from low to high. For each sub-audio signal, we compute not only 'traditional' functional-based features but also deep learning-based features via pre-trained CNNs applied to the sliCQ non-stationary Gabor transform and a cochleagram map. In addition, Bag-of-Audio-Words features extracted from the original audio signals with the openXBOW toolkit are used to strengthen the model. Finally, the early fusion of these three kinds of features leads to promising results, yielding the highest UAR of 79.2% in a leave-one-speaker-out cross-validation, a 12.7% absolute gain over the baseline of 66.5% UAR.
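The bag-of-audio-words representation produced by openXBOW quantises frame-level descriptors against a learned codebook and histograms the assignments; a minimal numpy sketch of that idea, with a toy codebook in place of one learned from the challenge data:

```python
import numpy as np

def bag_of_audio_words(frames, codebook):
    """Assign each frame-level feature vector to its nearest codebook
    vector and return a normalised histogram of assignments - the
    bag-of-audio-words vector for the whole recording."""
    # squared Euclidean distance from every frame to every codeword
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assignments = d.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()          # term-frequency normalisation

# toy codebook and four 2-D "frames" (illustrative numbers only)
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2], [4.8, 5.1], [5.2, 4.9]])
bow = bag_of_audio_words(frames, codebook)
print(bow)   # one frame near codeword 0, one near 1, two near 2
```

The fixed-length histogram is what makes variable-length recordings comparable, and is what gets fused with the functional-based and CNN-based features above.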
The use of Convolutional Neural Networks (CNNs) pre-trained for a particular task as feature extractors for an alternate task is standard practice in many image classification paradigms. However, to date there have been comparatively few works exploring this technique for speech classification tasks. Herein, we utilise a pre-trained end-to-end Automatic Speech Recognition CNN as a feature extractor for the task of food-type recognition from speech. Furthermore, we explore the benefits of Compact Bilinear Pooling for combining multiple feature representations extracted from the CNN. Key results indicate the suitability of this approach: when combined with a Recurrent Neural Network classifier, our strongest system achieves an unweighted average recall of 73.3% on the seven-class food-type classification task, evaluated on the test set of the iHEARu-EAT database.
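Compact Bilinear Pooling approximates full bilinear pooling - the flattened outer product of two feature vectors - with a much shorter Count-Sketch projection. A sketch of the full form it approximates, on toy dimensions rather than actual CNN features:

```python
import numpy as np

def bilinear_pool(u, v):
    """Full bilinear pooling: the flattened outer product of two feature
    vectors, capturing all pairwise feature interactions. Compact Bilinear
    Pooling approximates this d1*d2-dimensional vector with a short
    Count-Sketch projection to keep downstream classifiers tractable."""
    return np.outer(u, v).ravel()

u = np.array([1.0, 2.0])        # e.g. features from one CNN layer
v = np.array([3.0, 4.0, 5.0])   # e.g. features from another layer
z = bilinear_pool(u, v)
print(z)   # 6-dimensional combined representation
```

The quadratic blow-up visible even in this toy case (2 x 3 inputs already give 6 outputs) is exactly why the compact approximation is needed for real CNN feature maps.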
This paper presents the novel Functional-based acoustic Group Feature Selection (FGFS) method for automatic eating condition recognition, addressed in the ICMI 2018 Eating Analysis and Tracking Challenge's Food-type Sub-Challenge. The Food-type Sub-Challenge employs the audio-visual iHEARu-EAT database and attempts to classify which of six food types, or none, is being consumed by subjects while speaking. The FGFS method uses the audio modality and considers the acoustic feature space in groups rather than as individual features. Each group comprises the acoustic features generated by applying one statistical functional to a specified set of low-level descriptors of the audio data. The FGFS method thus provides information about the degree of relevance of each statistical functional to the task. In addition, partitioning the features into groups allows for more rapid processing of the Sub-Challenge's large official acoustic feature set. The FGFS-based system achieves a 2.8% relative Unweighted Average Recall improvement over the official Food-type Sub-Challenge baseline on the iHEARu-EAT test data.
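The grouping idea can be sketched independently of the FGFS specifics: features are partitioned by the functional that produced them, each group is scored as a unit, and whole groups are kept or dropped. The mean-absolute-correlation score below is a hypothetical stand-in for the FGFS criterion:

```python
import numpy as np

def score_group(X, y, cols):
    """Score a feature group as the mean absolute correlation of its
    member columns with the label (a stand-in for the FGFS criterion)."""
    return np.mean([abs(np.corrcoef(X[:, c], y)[0, 1]) for c in cols])

def select_groups(X, y, groups, n_keep):
    """Rank functional groups as units and keep the columns of the
    best n_keep groups - far fewer evaluations than per-feature search."""
    ranked = sorted(groups, key=lambda g: score_group(X, y, groups[g]),
                    reverse=True)
    kept = ranked[:n_keep]
    cols = [c for g in kept for c in groups[g]]
    return kept, X[:, cols]

rng = np.random.default_rng(1)
y = rng.normal(size=80)
X = rng.normal(size=(80, 6))
X[:, 0] = y + 0.1 * rng.normal(size=80)   # make the "mean" group informative
# hypothetical functional groups over the feature columns
groups = {"mean": [0, 1], "std": [2, 3], "range": [4, 5]}
kept, X_sel = select_groups(X, y, groups, n_keep=1)
print(kept, X_sel.shape)
```

Scoring whole groups rather than 988 individual features is what yields the faster processing the abstract claims, at the cost of coarser selection granularity.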
Emotion recognition (ER) based on natural facial images and videos has been studied for some years and is considered a hot topic in the field of affective computing. However, it remains a challenge to perform ER in the wild, given the noise introduced by head-pose variation, face deformation, and illumination changes. To address this challenge, motivated by recent progress in Convolutional Neural Networks (CNNs), we develop a novel deeply supervised CNN (DSN) architecture, taking multi-level and multi-scale features extracted from different convolutional layers to provide a richer representation for ER. By embedding a series of side-output layers, our DSN model provides class-wise supervision and integrates predictions from multiple layers. Our team ranked 3rd in the EmotiW 2018 challenge, with our model achieving an accuracy of 61.1%.
This paper presents a light-weight and accurate deep neural model for audio-visual emotion recognition. To design this model, the authors followed a philosophy of simplicity, drastically limiting the number of parameters to learn from the target datasets and always choosing the simplest learning methods: i) transfer learning and low-dimensional space embedding reduce the dimensionality of the representations; ii) visual temporal information is handled by a simple score-per-frame selection process averaged across time; iii) a simple frame selection mechanism weights images within sequences; iv) the different modalities are fused at prediction level (late fusion). The paper also highlights the inherent challenges of the AFEW dataset and the difficulty of model selection with as few as 383 validation sequences. The proposed real-time emotion classifier achieved a state-of-the-art accuracy of 60.64% on the test set of AFEW and ranked 4th in the Emotion Recognition in the Wild 2018 challenge.
This paper elaborates on the winning approach for engagement intensity prediction in the EmotiW Challenge 2018. The task is to predict the engagement level of a subject while he or she is watching an educational video under diverse conditions and in different environments. Our approach formulates the prediction task as a multi-instance regression problem. We divide an input video sequence into segments and calculate the temporal and spatial features of each segment for regressing the intensity. Subject engagement, which is intuitively related to body and face changes over time, can be characterized by a long short-term memory (LSTM) network. Hence, we build a multi-modal regression model based on the multi-instance mechanism as well as LSTM. To make full use of the training and validation data, we train different models for different data splits and finally perform model ensembling. Experimental results show that our method achieves a mean squared error (MSE) of 0.0717 on the validation set, improving on the baseline results by 28%. Our method won the challenge with an MSE of 0.0626 on the testing set.
In this paper, we propose an automatic engagement prediction method for the Engagement in the Wild sub-challenge of EmotiW 2018. We first design a novel Gaze-AU-Pose (GAP) feature taking into account the gaze, action units, and head pose of a subject. The GAP feature is then used for the subsequent engagement level prediction. To efficiently predict the engagement level of a long video, we divide it into multiple overlapping video clips and extract a GAP feature for each clip. A deep model consisting of a Gated Recurrent Unit (GRU) layer and a fully connected layer is used as the engagement predictor. Finally, a mean pooling layer is applied to the per-clip estimates to obtain the final engagement level of the whole video. Experimental results on the validation and test sets show the effectiveness of the proposed approach. In particular, our approach achieves a promising result with an MSE of 0.0724 on the test set of the Engagement Prediction Challenge of EmotiW 2018.
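The clip-then-pool scheme can be sketched independently of the GRU: cut the frame sequence into overlapping windows, score each clip, and mean-pool the per-clip estimates. The dummy scorer below is a hypothetical stand-in for the GRU predictor:

```python
def overlapping_clips(n_frames, clip_len, stride):
    """Start/end indices of overlapping fixed-length clips covering a video."""
    starts = range(0, max(n_frames - clip_len, 0) + 1, stride)
    return [(s, s + clip_len) for s in starts]

def predict_video(frames, clip_len, stride, clip_scorer):
    """Score each clip, then mean-pool the per-clip engagement estimates."""
    clips = overlapping_clips(len(frames), clip_len, stride)
    scores = [clip_scorer(frames[s:e]) for s, e in clips]
    return sum(scores) / len(scores)

# toy per-frame feature values; the scorer (mean of a clip) stands in
# for the GRU + fully connected predictor described above
frames = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
level = predict_video(frames, clip_len=4, stride=2,
                      clip_scorer=lambda c: sum(c) / len(c))
print(level)   # -> 0.7
```

Because the stride is smaller than the clip length, frames contribute to several clips, which smooths the final pooled estimate over a long video.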
Engagement is the holy grail of learning, whether in a classroom setting or on an online learning platform. Studies have shown that knowing the engagement level of a student while learning can benefit both the student and the teacher. It is difficult to keep track of the engagement of each student in face-to-face learning in a large classroom, and even more difficult on an online learning platform, where users access the material at different times. Automatic analysis of student engagement can help to better understand the state of the student in classroom settings as well as on online learning platforms, and is more scalable. In this paper, we propose a framework that uses a Temporal Convolutional Network (TCN) to estimate the engagement intensity of students watching video material from Massive Open Online Courses (MOOCs). The input to the TCN is a set of statistical features computed on 10-second segments of the video from the gaze, head pose, and action unit intensities available in the OpenFace library. The ability of the TCN architecture to capture long-term dependencies allows it to outperform other sequential models such as LSTMs. On the test set of the EmotiW 2018 sub-challenge "Engagement in the Wild", the proposed approach with a dilated TCN achieved an average mean squared error of 0.079.
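The building block behind a TCN's long-range memory is the dilated causal convolution: each layer looks further back in time without growing the kernel, and stacking layers grows the receptive field exponentially. A minimal single-channel sketch:

```python
def dilated_causal_conv(x, weights, dilation):
    """1-D dilated causal convolution: the output at time t depends only
    on x[t], x[t - d], x[t - 2d], ... (implicit zero-padding at the start),
    so no future frames leak into the prediction."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            j = t - i * dilation
            if j >= 0:
                acc += w * x[j]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0]
# kernel [1, -1] with dilation 2: y[t] = x[t] - x[t-2], a temporal
# difference filter over a 2-step gap (illustrative weights)
y = dilated_causal_conv(x, [1.0, -1.0], dilation=2)
print(y)   # -> [1.0, 2.0, 2.0, 2.0, 2.0]
```

Doubling the dilation at each layer (1, 2, 4, ...) is what lets a TCN cover long engagement sequences with few parameters, the property credited above for outperforming LSTMs.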
In this paper we propose a new approach for classifying the global emotion of images containing groups of people. To achieve this task, we consider two different and complementary sources of information: (i) a global representation of the entire image, and (ii) a local representation where only faces are considered. While the global representation of the image is learned with a convolutional neural network (CNN), the local representation is obtained by merging face features through an attention mechanism. The two representations are first learned independently with two separate CNN branches and then fused through concatenation to obtain the final group-emotion classifier. For our submission to the EmotiW 2018 group-level emotion recognition challenge, we combine several variations of the proposed model into an ensemble, obtaining a final accuracy of 64.83% on the test set and ranking 4th among all challenge participants.
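Merging per-face features through attention amounts to a softmax-weighted average, so informative faces dominate the group representation while the output stays fixed-size regardless of how many faces the image contains. A numpy sketch with a hypothetical linear scoring vector in place of the learned attention parameters:

```python
import numpy as np

def attention_merge(face_feats, score_vec):
    """Softmax-weighted average of per-face feature vectors.
    face_feats: (n_faces, d) matrix; score_vec: a learned scoring
    vector (hypothetical values in this sketch)."""
    scores = face_feats @ score_vec            # one relevance score per face
    w = np.exp(scores - scores.max())          # stable softmax
    w /= w.sum()
    return w, w @ face_feats                   # weights + merged representation

# three toy 2-D face features; the third scores highest under score_vec
faces = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 10.0]])
w, merged = attention_merge(faces, np.array([0.0, 1.0]))
print(w.argmax(), merged.shape)
```

Because the weights sum to one, the merged vector has the same dimensionality as a single face feature, which is what allows it to be concatenated with the global CNN branch.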
Precise detection and localization of learners' engagement levels are useful for monitoring their learning quality. For the EmotiW Challenge's engagement detection task, we propose a series of novel improvements, including (a) a cluster-based framework for fast engagement level prediction, (b) a neural network using an attention pooling mechanism, (c) heuristic rules using body posture information, and (d) a model ensemble for more accurate and robust predictions. Our experimental results suggest that the proposed methods effectively improve engagement detection performance. On the validation set, our system reduces the baseline Mean Squared Error (MSE) by about 56%. On the final test set, our system yields a competitively low MSE of 0.081.
Group-level Emotion Recognition (GER) in the wild is a challenging task that has been gaining considerable attention. Most recent works utilize two channels of information, a channel involving only faces and a channel containing the whole image, to solve this problem. However, modeling the relationship between the faces and the scene in a global image remains challenging. In this paper, we propose a novel face-location-aware global network that captures face location information in the form of an attention heatmap to better model such relationships. We also propose a multi-scale face network to infer the group-level emotion from individual faces, which explicitly handles the high variance in image and face size, since images in the wild are collected from different sources at different resolutions. In addition, a global blurred stream is developed to explicitly learn and extract scene-only features. Finally, we propose a four-stream hybrid network, consisting of the face-location-aware global stream, the multi-scale face stream, the global blurred stream, and a global stream, to address the GER task, and show the effectiveness of our method in the GER sub-challenge, part of the sixth Emotion Recognition in the Wild (EmotiW 2018) Challenge. The proposed method achieves 65.59% and 78.39% accuracy on the testing and validation sets, respectively, and ranked third on the leaderboard.
In this paper, we present our latest progress in emotion recognition techniques, which combine acoustic features and facial features in both non-temporal and temporal modes. The paper details the techniques we used in the Audio-Video Emotion Recognition subtask of the 2018 Emotion Recognition in the Wild (EmotiW) Challenge. After multimodal result fusion, our final accuracy on the Acted Facial Expressions in the Wild (AFEW) test dataset reaches 61.87%, which is 1.53% higher than last year's best result. Such improvements prove the effectiveness of our methods.
This paper presents a hybrid deep learning network submitted to the 6th Emotion Recognition in the Wild (EmotiW 2018) Grand Challenge, in the category of group-level emotion recognition. Advanced deep learning models trained individually on faces, scenes, skeletons, and salient regions using visual attention mechanisms are fused to classify the emotion of a group of people in an image as positive, neutral, or negative. Experimental results show that the proposed hybrid network achieves 78.98% and 68.08% classification accuracy on the validation and testing sets, respectively. These results outperform the baselines of 64% and 61%, and achieved first place in the challenge.
This paper presents our approach to the group-level emotion recognition sub-challenge of EmotiW 2018. The task is to classify an image into one of the group emotions positive, negative, and neutral. Our approach mainly explores three cues, namely face, body, and global image, with recent deep networks. Our main contribution is two-fold. First, we introduce body-based Convolutional Neural Networks (CNNs) into this task, building on our previous winning method. For the body-based CNNs, we crop all bodies in an image with a state-of-the-art human pose estimation method and train CNNs with the image-level label; the body cue captures a full view of an individual. Second, we propose a cascade attention network for the face cue. This network exploits the importance of each face in an image to generate a global representation based on all faces. The cascade attention network is not only complementary to the other models but also improves on naive average pooling by about 2%. We achieve second place in this sub-challenge, with classification accuracies of 86.9% and 67.48% on the validation and testing sets, respectively.
The difficulty of emotion recognition in the wild (EmotiW) lies in training a robust model to deal with diverse scenarios and anomalies. The Audio-video Sub-challenge of EmotiW contains short audio-video clips with emotional labels, and the task is to determine which label each video belongs to. For better emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework, which more accurately depicts emotional information in the spatial and temporal dimensions through two mutually complementary sources: the facial images and the audio. The framework consists of two parts: the facial image model and the audio model. For the facial image model, three different spatio-temporal neural network architectures are employed to extract discriminative features for different emotions from facial expression images. First, high-level spatial features are obtained by pre-trained convolutional neural networks (CNNs), namely VGG-Face and ResNet-50, fed with the frames extracted from each video. The features of all frames are then sequentially input to a Bi-directional Long Short-Term Memory (BLSTM) network to capture the dynamic variation of facial appearance textures in a video. In addition to the CNN-RNN structure, another spatio-temporal network, a deep 3-Dimensional Convolutional Neural Network (3D CNN) obtained by extending the 2D convolution kernel to 3D, is applied to capture the evolving emotional information encoded in multiple adjacent frames. For the audio model, spectrogram images generated by preprocessing the speech audio are likewise modeled in a VGG-BLSTM framework to characterize the affective fluctuations more efficiently. Finally, a fusion strategy over the score matrices of the different spatio-temporal networks is proposed to complementarily boost emotion recognition performance.
Extensive experiments show that the overall accuracy of our proposed MSFF is 60.64%, a large improvement over the baseline that also outperforms the result of the champion team in 2017.
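Score-level fusion of this kind combines each network's class-score matrix with per-model weights before taking the argmax; a minimal sketch with made-up scores and uniform weights (the actual fusion weights are not given in the abstract):

```python
import numpy as np

def fuse_scores(score_mats, weights):
    """Weighted sum of per-model score matrices (samples x classes),
    followed by an argmax per sample to get the fused prediction."""
    fused = sum(w * s for w, s in zip(weights, score_mats))
    return fused, fused.argmax(axis=1)

# two models' score matrices over 2 samples x 3 emotion classes
# (illustrative numbers, not real network outputs)
m1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
m2 = np.array([[0.1, 0.7, 0.2], [0.1, 0.2, 0.7]])
fused, pred = fuse_scores([m1, m2], weights=[0.5, 0.5])
print(pred)   # -> [1 2]
```

Fusing at the score level lets heterogeneous models (CNN-BLSTM, 3D CNN, audio VGG-BLSTM) be combined without sharing any internal architecture, which is what makes the streams complementary.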
This paper details the sixth Emotion Recognition in the Wild (EmotiW) challenge. EmotiW 2018 is a grand challenge at the ACM International Conference on Multimodal Interaction 2018, Colorado, USA. The challenge aims to provide a common platform for researchers in the affective computing community to benchmark their algorithms on 'in the wild' data. This year, EmotiW comprises three sub-challenges: a) audio-video based emotion recognition; b) student engagement prediction; and c) group-level emotion recognition. The databases, protocols, and baselines are discussed in detail.
This is the introduction paper to the third version of the workshop on 'Multisensory Approaches to Human-Food Interaction', organized at the 20th ACM International Conference on Multimodal Interaction in Boulder, Colorado, on October 16th, 2018. This workshop is a space where the fast-growing research on Multisensory Human-Food Interaction is presented. Here we summarize the workshop's key objectives and contributions.
Analysis of group interaction and team dynamics is an important topic in a wide variety of fields, owing to the amount of time that individuals typically spend in small groups for both professional and personal purposes, and given how crucial group cohesion and productivity are to the success of businesses and other organizations. This fact is attested by the rapid growth of fields such as People Analytics and Human Resource Analytics, which in turn have grown out of many decades of research in social psychology, organizational behaviour, computing, and network science, amongst other fields. The goal of this workshop is to bring together researchers from diverse fields related to group interaction, team dynamics, people analytics, multi-modal speech and language processing, social psychology, and organizational behaviour.
Multimodal signals allow us to gain insight into a person's internal cognitive processes: speech and gesture analysis yields cues about hesitation, knowledgeability, or alertness; eye tracking yields information about a person's focus of attention, task, or cognitive state; and EEG yields information about a person's cognitive load or information appraisal. Capturing cognitive processes is an important research tool for understanding human behavior, as well as a crucial part of the user model of an adaptive interactive system such as a robot or a tutoring system. As cognitive processes are often multifaceted, a comprehensive model requires the combination of multiple complementary signals. In this workshop at the ACM International Conference on Multimodal Interaction (ICMI 2018) in Boulder, Colorado, USA, we discussed the state of the art in monitoring and modeling cognitive processes from multimodal signals.
This paper presents an introduction to the "Human-Habitat for Health (H3): Human-habitat multimodal interaction for promoting health and well-being in the Internet of Things era" workshop, which was held at the 20th ACM International Conference on Multimodal Interaction on October 16th, 2018, in Boulder, CO, USA. The main theme of the workshop was the effect of the physical or virtual environment on an individual's behavior, well-being, and health. The H3 workshop included keynote speeches that provided an overview and future directions of the field, as well as presentations of position papers and research contributions. The workshop brought together experts from academia and industry spanning a set of multi-disciplinary fields, including computer science, speech and spoken language understanding, construction science, life sciences, health sciences, and psychology, to discuss their respective views and identify synergistic and converging research directions and solutions.
This paper gives a brief overview of the third workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction (MA3HMI). The paper focuses on the main aspects intended to be discussed in the workshop, reflecting the main scope of the papers presented during the meeting. The MA3HMI 2018 workshop is held in conjunction with the 20th ACM International Conference on Multimodal Interaction (ICMI 2018), taking place in Boulder, USA, in October 2018. This year, we have solicited papers concerning the different phases of the development of multimodal systems. Tools and systems that address real-time conversations with artificial agents and technical systems are also within the scope.