MA3HMI'18: Proceedings of the 4th International Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction

Analysis of the Effect of Agent's Embodiment and Gaze Amount on Personality Perception

In this study, we implemented a gaze model, based on an analysis of Japanese gaze behavior, on both a robot with manipulable eyeballs and a virtual agent. We then analyzed, with Japanese participants, how agent embodiment and changes in gaze amount affect the perception of the agent's personality. The results suggest that changing an agent's gaze amount can express extroversion and confidence regardless of the agent's embodiment. However, how perceived extroversion grows with gaze amount depends on the embodiment: the virtual agent's perceived extroversion increased in proportion to its gaze amount, whereas the robot's perceived extroversion increased more slowly.

User Affect and No-Match Dialogue Scenarios: An Analysis of Facial Expression

Recent years have seen significant advances in natural language dialogue management and a growing recognition that multimodality can inform dialogue policies. A key dialogue policy problem is presented by 'no-match' scenarios, in which the dialogue system receives a user utterance for which no matching response is found. This paper reports on a study of the 'no-match' problem in the context of a dialogue agent embedded within a game-based learning environment. We investigate how users' facial expressions exhibited in response to the agent's no-match utterances predict the users' opinion of the agent after the interaction has completed. The results indicate that models incorporating users' facial expressions following no-match utterances are highly predictive of user opinion and significantly outperform baseline models. This work represents a key step toward affect-informed dialogue systems whose policies are informed by users' affective expression.

Exploring Siamese Neural Network Architectures for Preserving Speaker Identity in Speech Emotion Classification

Voice-enabled communication is increasingly being used in real-world applications, such as those involving conversational bots or "chatbots". Chatbots can spark and sustain user engagement by effectively recognizing users' emotions and acting upon them. However, the majority of emotion recognition systems rely on rich spectrotemporal acoustic features. Beyond the emotion-related information, such systems tend to preserve information relevant to the identity of the speaker, which raises major privacy concerns for users. This paper introduces two hybrid architectures for privacy-preserving emotion recognition from speech. These architectures rely on a Siamese neural network whose input and intermediate layers are transformed using various privacy-preserving operations in order to retain emotion-dependent content and suppress information related to the identity of the speaker. The proposed approach is evaluated through emotion classification and speaker identification performance metrics. Results indicate that the proposed framework can achieve up to 67.4% for classifying between happy, sad, frustrated, angry, neutral and other emotions, obtained from the publicly available Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. At the same time, the proposed approach reduces speaker identification accuracy to 50%, compared with the 81% achieved by a feedforward neural network trained solely on the speaker identification task using the same input features.

Extracting Interpersonal Stance from Vocal Signals

The role of emotions and other affective states within Human-Computer Interaction (HCI) is gaining importance. Introducing affect into computer applications typically makes these systems more efficient, effective and enjoyable. This paper presents a model that is able to extract interpersonal stance from vocal signals. To achieve this, a dataset of 3840 sentences spoken by 20 semi-professional actors was built and was used to train and test a model based on Support Vector Machines (SVM). An analysis of the results indicates that there is much variation in the way people express interpersonal stance, which makes it difficult to build a generic model. Instead, the model shows good performance on the individual level (with accuracy above 80%). The implications of these findings for HCI systems are discussed.

A Pilot Study on Adaptive Gesture Use in Interaction with Non-native Listeners

Given that human speakers adapt their communicative behavior towards non-native listeners -- a phenomenon known as foreigner talk or teacher talk -- the communicative behavior of an interactive, intelligent virtual agent (IIVA) should also be adaptive towards the needs of non-native listeners. To investigate whether it makes sense to distinguish different degrees of language proficiency in non-native addressees when designing communicative behavior skills for IIVAs in mixed-cultural settings, we present first results from a pilot study that is meant to prepare a comprehensive corpus collection. Native speakers of German were asked to explain given German terms to non-native speakers of either low or intermediate language proficiency in German. Results showed significant differences in gesture frequency and also in the types and size of gestures being used, depending on the language proficiency of the non-native listener.

Recognition of Human Movement Patterns during a Human-Agent Interaction

The analysis of human behaviour during an interaction with an interlocutor reveals a large number of facets. One interesting facet is the movement behaviour of a subject while communicating, especially in the context of a human-computer interaction. In particular, the paper focusses on the recognition of movement patterns in a room during an interaction with two virtual agents. For this, we investigated a close-to-real-life scenario providing two virtual agents on different screens (i.e. the CASIA Coffee House Corpus). In this context, a feature set consisting of ten statistical features for the (automatic) recognition of movement patterns is proposed. Further, we automatically clustered the samples provided in the corpus and cross-checked the results with a manual annotation. In doing so, we identified two meaningful movement patterns, which we assume will also appear in other similar scenarios. Finally, we automatically classified the movement patterns based on the proposed features by applying a multi-layer perceptron. We obtained an average error rate of 12.0%.
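
As a rough illustration of what statistical features over a movement track can look like, the following Python sketch summarises a sequence of 2D room positions. The abstract does not list the paper's ten features, so the statistics below, the function name `movement_features`, and the `(x, y)` track format are all assumptions made for this example and should not be read as the authors' feature set.

```python
import math
import statistics


def movement_features(track):
    """Summarise a movement track with simple statistics.

    `track` is a list of (x, y) positions sampled at a fixed rate.
    The chosen statistics are illustrative assumptions, not the
    feature set proposed in the paper.
    """
    if len(track) < 2:
        raise ValueError("need at least two positions")

    xs = [p[0] for p in track]
    ys = [p[1] for p in track]
    # Distance covered between consecutive samples.
    steps = [math.dist(a, b) for a, b in zip(track, track[1:])]

    return {
        "mean_x": statistics.fmean(xs),
        "mean_y": statistics.fmean(ys),
        "std_x": statistics.pstdev(xs),
        "std_y": statistics.pstdev(ys),
        "range_x": max(xs) - min(xs),
        "range_y": max(ys) - min(ys),
        "path_length": sum(steps),
        "mean_step": statistics.fmean(steps),
        "max_step": max(steps),
        "net_displacement": math.dist(track[0], track[-1]),
    }
```

A feature dictionary of this kind could then be turned into a fixed-length vector and passed to a clustering method or a classifier such as a multi-layer perceptron.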

Multimodal Reference Resolution In Collaborative Assembly Tasks

Humans use verbal and non-verbal cues to communicate their intent in collaborative tasks. In situated dialogue, speakers typically direct their interlocutor's attention to referent objects using multimodal cues, and references to such entities are resolved collaboratively. In this study we designed a multiparty task where humans teach each other how to assemble furniture, and captured eye-gaze, speech and pointing gestures. We analysed which multimodal cues carry the most information for resolving referring expressions, and report an object saliency classifier that, using multisensory input from speaker and addressee, detects the referent objects during the collaborative task.

PauseCode: Computational Conversation Timing Analysis

Pauses play a critical role in adding, shifting or contradicting meaning in a conversation. To enable the study and incorporation of this important modality in computational discourse analytic and processing systems, we require extensible open source pause coding systems and associated software libraries. We designed and implemented a coding and visualisation system for pause and overlap detection and analysis, extending existing voicing and silence detection algorithms. Demonstrating the system using the TalkBank CallFriend and CallHome corpora we show how the approach can be used to code many different kinds of pauses and overlaps within and between interlocutors, and calculate the temporal distribution of these different types of pause and overlap. The coding schema is intended to be combined with other speech modalities to provide novel approaches to predicting social cues and markers, useful for designing more naturalistic conversational agents, and in new tools for measuring turn-taking structure of conversation in greater depth and accuracy.
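
The idea of coding pauses and overlaps within and between interlocutors can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the PauseCode implementation: it assumes voiced segments have already been detected and attributed to speakers, it only compares segments that are adjacent in time, and the `Segment` type, the function names, and the three labels are hypothetical names invented for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker: str  # hypothetical speaker label, e.g. "A" or "B"
    start: float  # voiced-segment onset in seconds
    end: float    # voiced-segment offset in seconds


def code_events(segments):
    """Label the silence or overlap between time-adjacent voiced segments.

    Returns a list of (label, start, end) tuples. The labels are
    illustrative and are not taken from the PauseCode schema.
    """
    ordered = sorted(segments, key=lambda s: (s.start, s.end))
    events = []
    for prev, cur in zip(ordered, ordered[1:]):
        same_speaker = prev.speaker == cur.speaker
        if cur.start > prev.end:
            # Silence between the two segments.
            label = ("pause_within_speaker" if same_speaker
                     else "pause_between_speakers")
            events.append((label, prev.end, cur.start))
        elif cur.start < prev.end and not same_speaker:
            # Two different speakers are voiced at the same time.
            events.append(("overlap", cur.start, min(prev.end, cur.end)))
    return events


def total_duration_by_label(events):
    """Sum the duration in seconds of each event type."""
    totals = defaultdict(float)
    for label, start, end in events:
        totals[label] += end - start
    return dict(totals)


if __name__ == "__main__":
    turns = [
        Segment("A", 0.0, 1.0),
        Segment("A", 1.5, 2.0),
        Segment("B", 2.25, 3.0),
        Segment("A", 2.75, 3.5),
    ]
    for event in code_events(turns):
        print(event)
    print(total_duration_by_label(code_events(turns)))
```

Keeping the events as plain `(label, start, end)` tuples makes it straightforward to compute a temporal distribution per event type, or to align the events with annotations from other speech modalities.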