ICMI '21 Companion: Companion Publication of the 2021 International Conference on Multimodal Interaction


SESSION: ICMI 2021 Late-Breaking Results

Knock&Tap: Classification and Localization of Knock and Tap Gestures using Deep Sound Transfer Learning

Gesture interaction is considered a promising approach to controlling smart devices. In this paper, we present Knock&Tap, an audio-based approach that performs gesture classification and gesture localization using deep transfer learning. Knock&Tap consists of a single 4-microphone array that records the sound of the user's knocking and tapping gestures and a wood/glass panel for knocking and tapping. Knock&Tap can be used in situations or environments where vision-based gesture recognition is impossible due to lighting conditions or camera-installation constraints. Various experiments were conducted to validate the feasibility of Knock&Tap with 7 gesture types on both wood and glass panels. Our experimental results show that Knock&Tap predicts the gesture type and location with an accuracy of up to 97.24% and 92.05%, respectively.

How Do HCI Researchers Describe Their Software Tools? Insights From a Synopsis Survey of Tools for Multimodal Interaction

Providing tools to support the design and engineering of interactive computing systems has been encouraged in the HCI community. However, little is known about the practices HCI researchers adopt to describe their software tools in academic publications. To address this gap, we implemented a simplified literature survey procedure combining principles of population sampling and systematic literature reviews to enable rapid access to insights from a vast body of published academic work. We report that screenshots and diagrams are among the descriptive elements most widely used by HCI researchers to present their software tools, a finding we capitalize on to reflect on the dissemination of tools for the design and engineering of multimodal interaction at the intersection of software engineering and HCI.

Multisensor-Pipeline: A Lightweight, Flexible, and Extensible Framework for Building Multimodal-Multisensor Interfaces

We present the multisensor-pipeline (MSP), a lightweight, flexible, and extensible framework for prototyping multimodal-multisensor interfaces based on real-time sensor input. Our open-source framework (available on GitHub) enables researchers and developers to easily integrate multiple sensors or other data streams via source modules, to add stream and event processing capabilities via processor modules, and to connect user interfaces or databases via sink modules in a graph-based processing pipeline. Our framework is implemented in Python with a low number of dependencies, which enables a quick setup process, execution across multiple operating systems, and direct access to cutting-edge machine learning libraries and models. We showcase the functionality and capabilities of MSP through a sample application that connects a mobile eye tracker to classify image patches surrounding the user’s fixation points and visualizes the classification results in real-time.
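The source/processor/sink decomposition described in the abstract can be illustrated with a minimal, framework-agnostic sketch. The class and method names below are illustrative, not the actual multisensor-pipeline API, and a real framework would run the modules concurrently:

```python
from queue import Queue

class SourceModule:
    """Emits raw samples into the pipeline (stands in for a sensor)."""
    def __init__(self, samples):
        self.samples = samples
    def run(self, out_q):
        for s in self.samples:
            out_q.put(s)
        out_q.put(None)  # sentinel: stream finished

class ProcessorModule:
    """Transforms each sample (here: a moving average over two samples)."""
    def run(self, in_q, out_q):
        prev = None
        while (s := in_q.get()) is not None:
            out_q.put(s if prev is None else (prev + s) / 2)
            prev = s
        out_q.put(None)

class SinkModule:
    """Collects processed samples (stands in for a UI or database)."""
    def __init__(self):
        self.results = []
    def run(self, in_q):
        while (s := in_q.get()) is not None:
            self.results.append(s)

def run_pipeline(samples):
    """Wire source -> processor -> sink and drain the stream."""
    q1, q2 = Queue(), Queue()
    src, proc, sink = SourceModule(samples), ProcessorModule(), SinkModule()
    src.run(q1)   # in a real framework these three would run in parallel
    proc.run(q1, q2)
    sink.run(q2)
    return sink.results
```

The queue-with-sentinel pattern is what lets such modules be swapped independently: any module that respects the same stream contract can replace another.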

Detecting Face Touching with Dynamic Time Warping on Smartwatches: A Preliminary Study

Respiratory diseases such as the novel coronavirus (COVID-19) can be transmitted through people's face-touching behaviors. One of the official recommendations for protecting ourselves from such viruses is to avoid touching our eyes, nose, or mouth with unwashed hands. However, prior work has found that people touch their face 23 times per hour on average without realizing it. Therefore, in this Late-Breaking Work, we explore a possible approach to help users avoid touching their face in daily life by alerting them through a smartwatch application every time a face-touching behavior occurs. We selected 10 everyday activities, including several that should be easy to distinguish from face touching and several that should be more challenging. We recruited 10 participants and asked them to perform each activity repeatedly for 3 minutes at their own pace while wearing a Samsung smartwatch. Based on the collected accelerometer data, we used dynamic time warping (DTW), a method well suited to small datasets, to distinguish between the two groups of activities (i.e., face-touching and non-face-touching). Our findings show that the DTW-based classifier is capable of classifying the activities into two groups with high accuracy (i.e., 99.07% for the user-dependent scenario). We demonstrated that smartwatches have the potential to detect face-touching behaviors with the proposed methodology. Future work can explore other classification approaches, collect larger datasets, and consider other sensors to increase the robustness of our results.
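For readers unfamiliar with DTW, the following is a minimal pure-Python sketch of the distance computation and a 1-nearest-neighbour classifier over labelled template sequences. It is illustrative only, not the study's actual implementation, and operates on 1-D sequences for brevity:

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

def classify(query, templates):
    """1-nearest-neighbour classification by DTW distance.
    `templates` maps a label to a list of reference sequences."""
    best_label, best_dist = None, float("inf")
    for label, seqs in templates.items():
        for seq in seqs:
            d = dtw_distance(query, seq)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label
```

Because DTW compares a query against stored exemplars directly, it needs no training phase, which is what makes it attractive for small datasets like the one described above.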

Predicting Worker Accuracy from Nonverbal Behaviour: Benefits and Potential for Algorithmic Bias

With the rise of algorithmic management (and online work in general), there is a growing interest in techniques that can monitor worker performance. For example, if a system can automatically detect whether a worker is becoming distracted or disengaged, it can intervene to motivate the worker or flag their output as requiring further quality control. Prior research has explored the potential for detecting nonverbal cues that could predict mistakes (e.g., detecting boredom in students or fatigue in drivers). Here, we learn a model that reliably predicts worker accuracy from nonverbal behaviour during tedious and repetitive annotation tasks. We show that annotation accuracy can be substantially improved by discarding annotations that the model predicts to be of low accuracy. While these results are promising, recent concerns about algorithmic bias led us to further investigate whether the accuracy is influenced by skin tone. Unfortunately, we find that the algorithm showed systematic bias that disadvantaged some dark-skinned workers and incorrectly rewarded some with lighter skin. We discuss the apparent reasons for this bias and suggestions for how and if such methods could be deployed to enhance worker engagement.

Sensorimotor Synchronization in Blind Musicians: Does Lack of Vision Influence Non-verbal Musical Communication?

Musical communication can be considered a form of social interaction, and it requires a certain degree of inter-individual cortical synchronization. The scientific literature has shown that visual experience is necessary for social interaction and that visual impairment negatively influences nonverbal communication in both children and adults. In this study, we present a pilot multimodal dataset to investigate the role that visual experience plays in the context of musical interaction. We selected a mixed interaction scenario (i.e., sighted and blind performers) to investigate non-verbal communication patterns. In particular, we recorded motion tracking data to analyze the interaction between two musicians in three different contexts, with the same cellist as soloist accompanied by (1) a blind pianist, (2) a sighted pianist, and (3) a sighted but blindfolded pianist. Recordings comprised upper-body motion capture and audio data. We also investigated human perception of changes in non-verbal behaviors by means of an online annotation questionnaire. Preliminary results are presented.

Group-Level Focus of Visual Attention for Improved Active Speaker Detection

This work addresses the problem of active speaker detection in physically situated multiparty interactions. This challenge requires a robust solution that can perform effectively across a wide range of speakers and physical contexts. Current state-of-the-art active speaker detection approaches rely on machine learning methods that do not generalize well to new physical settings. We find that these methods do not transfer well even between similar datasets. We propose the use of group-level focus of visual attention in combination with a general audio-video synchronizer method for improved active speaker detection across speakers and physical contexts. Our dataset-independent experiments demonstrate that the proposed approach outperforms state-of-the-art methods trained specifically for the task of active speaker detection.

SESSION: AAP'21 Workshop

Towards Chatbot-Supported Self-Reporting for Increased Reliability and Richness of Ground Truth for Automatic Pain Recognition: Reflections on Long-Distance Runners and People with Chronic Pain

Pain is a ubiquitous and multifaceted experience, making the gathering of ground truth for training machine learning systems particularly difficult. In this paper, we reflect on the use of voice-based Experience Sampling Method (ESM) approaches for collecting pain self-reports in two different real-life case studies: long-distance runners, and people living with chronic pain performing housework activities. We report on the reflections emerging from these two qualitative studies, in which semi-structured interviews were used to exploratively gather initial insights on how voice-based ESM could affect the collection of self-reports as ground truth. While frequent ESM questions may be considered intrusive, most of our participants found them useful, and even welcomed those question prompts. In particular, they found that such voice-based questions facilitated in-the-moment self-reflection and stimulated a sense of companionship, leading to richer self-reporting and possibly more reliable ground truth. We discuss the ways in which participants benefitted from subjective self-reporting, leading to increased awareness and self-understanding. In addition, we make the case for building a chatbot with ESM capabilities in order to gather richer, more refined, yet structured ground truth that combines pain ratings with their qualification. Such rich ground truth could be seen as more reliable, and could contribute to more refined machine learning models that better capture the complexity of the pain experience.

Automated Assessment of Pain: Prospects, Progress, and a Path Forward

Advances in the understanding and control of pain require methods for measuring its presence, intensity, and other qualities. Shortcomings of the main method for evaluating pain—verbal report—have motivated the pursuit of other measures. Measurement of observable pain-related behaviors, such as facial expressions, has provided an alternative, but has seen limited application because available techniques are burdensome. Computer vision and machine learning techniques have been successfully applied to the assessment of pain-related facial expression, suggesting that automated assessment may be feasible. Further development is necessary before such techniques can have more widespread implementation in pain science and clinical practice. Suggestions are made for the dimensions that need to be addressed to facilitate such developments.

SESSION: ASMMC'21 Workshop

BERT Based Cross-Task Sentiment Analysis with Adversarial Learning

Sentiment Analysis (SA) is an essential task in natural language processing. Generally, previous sentiment analysis models focus on a single subtask. However, a generalized SA agent is expected to learn knowledge from one task and use it in other relevant tasks. Consequently, we formulate this challenge as an unsupervised task adaption problem and propose TAL-IS, a simple and efficient approach to fine-tune a cross-task SA model. In this approach, we use Task Adversarial Learning (TAL) with a BERT-specific Input Standardization (IS) scheme to obtain contextual features that are both emotion-discriminative and task-invariant. To the best of our knowledge, our work is the first attempt to propose a cross-task model for SA subtasks with unsupervised task adaption. Experiments show that our proposed model outperforms the general fine-tuning method and can effectively transfer knowledge across SA subtasks.

Aspect-based Sentiment Analysis with Weighted Relational Graph Attention Network

The aim of aspect-based sentiment analysis (ABSA) is to determine the sentiment polarity of a specific aspect in a sentence. Most recent works resort to exploiting syntactic information by utilizing a Graph Attention Network (GAT) over dependency trees, and have achieved great progress. However, models based on the traditional GAT do not fully exploit syntactic information such as the diversified types of dependency relations. A variant of GAT, the relational graph attention network (R-GAT), takes different types of dependency relations into consideration, but ignores the information hidden in word pairs. In this paper, we propose a novel model called the weighted relational graph attention network (WRGAT). It can exploit more accurate syntactic information by employing a weighted relational head, in which contextual information from word pairs is introduced into the computation of the attention weights of dependency relations. Furthermore, we employ BERT instead of a Bi-directional Long Short-term Memory (Bi-LSTM) network to generate contextual representations and aspect representations as inputs to the WRGAT, and adopt an index selection method to keep the word-level dependencies consistent with the word-piece units of BERT. With the proposed BERT-WRGAT architecture, we achieve state-of-the-art performance on four ABSA datasets.

Semantic and Acoustic-Prosodic Entrainment of Dialogues in Service Scenarios

According to Communication Accommodation Theory, speakers dynamically adjust their communication behaviors, converging to or diverging from their interlocutors in order to diminish or increase social distance; this is called entrainment. Most studies have investigated the entrainment of interlocutors in terms of linguistic and paralinguistic features separately, but have paid less attention to the (dis)entrainment relation between paralinguistic and linguistic features. In this study, we employed BERT to extract the semantic similarities of turns within dialogues in service scenarios, and found semantic entrainment. We also found that (dis)entrainment policies were adopted between acoustic-prosodic (paralinguistic) and linguistic (semantic) features. These findings will contribute to a fuller understanding of the mechanism of entrainment in dialogue.

Improving Model Stability and Training Efficiency in Fast, High Quality Expressive Voice Conversion System

Voice conversion (VC) systems have made significant progress owing to advanced deep learning methods. Current research is concerned not only with high-quality and fast audio synthesis, but also with richer expressiveness. The most popular VC system is constructed from the concatenation of an automatic speech recognition module with a text-to-speech module (ASR-TTS). Yet this system suffers from recognition and pronunciation errors, and it also requires a large amount of data for a pre-trained ASR model. We propose an approach to improve the model stability and training efficiency of a VC system. Firstly, a data redundancy reduction method is used to balance the distribution of the vocabulary, so that uncommon words are not ignored during training. Secondly, by adding a connectionist temporal classification (CTC) loss, the word error rate (WER) of our system drops to 3.02%, which is 5.63 percentage points lower than that of the ASR-TTS system (8.65%), and the inference speed of our VC system (real-time rate 19.32) is much higher than that of the baseline system (real-time rate 2.24). Finally, emotional embedding is added to the pre-trained VC system to generate expressive speech conversion. The results show that after fine-tuning on the multi-emotional dataset, the system can achieve high-quality and expressive speech synthesis.
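The word error rate cited above is a standard metric: the word-level edit distance (substitutions, insertions, deletions) between the reference and hypothesis transcripts, divided by the reference length. A minimal sketch, illustrative rather than the authors' evaluation code:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(m + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,               # substitution / match
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[n][m] / n
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is reported as a rate rather than an accuracy.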

Facial Micro-Expression Recognition Based on Multi-Scale Temporal and Spatial Features

Micro-expression is a kind of facial activity with weak changes and short duration that can reflect people's true feelings. For micro-expression recognition, it is necessary not only to extract the spatial features of facial movement changes in the image, but also to consider the time-series information of the continuous image sequence. Thus, we propose a multiple-aggregation network to examine in detail the impact of local facial regions and temporal features on micro-expression recognition. It learns temporal and spatial features from the whole micro-expression video frame and combines the local region where the micro-expression mainly occurs with the global region. The spatial features of micro-expression frames are extracted by a 3D CNN, and the extracted video-sequence features are fed into an LSTM to process temporal features. Experiments on two public datasets, CASME-II and SAMM, show that our method achieves higher performance than several existing studies.

FER by Modeling the Conditional Independence between the Spatial Cues and the Spatial Attention Distributions

This paper presents a novel approach for FER. Spatial cues, for example the locations of face components such as the eyes and mouth, play an important role in guiding spatial attention for FER. Traditional approaches define the relations between the spatial cues and the spatial attention distributions with linear models. However, non-linear relations also exist between them, in which case the spatial cues and the spatial attention distributions can be conditionally independent. In this paper we model this conditional independence within a state-of-the-art attention framework for FER. We design the spatial cues as hyper-parameters that affect the metric for spatial attention calculation. We evaluate Global-Attention (no spatial cues), Local-Attention (spatial cues affect the attention distributions), and Self-Attention (spatial cues as hyper-parameters that affect the attention metric) as three different configurations. The experimental results show that Self-Attention achieves the best performance (68.5% on the FER2013 dataset and 49.8% on the EmotiW2017 dataset), improving accuracy by 2.8% (on FER2013) and 1% (on EmotiW2017) compared with Global-Attention. The experimental results support the idea that modeling non-linear relations between the spatial cues and the spatial attention distributions can improve FER performance.
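One simple way such a configuration can be realized is to let the spatial cues enter as an additive bias on the attention scores before the softmax. The sketch below is an illustrative assumption about the mechanism, not the paper's implementation; all function names are hypothetical:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_spatial_bias(Q, K, V, spatial_bias=None):
    """Scaled dot-product attention; an optional additive spatial bias
    (e.g., log-prior weights favoring face-component regions) modifies
    the attention metric before the softmax."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if spatial_bias is not None:
        scores = scores + spatial_bias  # spatial cues as hyper-parameters
    return softmax(scores) @ V
```

With a strong bias toward one region, the output collapses to that region's value vector, which is how a spatial prior can steer attention even when the learned query-key scores are uninformative.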

Efficient Gradient-Based Neural Architecture Search For End-to-End ASR

Neural architecture search (NAS) has been successfully applied to tasks like image classification and language modeling for finding efficient high-performance network architectures. In the ASR field, especially end-to-end ASR, related research is still in its infancy. In this work, we focus on applying NAS to the most popular manually designed model, the Conformer, and propose an efficient ASR model searching method that benefits from the natural advantage of differentiable architecture search (Darts) in reducing computational overheads. We fuse the Darts mutator and Conformer blocks to form a complete search space, within which a modified architecture called the Darts-Conformer cell is found automatically. The entire search process on the AISHELL-1 dataset costs only 0.7 GPU days. Replacing the Conformer encoder with a stack of the searched architecture, we obtain an end-to-end ASR model (named Darts-Conformer) that outperforms the Conformer baseline by 4.7% relative on the open-source AISHELL-1 dataset. Besides, we verify the transferability of the architecture searched on a small dataset to a larger 2k-hour dataset.
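The core of Darts's continuous relaxation is that each edge of the search cell computes a softmax-weighted mixture of all candidate operations, so the architecture weights can be trained by gradient descent; after search, only the highest-weighted operation per edge is kept. A minimal sketch (illustrative, not the paper's search code):

```python
import math

def mixed_op(x, ops, alphas):
    """Darts-style continuous relaxation: the edge output is a
    softmax-weighted sum of every candidate operation applied to x."""
    exps = [math.exp(a) for a in alphas]
    z = sum(exps)
    return sum((e / z) * op(x) for e, op in zip(exps, ops))

def derive_op(ops, alphas):
    """Discretization step: keep the operation with the largest
    architecture weight on this edge."""
    return ops[max(range(len(alphas)), key=lambda i: alphas[i])]
```

In a real search, the `alphas` are updated by backpropagation on validation loss alongside the network weights; the one-shot mixture is what avoids training every candidate architecture from scratch.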

Temporal Attentive Adversarial Domain Adaption for Cross Cultural Affect Recognition

Continuous affect recognition is becoming an increasingly attractive research topic; recent works mainly focus on modeling temporal dependencies and multi-modal fusion to boost performance. Despite recent improvements, cross-cultural affect recognition in videos is still not well explored. In this paper, we propose temporal attentive adversarial domain adaption for cross-cultural affect recognition. An LSTM is first used to encode a contextual representation for each frame. Then, a DNN-based regressor estimates the affective dimension (arousal or valence) and is optimized so that the encoded representation is emotion-discriminative. In addition, a DNN-based sequence-level culture classifier, which takes the fused representation of each frame as input, recognizes the culture of the input sequence and is optimized adversarially so that the encoded representation is culture-invariant. Since different frames in a video may not contribute equally to recognizing the culture, we propose to add a frame-level culture classifier, which adaptively and attentively assigns higher weighting scores to the frames that are important for recognizing the culture. The proposed method is evaluated on the benchmark dataset AVEC2019 CES. Our experimental results show that the proposed method improves on state-of-the-art methods, with the concordance correlation coefficient (CCC) reaching 0.576 for arousal and 0.472 for valence on the cross-cultural test set.
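The concordance correlation coefficient reported above is a standard metric for continuous affect prediction; it penalizes both poor correlation and shifts in mean or scale between the predicted and gold traces. It can be computed directly from its definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²); a minimal sketch:

```python
def ccc(x, y):
    """Concordance correlation coefficient between two equal-length
    sequences (e.g., predicted and gold-standard arousal traces)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

Unlike Pearson correlation, CCC reaches 1 only when the prediction matches the gold trace in trend, scale, and offset simultaneously, which is why it is the preferred metric in the AVEC challenges.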

Call For Help Detection In Emergent Situations Using Keyword Spotting And Paralinguistic Analysis

Nowadays, the safety of passengers in enclosed public spaces, such as elevators, is becoming more and more important. Though passengers can press the “SOS” button to call the remote safety guard, some passengers might lose the ability to stand up and press the button, or it may be inconvenient to do so in an emergency. Also, people's first reaction may be to call for help using their voice instead of pressing the mayday button. Thus, we believe a speech-based system is very useful in this scenario. This work proposes a system that uses keyword spotting and paralinguistic analysis to detect whether a passenger is calling for help in Mandarin, giving real-time feedback that might provide the passenger with timely help to prevent an accident. Unlike the standard keyword spotting task, which is to detect the pre-defined call-for-help keyword “jiu ming” in any scenario, we focus on detecting both the keyword and the paralinguistic state. The system is only triggered when the keyword and an emergency indicator such as shouting or screaming appear at the same time. To this end, we compare the performance of different methods and find that deep neural network-based small-footprint keyword spotting methods are effective and efficient for keyword spotting under emotional scenarios.

A Multimodal Dynamic Neural Network for Call for Help Recognition in Elevators

As elevator accidents do great damage to people's lives and property, immediate responses to emergency calls for help are necessary. In most emergency cases, passengers must use the “SOS” button to contact the remote safety guard. However, this method is unreliable when passengers lose the ability to move. To address this problem, we define a novel task of identifying real and fake calls for help in elevator scenes. Given that existing call-for-help data collected in elevators is limited, we collected and constructed an audiovisual dataset of real and fake categories dedicated to the proposed task. Moreover, we present a novel instance-modality-wise dynamic framework to efficiently use the information from each modality and make inferences. Experimental results show that our multimodal network improves performance on the call-for-help multimodal dataset by 2.66% (accuracy) and 1.25% (F1 score) with respect to the pure audio model. Besides, our method outperforms other methods on our dataset.

A Web-Based Longitudinal Mental Health Monitoring System

Current clinical assessments of depression rely heavily on questionnaires about patients' daily behavior, sleep, and mood over the past two weeks. However, the information obtained through the patient's review of the past two weeks' experience is neither timely nor objective. Moreover, once patients take their medicine at home, doctors have no way to monitor them and intervene in time. In this paper, we propose and implement a web-based longitudinal mental health monitoring system. On the user end, patients can report their daily information through ecological momentary assessment (EMA), share their emotions in speech or face video, test their depression severity through the PHQ-9 questionnaire or through face videos recorded during a semi-structured interview, and check their recent history of activity, sleep, emotion log, and depression severity. The server end implements emotion recognition and depression estimation with pre-trained deep learning models. On the doctor end, the doctor can manage the information of all the patients under their supervision, monitor the patients' recent status, and edit their depression severity after clinical diagnosis.

TeNC: Low Bit-Rate Speech Coding with VQ-VAE and GAN

Speech coding aims at compressing digital speech signals with fewer bits and reconstructing them back into raw signals while maintaining speech quality as much as possible. However, conventional codecs usually need a high bit-rate to achieve reconstructed speech of reasonably high quality. In this paper, we propose an end-to-end neural generative codec with a VQ-VAE based auto-encoder and a generative adversarial network (GAN), which achieves high-fidelity reconstructed speech at a low bit-rate of about 2 kb/s. The compression process is carried out by a down-sampling module in the encoder and a learnable discrete codebook. The GAN is used to further improve reconstruction quality. Our experiments confirm the effectiveness of the proposed model in both objective and subjective tests; it significantly outperforms conventional codecs at low bit-rates in terms of speech quality and speaker similarity.
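The discrete bottleneck of a VQ-VAE is what makes such a codec transmit so few bits: each encoder output frame is replaced by the index of its nearest codebook entry, and only indices are sent. A minimal sketch (illustrative, not the paper's model):

```python
def quantize(frame, codebook):
    """Nearest-codebook-entry lookup (the discrete bottleneck of a
    VQ-VAE). Returns the index the codec would transmit; the decoder
    reconstructs from codebook[index]."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sqdist(frame, codebook[i]))
```

The bit-rate then follows from the codebook size and frame rate: with, for example, a 1024-entry codebook at 200 frames per second, the index stream costs 10 bits × 200 = 2,000 bits/s, roughly the 2 kb/s regime the abstract mentions (these particular numbers are illustrative assumptions, not the paper's configuration).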

Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder

Generating high-quality singing voice usually depends on a sizable studio-level singing corpus, which is difficult and expensive to collect. In contrast, plenty of singing voice data can be found on the Internet. However, the found singing data may be mixed with accompaniments or contaminated by environmental noise due to recording conditions. In this paper, we propose a noise-robust singing voice synthesizer which incorporates a Gaussian Mixture Variational Autoencoder (GMVAE) as the noise encoder to handle different noise conditions, generating clean singing voice from lyrics for a target speaker. Specifically, the proposed synthesizer learns a multi-modal latent noise representation of various noise conditions in a continuous space, without the use of an auxiliary noise classifier for noise-representation learning or clean reference audio during the inference stage. Experiments show that the proposed synthesizer can generate clean and high-quality singing voice for a target speaker, with MOS close to singing voice reconstructed from the ground-truth mel-spectrogram with the Griffin-Lim vocoder. Experiments also show the robustness of our approach under complex noise conditions.

SESSION: CATS'21 Workshop

A Systematic Review on Dyadic Conversation Visualizations

As many services are provided through text and voice systems, including voice calls over the internet, messaging, and emails, there is a growing need for both individuals and organizations to understand these conversations better and find actionable insights to improve social skills. Effective visualizations that provide a lucid account of the conversation allow the user to explore such insights. In this paper, we present a systematic survey of the various methods of visualizing a conversation and research papers involving interactive visualizations and human participants. Findings from the survey show that there have been attempts to visualize most, if not all, of the types of conversation that are taking place digitally – from speech to messages and emails. Through this survey, we make two contributions. One, we summarize the current practices in the domain of visualizing dyadic conversations. Two, we provide suggestions for future dialogue visualization research.

An Opportunity to Investigate the Role of Specific Nonverbal Cues and First Impression in Interviews using Deepfake Based Controlled Video Generation

The study of nonverbal cues in a dyadic interaction, such as a job interview, mostly relies on videos and does not make it possible to disentangle the role of specific cues. It is thus not clear whether, for instance, an interviewee who smiles while listening to an interviewer would be perceived more favorably than an interviewee who only gazes at the interviewer. While a similar analysis in naturalistic situations requires careful curation of interview recordings, it still cannot disentangle the effect of specific nonverbal cues on first impressions. Deepfake technology provides an opportunity to address this challenge by creating highly standardized videos of interviewees manifesting a determined behavior (i.e., a combination of specific nonverbal cues). Accordingly, we created a set of deepfake videos enabling us to manipulate the occurrence of three classes of nonverbal attributes (i.e., eye contact, nodding, and smiling). The deepfake videos showed interviewees manifesting one of four behaviors while listening to the interviewer: eye contact with smiling and nodding, eye contact with only nodding, eye contact alone, and looking distracted. We then tested whether these combinations of nonverbal cues influenced how the interviewees were perceived with respect to personality, confidence, and hireability. Our work reveals the potential of using deepfake technology to generate behaviorally controlled videos, useful for psychology experiments.

Making Automatic Movement Features Extraction Suitable for Non-engineer Students

Analysis of movement expression is a multidisciplinary research domain that draws on contributions from a wide variety of research fields, ranging from biomedical, computer-science, and robotics engineering, through psychology, to dance and the performing arts. This is why tools need to be accessible to students and researchers with a non-technical background, to foster their insights and facilitate their contributions. Since corpora are usually multi-modal, when they come as motion capture (MoCap) data they can be quite difficult for people with a non-technical background to analyze and annotate. Therefore, the present work presents the prototype of a software tool that collects a library of algorithms to process raw MoCap data. The tool allows the user to extract movement features through an easy workflow, interacting with a user-friendly graphical user interface (GUI). The GUI's usability was preliminarily user-tested with participants having different expertise in human-movement feature extraction. During the execution of three tasks, users' attitudes were collected to assess the GUI's ease of use; we found that it is perceived as a useful tool, but requires some basic prior knowledge to be fully understood.

ChiCo: A Multimodal Corpus for the Study of Child Conversation

The study of how children develop their conversational skills is an important scientific frontier at the crossroads of social, cognitive, and linguistic development, with important applications in health, education, and child-oriented AI. While recent advances in machine learning techniques allow us to develop formal theories of conversational development in real-life contexts, progress has been slowed by the lack of corpora that both approximate naturalistic interaction and provide clear access to children's non-verbal behavior in face-to-face conversations. This work is an effort to fill this gap. We introduce ChiCo (for Child Conversation), a corpus we built using an online video chat system. Using a weakly structured task (a word-guessing game), we recorded 20 conversations involving either children in middle childhood (i.e., 6 to 12 years old) interacting with their caregivers (the condition of interest) or the same caregivers interacting with other adults (a control condition), resulting in 40 individual recordings. Our annotation of these videos has shown that the frequency of children's use of gaze, gesture, and facial expressions mirrors that of adults. Future modeling research can capitalize on this rich behavioral data to study how both verbal and non-verbal cues contribute to the development of conversational coordination.

IdlePose: A Dataset of Spontaneous Idle Motions

When animating and giving life to a virtual character, it is important to consider the character's idling behaviours as well. Like any other animations, these could be recorded and handcrafted, or they could be generated by a motion model. Such models are theoretically capable of producing and simulating variable motions automatically, alleviating the work of animators, who can then focus on more expressive behaviours. While there is growing interest in data-driven motion models, recording enough spontaneous human motion and behaviour to train them is challenging: the setting in which data recording takes place is usually unnatural for the participants. In this paper, we present a data collection whose protocol was designed to elicit and capture natural, spontaneous human idle motions. The protocol works by hiding the true intent of the data collection from the participants in order to make them genuinely wait. The dataset we collected using this protocol is also presented and made available to the research community.

Setting Up a Health-related Quality of Life Vocabulary

Quality of life (QoL) aspects of health have been gaining increasing attention in recent years. Despite their essential contribution to a person’s overall health, QoL is still less well understood than other health conditions. While text mining approaches exist to extract and annotate verbal behaviours and comments with regard to specific health conditions, the same is not true for Perceived Health Impacts related to Quality of Life (H-QoL). This paper explores the usefulness and potential of the World Health Organisation Quality of Life Instrument (WHOQoL-100) as a step towards the creation of a Health-related Quality of Life Vocabulary. In doing so, this study validates a vocabulary of 15 concepts based on the WHOQoL-100 assessment instrument with six medical professionals and contributes a curated dictionary of 333 terms.

A Development of a Multimodal Behavior Analysis System for Evaluating Dementia Care Interaction

People with dementia may benefit from appropriate care, and understanding care interactions provides meaningful insight into social communication skills. The purpose of this paper is to construct an annotation scheme that represents expert knowledge and to verify whether it supports the evaluation of care interactions. Focusing on the Humanitude dementia care method, we have designed an annotation scheme and annotation structure, and developed a multimodal behavior analysis system to analyze care interactions between caregivers and people with dementia. The strength of our system is that it can generate a deep interpretation of care without experts. We analyzed video data collected at a hospital and extracted features of skills and care interactions. These results form part of an empirical analysis of how human relationships are built.

SESSION: EIR'21 Workshop

When a Voice Assistant Asks for Feedback: An Empirical Study on Customer Experience with A/B Testing and Causal Inference Methods

Intelligent Voice Assistant (IVA) systems, such as Alexa, Google Assistant, and Siri, allow us to interact with them using just voice commands. IVA systems can seek voice feedback directly from customers right after an interaction, simply by asking a question such as “did that answer your question?”. We refer to this IVA-elicited feedback as crowdsourced voice feedback (CVF). In this paper, we seek to understand the customer experience (CX) during interactions with an IVA that explicitly asks for feedback. We attempt to quantify the CX of providing feedback, identify its driving factors, and offer insights into improving CX based on the drivers identified. With an A/B test, we collected data from a leading IVA system and found that feedback elicitations did not impair CX in general. To identify drivers of CX, we performed causal inference with Double Machine Learning, which teases apart multiple confounding factors and avoids the CX risks of experimenting on certain variables. We identified multiple CX drivers, including elicitation timing and frequency, which can be useful in establishing guardrails for a CVF system. Our results point to opportunities for CVF systems, and we suggest design specifics that can be leveraged for such feedback collection mechanisms.

Uncertainties based queries for Interactive policy learning with evaluations and corrections

SESSION: GENEA'21 Workshop

Probabilistic Human-like Gesture Synthesis from Speech using GRU-based WGAN

Gestures are crucial for increasing the human-likeness of agents and robots and for achieving smoother interactions with humans. An effective system for modeling human gestures matched to speech utterances is therefore needed for embedding in these agents. In this work, we propose a GRU-based autoregressive model for gesture generation, trained against a CNN-based discriminator in an adversarial manner using a WGAN-based learning algorithm. The model is trained to output the rotation angles of the upper-body joints and is implemented to animate a CG avatar. The motions synthesized by the proposed system are evaluated via an objective measure and a subjective experiment, showing that the proposed model outperforms a baseline trained with a state-of-the-art GAN-based algorithm on the same dataset. This result reveals that it is essential to develop a stable and robust learning algorithm for training gesture generation models. Our code can be found at https://github.com/wubowen416/gesture-generation.

Crossmodal Clustered Contrastive Learning: Grounding of Spoken Language to Gesture

Crossmodal grounding is a key technical challenge when generating relevant and well-timed gestures from spoken language. Often, the same gesture can accompany semantically different spoken language phrases, which makes crossmodal grounding especially challenging. For example, a gesture (semi-circular, with both hands) could co-occur with the semantically different phrases “entire bottom row” (referring to a physical point) and “molecules expand and decay” (referring to a scientific phenomenon). In this paper, we introduce a self-supervised approach to learn representations better suited to such many-to-one grounding relationships between spoken language and gestures. As part of this approach, we propose a new contrastive loss function, Crossmodal Cluster NCE, that guides the model to learn spoken language representations which are consistent with the similarities in the gesture space. This gesture-aware space can help us generate more relevant gestures given language as input. We demonstrate the effectiveness of our approach on a publicly available dataset through quantitative and qualitative evaluations. Our proposed methodology significantly outperforms prior approaches for gesture-language grounding. Link to code: https://github.com/dondongwon/CC_NCE_GENEA.
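The core idea behind such contrastive crossmodal objectives can be illustrated with a generic in-batch InfoNCE sketch. This is not the paper's Crossmodal Cluster NCE (which additionally treats phrases sharing a gesture cluster as positives); all array names here are invented:

```python
import numpy as np

def info_nce_loss(speech_emb, gesture_emb, temperature=0.1):
    """Generic InfoNCE over paired speech/gesture embeddings.

    Row i of each matrix forms a positive pair; all other rows act as
    in-batch negatives. Illustrative sketch only.
    """
    # L2-normalise so dot products are cosine similarities
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    g = gesture_emb / np.linalg.norm(gesture_emb, axis=1, keepdims=True)
    logits = s @ g.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # cross-entropy with the diagonal (matched pairs) as targets
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
pairs = rng.normal(size=(8, 16))
loss_matched = info_nce_loss(pairs, pairs)                     # perfectly aligned modalities
loss_random = info_nce_loss(pairs, rng.normal(size=(8, 16)))   # unrelated modalities
```

A well-grounded embedding space drives the matched-pair loss toward zero, while unrelated embeddings stay near the chance level of log N.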

Influence of Movement Energy and Affect Priming on the Perception of Virtual Characters’ Extroversion and Mood

Movement Energy (physical activeness in performing actions) and Affect Priming (prior exposure to information about someone’s mood and personality) might be two crucial factors that influence how we perceive someone. It is unclear whether these factors influence the perception of virtual characters in a way similar to what is observed during in-person interactions. This paper presents different configurations of Movement Energy for virtual characters and two studies on how these influence the perception of the characters’ personality, extroversion in particular, and mood. Moreover, the studies investigate how Affect Priming (Personality and Mood), as one form of contextual priming, influences this perception. The results indicate that characters with high Movement Energy are perceived as more extroverted and in a better mood, which corroborates existing research. Moreover, the results indicate that Personality and Mood Priming influence perception in different ways: characters primed as being in a positive mood are perceived as more extroverted, whereas characters primed as being introverted are perceived as being in a more positive mood.

SESSION: IGTD'21 Workshop

Belongingness and Satisfaction Recognition from Physiological Synchrony with A Group-Modulated Attentive BLSTM under Small-group Conversation

Physiological synchrony is a phenomenon of coupled physiological responses during face-to-face conversation. While many previous studies have proposed physiological synchrony measures between interlocutors in dyadic conversations, very few works compute physiological synchrony in small groups (three or more people). Moreover, belongingness and satisfaction are two critical factors in humans’ decisions about which group they want to stay in. In this preliminary work, we therefore investigate the relationship between physiological synchrony and belongingness/satisfaction in group conversation. We feed the physiology of group members into a learnable graph structure, with group-level physiological synchrony and heart-related features computed from photoplethysmography (PPG) signals. We then devise a Group-modulated Attentive Bi-directional Long Short-Term Memory (GGA-BLSTM) model to recognize three levels (low, middle, and high) of group belongingness and satisfaction. We evaluate the proposed method on our recently collected, previously unpublished multimodal group interaction corpus, NTUBA. The results show that (1) models trained jointly on group-level physiological synchrony and conventional heart-related features consistently outperform a model trained on the conventional features alone, and (2) the proposed model with a Graph-structure Group-modulated Attention mechanism (GGA), GGA-BLSTM, performs better than a strong baseline, the attentive BLSTM. GGA-BLSTM achieves a good unweighted average recall (UAR) of 73.3% on group satisfaction and 82.1% on group belongingness classification. In further analyses, we examine the relationships between physiological synchrony and group satisfaction/belongingness.
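A minimal sketch of a group-level synchrony measure, assuming synchrony is summarised as the mean pairwise Pearson correlation of members' signals (one common choice in the literature, not necessarily the measure used in the paper):

```python
import numpy as np
from itertools import combinations

def group_synchrony(signals):
    """Mean pairwise Pearson correlation across group members.

    `signals` is an (n_members, n_samples) array, e.g. heart-rate
    series derived from PPG. Illustrative sketch only.
    """
    corrs = [np.corrcoef(signals[i], signals[j])[0, 1]
             for i, j in combinations(range(len(signals)), 2)]
    return float(np.mean(corrs))

# Toy 3-person group: shared oscillation plus individual noise
t = np.linspace(0, 10, 200)
in_sync = np.stack([np.sin(t) + 0.1 * np.random.default_rng(k).normal(size=t.size)
                    for k in range(3)])
# Unrelated signals for comparison
out_of_sync = np.random.default_rng(42).normal(size=(3, t.size))
```

The synchronised group scores near 1, while independent signals hover around 0; such a scalar can then be fed alongside heart-related features into a classifier.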

Self-assessed Emotion Classification from Acoustic and Physiological Features within Small-group Conversation

Individual (personalized) recognition of self-assessed emotion has recently received more attention in areas such as Human-Centered Artificial Intelligence (AI). In most previous studies, researchers utilized the physiological changes and bodily reactions evoked by multimedia stimuli, e.g., video or music, to build a model for recognizing individuals’ emotions. However, this elicitation approach is less practical for human-human interaction because conversation is dynamic. In this paper, we first investigate the individual emotion recognition task in three-person small-group conversations. While predicting personalized emotions from physiological signals is well studied, few studies address categorical emotion classification (e.g., happiness and sadness); most prior works focus only on binary dimensional emotion recognition or regression, such as valence and arousal. We therefore formulate the task as individual-level emotion classification. In the proposed method, we consider the physiological changes in each individual’s body and the acoustic turn-taking dynamics during group conversations to predict individual emotions. We assume that a person’s emotional state may be affected by the expressive behaviors of other members during group conversations, and we hypothesize that people are more likely to feel specific emotions under a related emotional atmosphere. We therefore design an ad-hoc technique that simply sums the self-assessed emotional annotations of all group members into a group emotional atmosphere (climate) signal to help the model predict individuals’ emotions.
We propose a Multi-modal Multi-label Emotion based on Transformer BLSTM at Group Emotional Atmosphere Network (MMETBGEAN) that explicitly models individual changes and dynamic interaction via physiological and acoustic features during a group conversation, and integrates group emotional atmosphere information to recognize individuals’ multi-label emotions. We assess the proposed framework on our recently collected large Mandarin Chinese collective-task group database, NTUBA. The results show that the method outperforms existing approaches to multi-modal multi-label emotion classification on this database.

On the Sound of Successful Meetings: How Speech Prosody Predicts Meeting Performance

This paper investigates the degree to which meeting success can be predicted through holistic, acoustic-prosodic measurements. The analyzed meetings are taken from the Parking Lot Corpus, in which 70 groups of three to six students discuss the traffic situation at their university and come up with parking and transportation recommendations. The number, feasibility, and quality of these recommendations, as well as the mean effectiveness and satisfaction ratings across group members, provide the basis for correlations with three sets of 15 acoustic-prosodic features that cover pitch, duration/timing, intensity, and the absolute frequencies of local events such as silent pauses. Results show that meeting success is, in fact, considerably correlated with the overall “sound” of the individual meetings, with pitch features being the most diverse and powerful predictors. In addition, we found that the “sound” of subjectively effective meetings differs from the “sound” of objectively productive meetings, i.e., meetings that generate a high output of feasible and/or high-quality recommendations. The prosodic feature patterns suggest that effective meetings are short and matter-of-fact, whereas productive meetings are longer and have a lively speech melody that makes them stimulating. We discuss the implications of our findings for future research and technological innovation.

Get Together in the Middle-earth: a First Step Towards Hybrid Intelligence Systems

In the last decade, the number of computer systems using AI has increased dramatically; indeed, AI is now present in almost every aspect of everyday human life. This has led scholars in Computer Science to try to endow machines with human-like socio-cognitive skills and/or human-like embodiment in order to improve interactions. Such an approach, however, highlights several crucial issues related to the substantial differences between fine-grained human skills and what machines can do and learn: although they are expensive and sophisticated tools, machines tend to be “idiots savants”. Hybrid Intelligence (HI) aims to tackle this issue by proposing, as Akata and colleagues put it, “systems that operate as mixed teams, where humans and machines cooperate synergistically, proactively, and purposefully to achieve shared goals”. To our knowledge, however, HI is at a very early exploratory stage, and few concrete solutions exist. In this position paper we introduce and briefly describe “Middle-Earth”, a conceptual and experimental ground for studying HI. Moreover, we present a first prototype of a software platform based on immersive VR environments, on which we plan to carry out the first pioneering experiments on teams of humans and/or AI-driven agents getting together in Middle-Earth to perform collaborative tasks.

A Hitchhiker’s Guide towards Transactive Memory System Modeling in Small Group Interactions

Modeling a Transactive Memory System (TMS) over time is a current challenge in Human-Centered Computing. A TMS is a group’s meta-knowledge of “who knows what”. Conceiving and developing machines able to deal with TMS is a relevant step for Hybrid Intelligence, which aims at creating systems where human and artificial teammates cooperate in a synergistic fashion. Recently, a TMS dataset has been proposed in which a number of automated audio and visual features and manual annotations are extracted, taking inspiration from the Social Sciences literature. Is it possible, on top of these, to model the relationships between these engineered features and TMS scores? In this work we first build and discuss a processing pipeline; we then propose four possible classifiers, two of which are based on artificial neural networks. We observe that the largest obstacle to modeling the target relationships currently lies in the limited data available for training an automatic system. Our purpose with this work is to provide hints on how to avoid common pitfalls when training such systems to learn TMS scores from audio/visual features.

An Exploratory Computational Study on the Effect of Emergent Leadership on Social and Task Cohesion

Leadership is a complex and dynamic phenomenon that has received a lot of attention from psychologists over the last 50 years, primarily because of its relationship with team effectiveness and performance. Depending on the group (e.g., its size, the relationships among members) and the context (e.g., solving a task under pressure), various styles of leadership can emerge. These styles can either be formally assigned or manifest informally. Among the informal types of leadership, emergent leadership is one of the most studied. It is an emergent state that develops over time in a group and that interplays with other emergent states such as cohesion. Only a few computational studies on predicting emergent leadership take advantage of its relationships with other phenomena to improve model performance, and these approaches are limited to models that predict emergent leadership itself. To the best of our knowledge, no approach integrates emergent leadership into computational models of cohesion.

In this study, we take a first step towards bridging this gap by introducing two families of approaches, inspired by insights from the Social Sciences, for integrating emergent leadership into computational models of cohesion. The first family amplifies the differences between leaders’ and followers’ features, while the second adds a leadership representation directly into the computational model’s architecture. For each family, we describe two approaches applied to a Deep Neural Network model aimed at predicting the dynamics of cohesion across various tasks over time. This study explores whether and how applying our approaches improves the prediction of the dynamics of the Social and Task dimensions of cohesion. To this end, the performance of a computational model of cohesion that does not integrate the interplay between cohesion and emergent leadership is compared with the same model augmented with our approaches. Results show that approaches from both families significantly improved the prediction of Task cohesion dynamics, confirming the benefit of integrating emergent leadership, following Social Psychology’s insights, into computational models of cohesion at both the feature and architecture levels.

Clustering and Multimodal Analysis of Participants in Task-Based Discussions

Participants in task-based conversational interactions are clustered using outcomes of interest that include task performance, satisfaction ratings, and demographic traits. Each cluster is described in terms of the member participants’ common characteristics, and we perform participant outlier detection as well. We extract multimodal features of the conversational interaction and analyze how the participant groups differ in terms of these features.

Discovering Where We Excel: How Inclusive Turn-Taking in Conversation Improves Team Performance

In this paper, we examined how inclusive turn-taking in team conversation improves performance. Inclusive turn-taking is defined as a collective speaking pattern where different team members speak in succession. This stands in contrast to exclusive turn-taking, where individual members monopolize the speaking turns. We developed an algorithm to measure inclusive turn-taking in team dialogue. We theorized and tested the indirect effects of team inclusive turn-taking on performance via team skill use, and the moderation effects of team task strategy using a sample of 150 participants randomly assigned to three-person teams.
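A turn-taking inclusivity score in the spirit described above could be sketched as follows. The sliding-window definition and all names are hypothetical illustrations, not the authors' algorithm:

```python
def inclusive_turn_taking(turns, team_size, window=None):
    """Fraction of sliding windows of consecutive speaking turns in
    which every team member takes at least one turn.

    `turns` is the chronological list of speaker ids. A score of 1.0
    means members consistently speak in succession; monopolised
    conversations score near 0. Hypothetical measure for illustration.
    """
    w = window or team_size
    if len(turns) < w:
        return 0.0
    hits = sum(len(set(turns[i:i + w])) == team_size
               for i in range(len(turns) - w + 1))
    return hits / (len(turns) - w + 1)

# Three-person team, perfectly rotating turns vs. one member dominating
inclusive = inclusive_turn_taking(["A", "B", "C", "A", "B", "C"], 3)
exclusive = inclusive_turn_taking(["A", "A", "A", "A", "B", "A"], 3)
```

The two toy dialogues land at opposite ends of the scale, which is the contrast the paper's inclusive/exclusive distinction relies on.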

SESSION: MAAE'21 Workshop

Multimodal Assessment of Network Music Performance

The most common method of assessing the Quality of Musician’s Experience (QoME) in Network Music Performance (NMP) is a subjective study in which participants evaluate their experience via questionnaires. Translating experiences into metrics is not an exact science, though: in our recent study on the effects of audio delay and quality on the QoME of NMP, the responses had high variance and were inconsistent. To strengthen our confidence in the results of the subjective study, we analyzed video recordings of the participants using machine learning. Specifically, we used Facial Expression Recognition (FER) to detect the emotions felt by the participants and then compared them with their questionnaire responses. In addition to pointing out interesting phenomena that were not apparent from the questionnaires, this multimodal analysis showed analogies between the emotions felt (as captured by FER) and the emotions expressed (as captured by the responses).

When Emotions are Triggered by Single Musical Notes: Revealing the Underlying Factors of Auditory-Emotion Associations

Can emotion be experienced when the auditory sense is stimulated by a single musical note (Q1), and do variables such as musical skills, age, and personality traits influence auditory-emotion associations (Q2)? An experiment was conducted in which 130 participants were asked to listen to single musical notes and rate their experienced emotional state. They also rated their musical proficiency, sound sensitivity, and strongest learning style, and completed a reduced version of the Big-Five personality test (BFI-10). Results regarding Q1 show a correlation between lower notes and sadness, and between higher notes and joy, confirming previous auditory-emotion association research while offering new insight into how emotion associates with single musical notes. Results regarding Q2 show that musical proficiency (low vs. high), learning style (aural vs. physical), and personality (level of Conscientiousness) had an effect on how participants emotionally experienced single musical notes. The results presented in this study provide a starting point for a new auditory-visual framework that uses an understanding of emotion, personality, and other variables in the development of more personalised human-computer interfaces. Such a framework could be used in applications that help people learn to paint or play an instrument, promote positive mental health, or explore new forms of creative expression, e.g., writing a song with a paintbrush as the instrument or painting a picture with a piano as the brush.

ArtBeat – Deep Convolutional Networks for Emotional Inference to Enhance Art with Music

Paintings and music are two universal forms of art that are present across all cultures and times in human history. In this paper, we present ArtBeat, a machine learning application to connect the two. Not only are these two art forms universal, but they are also deeply emotionally charged. This emotional factor is what we use as a bridge between the mediums. Using a Convolutional Neural Network (CNN), we aimed to create a model that can classify the emotions evoked by a painting, and use the predicted values to pair it with a piece of music to complement the viewing experience. Our system uses a pre-trained Wide ResNet model as a base, which we then fine-tuned. In this paper, we describe the design and implementation of this model as well as report its results and analyze its behaviour.

SESSION: MSECP'21 Workshop

Clustering of Physiological Signals by Emotional State, Race, and Sex

In this work, we explore the emotional responses to ten stimuli reflecting real-world experiences, captured via physiological signals from 140 individuals. We apply the DBSCAN clustering algorithm to these data and show that blood pressure and electrodermal activity may be indicative of race, and blood pressure of sex and emotional state. These findings could lead to important innovations, particularly ones valuable for certain demographic groups, including, for example, culturally relevant robotics and cultural awareness in education through improved real-time measurement of stress and cognitive load.
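For reference, DBSCAN's density-based grouping can be sketched in a deliberately simplified reimplementation (illustration only; a study like this one would normally use a tuned library implementation such as scikit-learn's):

```python
import numpy as np

def dbscan(X, eps=0.5, min_samples=5):
    """Minimal DBSCAN: returns cluster ids (0, 1, ...) per point, or -1
    for noise. Simplified for illustration."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        # skip already-assigned points and non-core points
        if labels[i] != -1 or len(neighbors[i]) < min_samples:
            continue
        # grow a new cluster from core point i
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_samples:  # j is also core
                    queue.extend(neighbors[j])
        cluster += 1
    return labels

# Two tight, well-separated blobs standing in for two response profiles
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
labels = dbscan(np.vstack([blob_a, blob_b]), eps=0.5, min_samples=5)
```

Unlike k-means, the number of clusters is not fixed in advance, and low-density points are marked as noise, which suits exploratory analysis of physiological responses.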

Addressing Data Scarcity in Multimodal User State Recognition by Combining Semi-Supervised and Supervised Learning

Detecting the mental states of human users is crucial for the development of cooperative and intelligent robots, as it enables the robot to understand the user’s intentions and desires. Despite their importance, it is difficult to obtain large amounts of high-quality data for training automatic recognition algorithms, as the time and effort required to collect and label such data are prohibitively high. In this paper we present a multimodal machine learning approach for detecting dis-/agreement and confusion states in a human-robot interaction environment, using just a small amount of manually annotated data. We collect a data set by conducting a human-robot interaction study and develop a novel preprocessing pipeline for our machine learning approach. By combining semi-supervised and supervised architectures, we are able to achieve an average F1-score of 81.1% for dis-/agreement detection with a small amount of labeled data and a large unlabeled data set, while simultaneously increasing the robustness of the model compared to the supervised approach.

Meta-Learning for Emotion Prediction from EEG while Listening to Music

We are working toward an emotion induction system that generates music based on emotions predicted in real time from electroencephalography (EEG). Since there are individual differences in EEG while listening to music, a model trained on a single participant’s data is expected to provide highly accurate emotion prediction. However, time-consuming EEG recording is required to avoid data shortage, and we need to reduce the recording time to minimize the burden on participants. Therefore, we train a model that accounts for the individuality of EEG from multiple participants’ data and fine-tune it on a small amount of data from a single target participant. In this paper, we propose a method using meta-learning for pre-training. We compared three methods: two using multiple participants’ data (with and without meta-learning) and one using a single participant’s data. Our proposed method obtained the lowest RMSE (valence: 0.244 and arousal: 0.287). We demonstrate the effectiveness of using meta-learning to train an emotion prediction model, a necessary step toward constructing the emotion induction system.
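The pre-train-then-fine-tune idea can be illustrated with a toy first-order meta-learning (Reptile-style) sketch on scalar regression tasks; the task family and all parameters are invented stand-ins for per-participant EEG models, not the paper's method:

```python
import numpy as np

def sgd_steps(w, a, xs, lr=0.1, steps=5):
    """A few gradient steps on the task 'predict y = a*x' (squared loss)."""
    for _ in range(steps):
        grad = np.mean(2 * (w * xs - a * xs) * xs)
        w -= lr * grad
    return w

def reptile(tasks, meta_lr=0.5, meta_iters=200, seed=0):
    """First-order meta-learning: nudge the meta-weights toward each
    task-adapted solution, so that a few steps suffice on a new task."""
    rng = np.random.default_rng(seed)
    xs = np.linspace(-1, 1, 20)
    w = 0.0
    for _ in range(meta_iters):
        a = rng.choice(tasks)          # sample a "participant" task
        w_task = sgd_steps(w, a, xs)   # inner-loop adaptation
        w += meta_lr * (w_task - w)    # Reptile meta-update
    return w

participants = [1.5, 2.0, 2.5]         # hypothetical per-participant slopes
w_meta = reptile(participants)

# Fine-tune on a new target participant (slope 2.1) with few steps,
# from the meta-initialisation vs. from scratch
xs = np.linspace(-1, 1, 20)
w_tuned = sgd_steps(w_meta, 2.1, xs)
w_scratch = sgd_steps(0.0, 2.1, xs)
```

With the same small fine-tuning budget, the meta-initialised model lands much closer to the new task's solution than training from scratch, which mirrors the paper's motivation of shortening per-participant EEG recording.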

Towards Reliable Multimodal Stress Detection under Distribution Shift

The recognition of stress is an important issue from a health care perspective as well as in the human-computer interaction context. With the help of multimodal sensors, stress can be detected relatively well under laboratory conditions. However, when models are used in the real world, shifts in the data distribution can occur, often leading to performance degradation. It is therefore desirable that models in these scenarios are at least able to accurately capture this uncertainty and thus know what they do not know. This work aims to investigate how synthetic shifts in the data distribution can affect the reliability of a multimodal stress detection model in terms of calibration and uncertainty quantification. We compare a baseline with three known approaches that aim to improve reliability of uncertainty estimates. Our results show that all methods we tested improve the calibration. However, calibration generally deteriorates and spreads with stronger shifts for all approaches. They perform especially poorly for shifts in highly relevant modalities. Overall, we conclude that in the conducted experiments the investigated methods are not sufficiently reliable under distribution shifts.
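One standard way to quantify the calibration discussed above is the Expected Calibration Error (ECE) in its common equal-width-bin formulation; a minimal sketch follows (the paper may use a different calibration metric, and the toy numbers are invented):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted mean |accuracy - confidence| over equal-width
    confidence bins. A calibrated model has ECE near 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# Calibrated toy model: confidence 0.75, observed accuracy 3/4
ece_good = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
# Overconfident model: confidence 0.95 but accuracy only 1/2
ece_bad = expected_calibration_error([0.95] * 4, [1, 0, 1, 0])
```

Under distribution shift, confidences typically stay high while accuracy drops, which is exactly the growing confidence-accuracy gap that this metric exposes.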

Mindscape: Transforming Multimodal Physiological Signals into an Application Specific Reference Frame

In order to effectively generate actionable user insights from biometric data, a deep understanding of the psychophysiological processes involved is required. However, despite a few notable commercial exceptions, psychophysiology remains primarily an academic discipline. Isolating the video game sector as a case study, this work touches on some of the factors that hold back the adoption of psychophysiology in industry and introduces an unsupervised approach that aims to facilitate the adoption of multimodal physiological data in product development and decision making.

Neuromuscular Performance and Injury Risk Assessment Using Fusion of Multimodal Biophysical and Cognitive Data: In-field Athletic Performance and Injury Risk Assessment

Athletes rely on the rationally bounded decisions of coaches and sports physicians to optimize performance, improve well-being, and reduce the risk of injuries. These decisions are subjective or require costly tests that are not necessarily predictive of in-game performance or of injury risk. This paper presents an approach to remedy this shortcoming by providing coaches and sports medicine teams with reliable tools for objective, quantitative assessment of in-field performance and risk of injury. The proposed method uses advanced physiological signal processing, data-driven modeling, and multi-modal data fusion techniques applied to data recorded from unobtrusive wearable sensors in tasks and conditions that closely resemble those observed in the field during training or even a game. We postulate that the required data for this prediction task include joint kinematics from inertial measurement units or accelerometers, muscle surface electromyography, ground reaction force, electrocardiography, heart rate and heart rate variability, oxygen saturation, respiration rate, and pupillometry data. The required analysis methods include physiological signal processing, feature extraction, and data-driven modeling techniques to estimate neuromuscular properties, identify joint and leg stiffness, and assess cognitive performance from pupillometry and heart rate variability.

Towards Human-in-the-Loop Autonomous Multi-Robot Operations

Rapid advances in artificial intelligence are driving applications of robotics and automation in transport and logistics, providing new solutions to highway systems, passenger transport, last-mile delivery, and automated warehouses. Because the environment is dynamic and not entirely knowable, human supervision will be needed for the foreseeable future to solve unexpected problems. Research in other domains suggests that human-robot teams may even offer the optimal solution, as the relative strengths of human and artificial intelligence can be combined. However, realizing human-robot complementarity is challenging because sophisticated AI techniques lead to increasingly opaque robot control programs, and operators’ cognitive capacities can be exceeded as the sizes of autonomous fleets grow. We present our vision and initial results for a framework for human supervisory control and human robot teamwork for automated multi-vehicle systems.

SESSION: SIAIH'21 Workshop

Listen to the Real Experts: Detecting Need of Caregiver Response in a NICU using Multimodal Monitoring Signals

Vital signs are used in Neonatal Intensive Care Units (NICUs) to monitor the state of multiple patients at once. Alarms are triggered if a vital sign falls below or rises above a predefined threshold. Numerous alarms sound each hour, which can overload the medical team, a phenomenon known as alarm fatigue. Yet many of these alarms do not require immediate clinical action from the caregivers.

In this paper, we automatically detect moments that need an immediate response (i.e., interaction with the patient) from the medical team in NICUs by using caregiver response to the patient, which is based on the interpretation of vital signs and of nonverbal cues (e.g., movements) delivered by patients. The ultimate goal of such an approach is to reduce the overload of alarms while maintaining patient safety.

We use features extracted from the electrocardiogram (ECG) and pulse oximetry (SpO2) sensors of the patient, as most unplanned interactions between patient and caregivers are due to deteriorations. Since in our unit an alarm can only be paused or silenced manually at the bedside, we used this information as a prior for caregiver response. We also propose different labeling schemes for classification, each representative of a possible interaction scenario within the nature of our problem.

We achieved general detection of caregiver response with a mean AUC of 0.82. We also show that when trained only with stable and truly deteriorating (critical state) samples, the classifiers can better learn the difference between alarms that need no immediate response and those that do. In addition, we present an analysis of the posterior probabilities over time for different labeling schemes, and use it to speculate about the reasons behind some failure cases.
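The reported mean AUC of 0.82 can be read as the probability that a randomly chosen alarm needing a response is scored higher than one that does not. As a minimal sketch (the labels and scores below are invented for illustration, not data from the study), AUC can be computed directly from this pairwise-ranking identity:

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive-vs-negative pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier outputs: 1 = alarm needed a response.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.4, 0.3, 0.6, 0.8]
print(round(auc(labels, scores), 3))  # → 0.778
```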

Non-Verbal behaviors analysis of healthcare professionals engaged with a Virtual-Patient

Virtual Patients (VPs) are currently being developed to train healthcare professionals in several domains. In this paper, we specifically explore the non-verbal behaviors of healthcare professionals engaged in an interaction with a VP that displays a neurodegenerative disease. The main motivation is to contribute to the training of healthcare professionals with a focus on non-verbal behaviors, which are known to play an important role in patient-caregiver interaction. Our paper presents the VirtuAlZ corpus, a video corpus of 29 professional caregivers interacting with a VP. Based on the literature and exploratory studies, we developed an architecture able to perceive a list of non-verbal signals, which are then transformed into discrete symbols. An N-gram-based approach is then exploited to model, analyze, and compare healthcare professionals' strategies. In particular, we report an analysis of the work-experience context, and we cluster the participants in order to understand the different patterns of behavior present in our corpus.
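The N-gram modeling step over discrete behavior symbols can be sketched as follows; the symbol inventory here is invented for illustration and is not the one used for the VirtuAlZ corpus:

```python
from collections import Counter

def ngrams(symbols, n=2):
    """Extract overlapping n-grams from a sequence of discrete behavior symbols."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]

# Hypothetical symbol stream (e.g. G = gaze-at-patient, S = smile, N = nod);
# per-participant n-gram counts can then be compared or clustered.
stream = ["G", "S", "N", "G", "S", "G", "N"]
bigram_counts = Counter(ngrams(stream, 2))
print(bigram_counts.most_common(2))
```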

Computational Measurement of Motor Imitation and Imitative Learning Differences in Autism Spectrum Disorder

Motor imitation is a critical developmental skill area that has been strongly and specifically linked to autism spectrum disorder (ASD). However, methodological variability across studies has precluded a clear understanding of the extent and impact of imitation differences in ASD, underscoring a need for more automated, granular measurement approaches that offer greater precision and consistency. In this paper, we investigate the utility of a novel motor imitation measurement approach for accurately differentiating between youth with ASD and typically developing (TD) youth. Findings indicate that youth with ASD imitate body movements significantly differently from TD youth upon repeated administration of a brief, simple task, and that a classifier based on body coordination features derived from this task can differentiate between autistic and TD youth with 82% accuracy. Our method illustrates that group differences are driven not only by interpersonal coordination with the imitated video stimulus, but also by intrapersonal coordination. Comparison of 2D and 3D tracking shows that both approaches achieve the same classification accuracy of 82%, which is highly promising with regard to scalability for larger samples and a range of non-laboratory settings. This work adds to a rapidly growing literature highlighting the promise of computational behavior analysis for detecting and characterizing motor differences in ASD and identifying potential motor biomarkers.
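The abstract does not specify the exact coordination features; one simple stand-in for an inter- or intrapersonal coordination score (an assumption for illustration, not the authors' measure) is the mean windowed Pearson correlation between two movement time series:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length movement time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def windowed_coordination(x, y, win=4):
    """Mean correlation over sliding windows -- a toy coordination score."""
    rs = [pearson(x[i:i + win], y[i:i + win]) for i in range(len(x) - win + 1)]
    return sum(rs) / len(rs)
```

Windows with constant (zero-variance) movement would need to be skipped in practice; the sketch assumes both signals vary in every window.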

Differentiating Surgeons’ Expertise solely by Eye Movement Features

Medical schools are increasingly seeking to use objective measures to assess surgical skills. This extends even to perceptual skills, which are particularly important in minimally invasive surgery. Eye tracking provides a promising approach to obtaining such objective metrics of visual perception. In this work, we report on results of a cadaveric study of visual perception during shoulder arthroscopy. We present a model for classifying surgeons into three levels of expertise using only eye movements. The model achieves a classification accuracy of 84.44% using only a small set of selected features. We also examine and characterize the changes in visual perception metrics between the different levels of expertise, forming a basis for development of a system for objective assessment.

SESSION: SAMIH'21 Workshop

Social Robots to Support Gestural Development in Children with Autism Spectrum Disorder

Children with Autism Spectrum Disorder (ASD) are characterized by impairments in communication and social skills, including problems in understanding and producing gestures. Using the approach of robot-based imitation games, in this paper we propose the prototype of an imitation game that aims at improving the non-verbal communication skills, gestures in particular, of children with ASD. Starting from an application that we developed in another domain, the social inclusion of migrant children, we use a social robot to teach them to recognize and produce social gestures through an imitation game. To allow the robot to recognize gestures, we trained an LSTM-based model using MediaPipe for the analysis of hand positions and landmarks. The model was trained on six selected gestures to recognize their patterns. The module is then used by the robot in the game. Results from the software-accuracy point of view are encouraging and show that the proposed approach is suitable for the purpose of showing and recognizing predefined gestures; however, we are aware that it might not work properly in the wild with children with ASD. For this reason, in the near future we will perform a study aiming at assessing the efficacy of the approach with these children and revise the model and the game accordingly.

A Framework for the Assessment and Training of Collaborative Problem-Solving Social Skills

In this article, we describe a new experimental protocol for collecting social interactions, together with the scales selected to annotate them. Three collaborative games were defined to support the study of social interaction during Collaborative Problem Solving. Three dyads of participants were recorded while solving these three collaborative games via a video-conferencing system. We explain how the collected behaviors and social interactions were annotated using two scales and three human raters. The results indicate moderate to excellent reliability of these scales. We intend to extend the resulting corpus by recruiting more subjects and to use it to explore the relations between attention and social interactions, as well as to inspire the design and validation of virtual characters for social skills training.

BERT meets LIWC: Exploring State-of-the-Art Language Models for Predicting Communication Behavior in Couples’ Conflict Interactions

Many processes in psychology are complex, such as dyadic interactions between two interacting partners (e.g., patient-therapist, intimate relationship partners). Nevertheless, many basic questions about interactions are difficult to investigate because dyadic processes can occur within a person and between partners, are based on multimodal aspects of behavior, and unfold rapidly. Current analyses are mainly based on the behavioral coding method, whereby human coders annotate behavior based on a coding schema. But coding is labor-intensive, expensive, slow, focuses on few modalities, and produces sparse data, which has forced the field to use average behaviors across entire interactions, thereby undermining the ability to study processes on a fine-grained scale. Current approaches in psychology use LIWC for analyzing couples’ interactions. However, advances in natural language processing such as BERT could enable the development of systems to potentially automate behavioral coding, which in turn could substantially improve psychological research. In this work, we train machine learning models to automatically predict positive and negative communication behavioral codes of 368 German-speaking Swiss couples during an 8-minute conflict interaction on a fine-grained scale (10-second sequences) using linguistic features and paralinguistic features derived with openSMILE. Our results show that both simpler TF-IDF features and more complex BERT features performed better than LIWC, and that adding paralinguistic features did not improve performance. These results suggest it might be time to consider modern alternatives to LIWC, the de facto standard for linguistic features in psychology, for prediction tasks in couples research. This work is a further step towards the automated coding of couples’ behavior, which could enhance couple research and therapy and be utilized for other dyadic interactions as well.
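As a toy illustration of the simpler TF-IDF features mentioned above (the example utterances are invented, not drawn from the couples corpus, and a production pipeline would use a smoothed IDF):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Toy TF-IDF: raw term frequency times unsmoothed inverse document frequency."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

# Hypothetical tokenized 10-second utterance snippets.
docs = [["you", "never", "listen"], ["i", "always", "listen"], ["you", "always", "do"]]
vecs = tf_idf(docs)
```

Terms that occur in every document get weight zero; rare, document-specific terms ("never") get the highest weight, which is the intuition behind using TF-IDF as a baseline against BERT embeddings.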

“You made me feel this way”: Investigating Partners’ Influence in Predicting Emotions in Couples’ Conflict Interactions using Speech Data

How romantic partners interact with each other during a conflict influences how they feel at the end of the interaction and is predictive of whether the partners stay together in the long term. Hence, understanding the emotions of each partner is important. Yet current approaches, such as self-reports, are burdensome and hence limit the frequency of data collection. Automatic emotion prediction could address this challenge. Insights from psychology research indicate that partners’ behaviors influence each other’s emotions in conflict interactions; hence, the behavior of both partners could be considered to better predict each partner’s emotion. However, it has yet to be investigated how doing so compares, in terms of emotion prediction performance, to using only each partner’s own behavior. In this work, we used BERT to extract linguistic features (i.e., what partners said) and openSMILE to extract paralinguistic features (i.e., how they said it) from a data set of 368 German-speaking Swiss couples (N = 736 individuals) who were videotaped during an 8-minute conflict interaction in the laboratory. Based on those features, we trained machine learning models to predict whether partners feel positive or negative after the conflict interaction. Our results show that including the behavior of the other partner improves the prediction performance. Furthermore, for men, considering how their female partner spoke is most important, and for women, considering what their male partner said is most important for better prediction performance. This work is a step towards automatically recognizing each partner’s emotion based on the behavior of both, which would enable a better understanding of couples in research, therapy, and the real world.

Multimodal Dataset of Social Skills Training in Natural Conversational Setting

Social Skills Training (SST) is commonly used in psychiatric rehabilitation programs to improve social skills. It is especially effective for people who have social difficulties related to mental illnesses or developmental difficulties. Previous studies revealed several communication characteristics in Schizophrenia and Autism Spectrum Disorder. However, few studies have been conducted in natural conversational environments with computational features, since automatic capture and analysis are difficult in natural settings. Even though natural data collection is difficult, such data clearly have much greater potential to reveal the real communication characteristics of people with mental difficulties and the interaction differences between participants and trainers. Therefore, we collected a one-on-one SST multimodal dataset to investigate and automatically capture natural characteristics expressed by people with such mental difficulties as Schizophrenia or Autism Spectrum Disorder. To validate the potential of the dataset, using partially annotated data, we trained a classifier to distinguish Schizophrenia from healthy controls with audio-visual features. We achieved over 85% accuracy, precision, recall, and F1-score in the classification task using only natural interaction data, instead of data captured in specific tasks designed for clinical assessments.
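The reported metrics (precision, recall, F1-score) all derive from confusion-matrix counts; a minimal sketch with hypothetical counts, not the study's results:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts of a binary classifier."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical confusion counts: 8 TP, 2 FP, 2 FN.
p, r, f = prf1(8, 2, 2)
```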

Multimodal Analysis and Synthesis for Conversational Research

Three research threads in multimodal analysis and synthesis research are interesting and relevant. The first thread involves technologies for off-line multimodal assessment and feedback in conversational and interviewing settings. While nonverbal and verbal assessment is important from the interviewer’s viewpoint, feedback is important for the candidate. Several notable developments in modeling and HCI have taken place in the multimodal analysis literature. The second research thread concerns technologies for real-time multimodal conversational agents and applications. Multimodal analysis enables understanding the user's emotion and state; multimodal dialog makes use of this information along with the spoken text to generate the surface text and the appropriate state for the virtual agent; and finally, the multimodal synthesis module generates suitable nonverbal behavior and prosody for the agent's reply. These systems can find applications in healthcare or education, where information can be solicited from the user and a simple task-oriented conversation can be accomplished. Innovations for quickly customizing avatar appearance and animations are also emerging. The final research thread involves Generative Adversarial Network (GAN)-based multimodal analysis and synthesis systems for controlled conversational research. GANs can generate human-centered images and videos, enabling the study of appearance and behavior manipulation and thus controlled conversational experiments.

The Point of Action where Cognitive Behavioral Therapy is Effective

Cognitive behavioral therapy (CBT) is a treatment method that eases feelings and reduces stress by working on how we think about and perceive things (cognition), as well as on behavior [1]. It is one of the effective psychotherapies performed for depression and anxiety disorders in actual clinical psychiatric practice. CBT includes cognitive restructuring, behavioral activation, and problem-solving techniques. What is the point of action of CBT as a treatment?

Living creatures gather information from their surroundings and recognize it, that is, process it and connect it to appropriate actions. In the process of evolution, in order to act rapidly, we acquired feelings as an instant “signal system” that indicates the result of cognition, and relegated the processing leading up to cognition below consciousness. In other words, the “signal system,” i.e., feelings, lights up instantly, and the actions that accompany it can be taken swiftly. However, if the information gathering that leads to cognition is not successful, cognition will be false, leading to false “signals,” that is, false feelings. It is assumed that information gathering is narrowed in depression, causing a depressed mood. By optimizing this information-gathering process, CBT guides correct cognition and thereby optimizes feelings. This is the point of action of CBT.

Our research team is working on the development of automated CBT using AI, based on this point of action of CBT [2, 3].

SESSION: WOCBU'21 Workshop

Automatic analysis of infant engagement during play: An end-to-end learning and Explainable AI pilot experiment

Infant engagement during play is an active area of research related to the development of cognition. Automatic detection of engagement could benefit the research process, but existing techniques for automatic affect detection are unsuitable for this scenario, since they rely on the automatic extraction of facial and postural features trained on clear video capture of adults. This study shows that end-to-end deep learning methods can successfully detect infant engagement, without the need for clear facial video, when trained for a specific interaction task. It further shows that attention-mapping techniques can provide explainability, thereby enabling trust in and insight into a model’s reasoning process.

Recording the Speech of Children with Atypical Development: Peculiarities and Perspectives

The paper considers the possibility of using a unified speech-recording protocol for research on speech in typically and atypically developing children. The research protocol includes model situations: dialogue, repetition, picture description, and play. The peculiarities of speech recording in model situations for children with atypical development are described. Data on the speech of children with autism spectrum disorders, Down syndrome, and intellectual disabilities obtained in the model situations are presented. Perspectives for future work are discussed, including the creation of an interactive computer program - a virtual assistant (“friend”) - to avoid the influence of the individual characteristics of the experimenter and parents in the model situations.

Measuring Frequency of Child-directed WH-Question Words for Alternate Preschool Locations using Speech Recognition and Location Tracking Technologies

Speech and language development in children is crucial for ensuring effective long-term learning skills. A child’s vocabulary size at the time of entry into kindergarten is an early indicator of their ability to learn to read and of potential long-term success in school. The preschool classroom is thus a promising venue for assessing growth in young children by measuring their interactions with teachers as well as classmates. However, to date, few studies have explored such naturalistic audio communications. Automatic Speech Recognition (ASR) technologies provide an opportunity for ‘Early Childhood’ researchers to obtain knowledge through automatic analysis of naturalistic classroom recordings in measuring such interactions. For this purpose, 208 hours of audio recordings across 48 daylong sessions were collected in a childcare learning center in the United States using Language Environment Analysis (LENA) devices worn by the preschool children. Approximately 29 hours of adult speech and 26 hours of child speech were segmented using manual transcriptions provided by the CRSS transcription team. Traditional as well as end-to-end ASR models were trained on the adult/child speech data subsets. A factorized time-delay neural network achieves the best word error rate (WER) of 35.05% on the adult subset of the test set. End-to-end transformer models achieve 63.5% WER on the child subset of the test data. Next, bar plots demonstrating the frequency of WH-question words in Science vs. Reading activity areas of the preschool are presented for sessions in the test set. It is suggested that learning spaces could be configured to encourage greater adult-child conversational engagement given such speech/audio assessment strategies.
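The word error rate used to evaluate these ASR models is the word-level edit distance (substitutions, insertions, deletions) divided by the reference length; a minimal sketch with an invented utterance pair:

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    # Standard dynamic-programming edit-distance table.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / len(r)

print(wer("where is the red ball", "where is a red ball"))  # one substitution → 0.2
```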