This paper is about the automatic recognition of head movements in videos of face-to-face dyadic conversations. We present an approach in which recognition of head movements is cast as a multimodal frame classification problem based on visual and acoustic features. The visual features include velocity, acceleration, and jerk values associated with head movements, while the acoustic ones are pitch and intensity measurements from the co-occurring speech. We present the results obtained by training and testing a number of classifiers on manually annotated data from two conversations. The best-performing classifier, a Multilayer Perceptron trained on all the features, obtains 0.75 accuracy and outperforms the mono-modal baseline classifier.
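As an illustration of the frame classification setup described above, the following sketch trains a Multilayer Perceptron on concatenated per-frame visual (velocity, acceleration, jerk) and acoustic (pitch, intensity) features. It assumes scikit-learn; the synthetic arrays and network settings are placeholders, not the authors' implementation.

# Minimal sketch of frame-level multimodal head-movement classification.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_frames = 5000

# Visual features: velocity, acceleration, jerk of the head (3 values per frame).
visual = rng.normal(size=(n_frames, 3))
# Acoustic features: pitch and intensity of the co-occurring speech (2 values per frame).
acoustic = rng.normal(size=(n_frames, 2))
X = np.hstack([visual, acoustic])          # feature-level fusion of both modalities
y = rng.integers(0, 2, size=n_frames)      # 1 = head-movement frame, 0 = none

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(32,), max_iter=500))
clf.fit(X_tr, y_tr)
print("frame accuracy:", accuracy_score(y_te, clf.predict(X_te)))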
This study presents a model that predicts a speaker's willingness level in human-robot interview interaction using multimodal features (i.e., verbal, audio, and visual). We collected a novel multimodal interaction corpus, including two types of willingness annotation data sets. A binary classification task on the willingness level (high or low) was implemented to evaluate the proposed multimodal prediction model. We obtained the best classification accuracy (i.e., 0.6) using the random forest model with audio and motion features. The difference between the best accuracy (i.e., 0.6) and the coders' recognition accuracy (i.e., 0.73) was 0.13.
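A minimal sketch of the binary willingness classification with a random forest over fused audio and motion features, assuming scikit-learn; the feature matrices and their dimensionalities below are synthetic stand-ins, not the corpus features.

# Random forest over concatenated audio and motion features, evaluated with cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_segments = 400
audio = rng.normal(size=(n_segments, 20))    # e.g., prosodic statistics per segment
motion = rng.normal(size=(n_segments, 12))   # e.g., head/body motion statistics per segment
X = np.hstack([audio, motion])
y = rng.integers(0, 2, size=n_segments)      # 1 = high willingness, 0 = low willingness

rf = RandomForestClassifier(n_estimators=200, random_state=1)
scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
print("mean accuracy:", scores.mean())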
Sarcasm is a common feature of user interaction on social networking sites. Sarcasm differs from typical communication in the alignment of literal meaning with intended meaning. Humans can recognize sarcasm given sufficient context information, including the various contents available on SNS. Existing literature mainly uses text data to detect sarcasm; however, a few recent studies propose to use image data as well. To date, no study has focused on user interaction patterns as a source of context information for detecting sarcasm. In this paper, we present a supervised machine learning based approach focusing on both the contents of posts (e.g., text, image) and users' interactions with those posts on Facebook.
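The following is a hedged sketch of how post content and interaction signals could be combined as features for supervised sarcasm detection; the TF-IDF text features, the three interaction counts, and the logistic regression classifier are illustrative assumptions, not the paper's pipeline.

# Combine text-content features with per-post interaction features for classification.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

posts = ["Oh great, another Monday...",
         "Had a lovely day at the beach",
         "Sure, because that always works"] * 50
labels = np.array([1, 0, 1] * 50)            # 1 = sarcastic, 0 = not sarcastic

text_features = TfidfVectorizer().fit_transform(posts)
# Hypothetical interaction features per post: 'haha' reactions, comments, shares.
interaction = np.random.default_rng(2).poisson(3, size=(len(posts), 3))

X = hstack([text_features, csr_matrix(interaction)])
clf = LogisticRegression(max_iter=1000)
print("mean accuracy:", cross_val_score(clf, X, labels, cv=5).mean())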
Prior efforts to create an autonomous computer system capable of predicting what a human being is thinking or feeling from facial expression data have been largely based on outdated, inaccurate models of how emotions work that rely on many scientifically questionable assumptions. In our research, we are creating an empathetic system that incorporates the latest provable scientific understanding of emotions: that they are constructs of the human mind, rather than universal expressions of distinct internal states. Thus, our system uses a user-dependent method of analysis and relies heavily on contextual information to make predictions about what subjects are experiencing. Our system's accuracy, and therefore its usefulness, rests on provable ground truths that prevent it from drawing the inaccurate conclusions other systems can too easily reach.
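A minimal sketch of the user-dependent idea: fit one model per subject on that subject's own data instead of a single pooled model. The feature layout, label set, and classifier choice below are assumptions for illustration only.

# One model per subject rather than a single pooled, user-independent model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
subjects = {f"subject_{i}": (rng.normal(size=(200, 10)),     # facial + contextual features
                             rng.integers(0, 3, size=200))   # self-reported state labels
            for i in range(5)}

per_user_models = {}
for name, (X, y) in subjects.items():
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)                      # trained only on this subject's own data
    per_user_models[name] = model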
Existing stress measurement methods, including cortisol measurement, blood pressure monitoring, and psychometric testing, are invasive, impractical, or intermittent, limiting both clinical and biofeedback utility. Better stress measurement methods are needed for practical, widespread application. For the project ViRST, where we use a Virtual Reality (VR) environment controlled by a speech dialog system to provide chronic pain relief, we designed a novel stress biofeedback system. Our prototype employs an ear-clip Photoplethysmogram (PPG) sensor, an Arduino microcontroller, and a supervised learning algorithm. To acquire a training dataset, we ran stress induction experiments on 10 adult subjects aged 30-58 to track Heart Rate Variability (HRV) metrics and Discrete Wavelet Transform (DWT) coefficients. We trained an AdaBoost ensemble classifier to 93% 4-fold cross-validation accuracy and 93% precision. We outline future work to better suit a VR environment and facilitate additional modes of interaction by simplifying the human interface.
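The classifier stage could look roughly like the sketch below, assuming scikit-learn and PyWavelets; the window length, wavelet choice, and HRV/DWT feature set are illustrative guesses rather than the ViRST prototype's exact features.

# HRV metrics plus DWT coefficient energies as features for an AdaBoost stress classifier.
import numpy as np
import pywt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)

def window_features(ibi_window):
    """HRV metrics and DWT energies for one window of inter-beat intervals."""
    sdnn = np.std(ibi_window)                                # HRV: SDNN
    rmssd = np.sqrt(np.mean(np.diff(ibi_window) ** 2))       # HRV: RMSSD
    coeffs = pywt.wavedec(ibi_window, "db4", level=3)        # discrete wavelet transform
    dwt_energy = [np.sum(c ** 2) for c in coeffs]
    return [sdnn, rmssd, *dwt_energy]

# Synthetic stand-in for labelled PPG-derived inter-beat-interval windows.
windows = rng.normal(0.8, 0.05, size=(300, 64))
X = np.array([window_features(w) for w in windows])
y = rng.integers(0, 2, size=300)                             # 1 = stressed, 0 = baseline

clf = AdaBoostClassifier(n_estimators=100, random_state=4)
print("4-fold accuracy:", cross_val_score(clf, X, y, cv=4).mean())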
Debates are popular among politicians, journalists, and scholars because they are a useful way to foster discussion and argumentation about relevant matters. In these discussions, people try to make a good impression (the immediate effect produced in the mind after a stimulus) by showcasing good skills in oratory and argumentation. We investigate this issue using data gathered from an audience watching a national election debate and measuring the impression that politicians made during it. We then build a multimodal approach for automatically predicting their impression and analyze which modalities are the most important for this task. Our results show that the vision modality brings the best results and that fusing modalities at the feature level is beneficial depending on the setup. The dataset is made publicly available to the community for further research on this topic.
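A brief sketch of feature-level (early) fusion against a single-modality baseline, assuming scikit-learn; the modality matrices, the impression target, and the random forest regressor are placeholders for illustration.

# Feature-level fusion: concatenate modality features before training one model,
# and compare against a vision-only baseline.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n_segments = 250
vision = rng.normal(size=(n_segments, 30))     # e.g., facial/gesture descriptors
audio = rng.normal(size=(n_segments, 20))      # e.g., prosodic descriptors
text = rng.normal(size=(n_segments, 15))       # e.g., lexical descriptors
y = rng.normal(size=n_segments)                # audience impression score per segment

X_fused = np.hstack([vision, audio, text])
model = RandomForestRegressor(n_estimators=200, random_state=5)
print("fused R^2:", cross_val_score(model, X_fused, y, cv=5, scoring="r2").mean())
print("vision R^2:", cross_val_score(model, vision, y, cv=5, scoring="r2").mean())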
Long exposure to stress is known to lead to physical and mental health problems. But how can we as individuals track and monitor our stress? Wearables that measure heart rate variability have been studied to detect stress. Such devices, however, need to be worn all day long and can be expensive. As an alternative, we propose the use of frontal face videos to distinguish between stressful and non-stressful activities. Affordable personal tracking of stress levels could be obtained by analyzing the video stream of inbuilt cameras in laptops. In this work, we present a preliminary analysis of 114 one-hour-long videos. During each video, the subjects perform a typing exercise before and after being exposed to a stressor. We performed binary classification using a Random Forest (RF) to distinguish between stressful and non-stressful activities. As features, facial action units (AUs) extracted from each video frame were used. We obtained average accuracies of over 97% and 50% for subject-dependent and subject-independent classification, respectively.
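The sketch below contrasts subject-dependent and subject-independent evaluation of a Random Forest on per-frame action-unit features, assuming scikit-learn; the synthetic AU matrix and the GroupKFold split are illustrative, not the study's exact protocol.

# Subject-dependent vs. subject-independent evaluation of a Random Forest on AU features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GroupKFold

rng = np.random.default_rng(6)
n_frames, n_aus = 6000, 17
X = rng.normal(size=(n_frames, n_aus))         # AU intensities per video frame
y = rng.integers(0, 2, size=n_frames)          # 1 = stressful segment, 0 = non-stressful
subject_id = rng.integers(0, 20, size=n_frames)

rf = RandomForestClassifier(n_estimators=100, random_state=6)
# Subject-dependent: frames from the same subject may appear in both train and test folds.
dep = cross_val_score(rf, X, y, cv=5).mean()
# Subject-independent: GroupKFold keeps each subject's frames within a single fold.
indep = cross_val_score(rf, X, y, cv=GroupKFold(n_splits=5), groups=subject_id).mean()
print(f"subject-dependent: {dep:.2f}, subject-independent: {indep:.2f}")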
This research investigates the effectiveness of speech audio and facial image deformation tools that make conversation participants appear more positive. In an experiment, we found that participants' feelings became more positive when using the deformation tools, and this made conversations more active. It was also found that voice deformation alone was as effective as a combination of voice and facial image deformations. Moreover, participants were more likely to activate the tools when listening and when feeling less positive. Finally, we discuss how these findings can be applied to the design of a remote communication support system that helps shift a conversation toward a positive mood.
We examine whether EEG-based cognitive load (CL) estimation is generalizable across character, spatial pattern, bar graph, and pie chart-based visualizations for the n-back task. CL is estimated via two recent approaches: (a) a deep convolutional neural network [2], and (b) proximal support vector machines [15]. Experiments reveal that CL estimation suffers across visualizations, calling for effective machine learning techniques to benchmark visual interface usability for a given analytic task.
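A hedged sketch of the cross-visualization generalization check: train a cognitive load classifier on EEG features from one visualization type and test it on another. The band-power features and the SVM used here stand in for the cited deep CNN and proximal SVM and are assumptions for illustration only.

# Train on one visualization condition, test on each of the others.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)
visualizations = ["character", "spatial", "bar", "pie"]
data = {v: (rng.normal(size=(200, 40)),            # band-power EEG features per trial
            rng.integers(0, 2, size=200))          # 1 = high load, 0 = low load
        for v in visualizations}

for train_v in visualizations:
    for test_v in visualizations:
        if train_v == test_v:
            continue
        X_tr, y_tr = data[train_v]
        X_te, y_te = data[test_v]
        clf = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
        print(f"train on {train_v}, test on {test_v}: {clf.score(X_te, y_te):.2f}")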
Current two-dimensional methods of controlling large numbers of small unmanned aerial systems (sUAS) are limited by a human's ability to track, efficiently control, and maintain situational awareness of large numbers of sUAS. The 2017 DARPA-sponsored Service Academies Swarm Challenge inspired research producing novel command and control techniques that utilize a virtual reality (VR) environment and a multimodal interface. A swarm commander using this approach can select one sUAS, a sub-swarm, or even the entire swarm and assign that grouping a behavior to be carried out autonomously. This immersive VR environment improves situational awareness, optimizes command and control actions, reduces the commander's task load, and ultimately provides a significant advantage over approaches that rely on traditional techniques.
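As a rough illustration of the selection-and-behavior command pattern described above, the following sketch models a command that targets a single sUAS, a sub-swarm, or the entire swarm; the behavior names and data structure are hypothetical, not the system's actual interface.

# A command pairs a selection of sUAS identifiers with a behavior to run autonomously.
from dataclasses import dataclass
from enum import Enum, auto

class Behavior(Enum):
    FOLLOW_ROUTE = auto()
    SURVEIL_AREA = auto()
    RETURN_TO_BASE = auto()

@dataclass
class SwarmCommand:
    selection: frozenset      # IDs of the selected sUAS: one, a sub-swarm, or all
    behavior: Behavior

swarm_ids = frozenset(range(50))
commands = [
    SwarmCommand(frozenset({7}), Behavior.RETURN_TO_BASE),          # single sUAS
    SwarmCommand(frozenset(range(10, 20)), Behavior.SURVEIL_AREA),  # sub-swarm
    SwarmCommand(swarm_ids, Behavior.FOLLOW_ROUTE),                 # entire swarm
]
for cmd in commands:
    print(f"{len(cmd.selection)} vehicle(s) -> {cmd.behavior.name}")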
Assessing the social competence of anthropomorphic artificial agents developed to produce engaging social interactions with humans has become of primary importance for effectively comparing various appearances and/or behaviours. Here we attempt to objectify the social competence of artificial agents, across different dimensions, using human brain neurophysiology. Whole-brain activity is recorded with functional Magnetic Resonance Imaging (fMRI) while participants, naïve to the real purpose of the experiment, discuss either with a human confederate or with an artificial agent, here the robotic conversational head Furhat controlled with a Wizard of Oz procedure. This allows a direct comparison of local brain responses, not only at the cortical level but also in subcortical structures associated with motivational drive, which are impossible to investigate with non-invasive neurophysiology techniques such as surface recordings. The present data (n = 21 participants) demonstrate the feasibility of this approach. The results confirm increased activity, when interacting with a human compared to an artificial agent, in subcortical structures, in particular the amygdala, involved in emotional processing, and the hypothalamus, known to secrete, among others, the neurohormone oxytocin involved in social bonding, as well as in the temporoparietal junction bilaterally, involved in the attribution of mental states. The reverse contrast revealed more dorsal cortical areas. Altogether, these results support the use of fMRI to objectify the social competence of artificial agents along distinct dimensions.
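The group-level comparison implied above could be sketched as a paired contrast of region-of-interest activity between the human and artificial-agent conditions across the 21 participants; the ROI values below are synthetic, and the paired t-test is a simplification of a standard fMRI GLM/contrast analysis rather than the authors' pipeline.

# Paired contrast of ROI activity: human confederate vs. artificial agent condition.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(8)
n_participants = 21
rois = ["amygdala", "hypothalamus", "TPJ_left", "TPJ_right"]

# Mean activity estimates per participant and condition (placeholder values).
human_cond = {r: rng.normal(0.5, 0.3, n_participants) for r in rois}
agent_cond = {r: rng.normal(0.3, 0.3, n_participants) for r in rois}

for roi in rois:
    t, p = ttest_rel(human_cond[roi], agent_cond[roi])
    print(f"{roi}: human > agent, t({n_participants - 1}) = {t:.2f}, p = {p:.3f}")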