MCPMD '18: Proceedings of the Workshop on Modeling Cognitive Processes from Multimodal Data


SESSION: Cognition in interaction

Predicting group satisfaction in meeting discussions

We address the task of automatically predicting group satisfaction in meetings using acoustic, lexical, and turn-taking features. Participant satisfaction is measured using post-meeting ratings from the AMI corpus. We focus on predicting three aspects of satisfaction: overall satisfaction, participant attention satisfaction, and information overload. All predictions are made at the aggregated group level. In general, we find that combining features across modalities improves prediction performance, and that feature ablation improves it further. Our experiments also show how data-driven methods can be used to explore how different facets of group satisfaction are expressed through different modalities. For example, including prosodic features improves prediction of attention satisfaction but hinders prediction of overall satisfaction, while the opposite holds for lexical features. Moreover, feelings of sufficient attention were better reflected by acoustic features than by speaking time, while information overload was better reflected by specific lexical cues and turn-taking patterns. Overall, this study indicates that group affect is revealed as much by how participants speak as by what they say.

Multimodal approach for cognitive task performance prediction from body postures, facial expressions and EEG signal

Recent developments in computer vision and the emergence of wearable sensors have opened opportunities for advanced, sophisticated techniques for multi-modal user assessment and personalized training, which are important in educational, industrial-training, and rehabilitation applications. They have also paved the way for assistive robots that accurately assess human cognitive and physical skills. Assessment and training cannot be generalized, as requirements vary for every person and every application; a system's ability to adapt to the individual's needs and performance is essential to its effectiveness. In this paper, the focus is on task performance prediction, an important parameter for personalization. Several research works focus on predicting task performance from physiological and behavioral data. In this work, we follow a multi-modal approach in which the system collects information from different modalities to predict performance based on (a) the user's emotional state recognized from facial expressions (behavioral data), (b) the user's emotional state recognized from body postures (behavioral data), and (c) task performance from EEG signals (physiological data) while the person performs a robot-based cognitive task. This multi-modal approach of combining physiological and behavioral data produces the highest accuracy of 87.5 percent, outperforming prediction from any single modality. In particular, this approach is useful for finding associations between facial expressions, body postures, and brain signals while a person performs a cognitive task.

Workload-driven modulation of mixed-reality robot-human communication

In this work we explore how augmented reality annotations can be used as a form of mixed-reality gesture, how neurophysiological measurements can inform the decision of whether to use such gestures, and whether and how to adapt language when using them. We propose a preliminary investigation of how decisions regarding robot-to-human communication modality in mixed-reality environments might be made on the basis of humans' perceptual and cognitive states. Specifically, we propose to use brain data acquired with high-density functional near-infrared spectroscopy (fNIRS) to measure the neural correlates of cognitive and emotional states with particular relevance to adaptive human-robot interaction (HRI). We describe several states of interest that fNIRS is well suited to measure and that have direct implications for HRI adaptations, and we leverage a framework developed in our prior work to explore how different neurophysiological measures could inform the selection of different communication strategies. We then describe results from a feasibility experiment in which multilabel convolutional long short-term memory networks were trained to classify the target mental states of 10 participants, and we discuss a research agenda for adaptive human-robot teams based on our findings.

Symptoms of cognitive load in interactions with a dialogue system

Humans adapt their behaviour to the perceived cognitive load of their dialogue partner, for example by delaying non-essential information. We propose that spoken dialogue systems should do the same, particularly in high-stakes scenarios such as emergency response. In this paper, we summarize the prosodic, turn-taking, and other linguistic symptoms of cognitive load analysed in the literature. We then apply these features to a single corpus in the restaurant-finding domain and propose new symptoms evidenced through interaction with the dialogue system, including utterance entropy and speech recognition confidence, as well as others based on dialogue acts.
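The abstract does not define utterance entropy; a natural reading is the Shannon entropy of the token distribution within an utterance. A minimal sketch under that assumption (the tokenization and example utterances are illustrative, not from the paper):

```python
import math
from collections import Counter

def utterance_entropy(tokens):
    """Shannon entropy (bits) of the token distribution in one utterance.
    Repetitive, disfluent speech yields lower entropy than varied speech."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A repetitive utterance scores lower than a varied one of similar length.
low = utterance_entropy("uh uh I I want want food".split())
high = utterance_entropy("find a cheap italian place near the station".split())
```

Whether low or high entropy signals load is an empirical question the paper addresses; the sketch only shows how the raw feature could be computed per utterance.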

SESSION: Attention

Histogram of oriented velocities for eye movement detection

Research in various fields, including psychology, cognition, and medical science, deals with eye tracking data to extract information about the intention and cognitive state of a subject. For extracting this information, the detection of eye movement types is an important task. Modern eye tracking data is noisy, and most state-of-the-art algorithms are not designed for all types of eye movements, since some are still under research. We propose a novel feature for eye movement detection called the histogram of oriented velocities. The construction of the feature is similar to the well-known histogram of oriented gradients from computer vision. Since the detector is trained using machine learning, it can always be extended to new eye movement types. We evaluate our feature against the state of the art on publicly available data. The evaluation includes different machine learning approaches, such as support vector machines, regression trees, and k-nearest neighbors, over different parameter sets. We provide a MATLAB script for the computation and evaluation as well as an integration in EyeTrace which can be downloaded at
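By analogy with HOG, a histogram of oriented velocities presumably bins the orientation of successive gaze velocity vectors, weighting bins by speed. A minimal sketch under that assumption (the bin count and the `(x, y, t)` sample format are illustrative; the paper's exact construction may differ):

```python
import math

def histogram_of_oriented_velocities(gaze, n_bins=8):
    """Sketch of an HOV feature: bin the orientation of successive gaze
    velocity vectors, weighting each bin by speed (analogous to HOG's
    gradient-magnitude weighting). `gaze` is a list of (x, y, t) samples;
    timestamps need not be evenly spaced."""
    hist = [0.0] * n_bins
    for (x0, y0, t0), (x1, y1, t1) in zip(gaze, gaze[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip dropped or duplicated samples
        vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
        speed = math.hypot(vx, vy)
        if speed == 0:
            continue  # no motion, no orientation
        angle = math.atan2(vy, vx) % (2 * math.pi)
        hist[int(angle / (2 * math.pi) * n_bins) % n_bins] += speed
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# A purely rightward sweep concentrates all the mass in bin 0.
feat = histogram_of_oriented_velocities([(i, 0.0, i * 0.004) for i in range(5)])
```

Because each pair of samples uses its actual time difference, the feature tolerates the fluctuating sampling rates the paper highlights.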

Estimating mental load in passive and active tasks from pupil and gaze changes using Bayesian surprise

Eye-based monitoring has been suggested as a means to measure mental load in a non-intrusive way. In most cases, the experiments have been conducted in settings where the user is mainly passive. This constraint does not reflect applications where we want to identify the mental load of an active user, e.g. during surgery. The main objective of our work is to investigate the potential of an eye tracking device for measuring mental load in realistic active situations. In our first experiments we calibrate our system using a well-established passive setup. There, we confirm that our setup can reliably recover pupil width in real time, and we observe the previously reported relationship between pupil width and cognitive load; however, we also observe very high variance between test subjects. In a follow-up active-task experiment, neither pupil width nor eye gaze showed significant predictive power over workflow disruptions. To address this, we present an approach for estimating the likelihood of workflow disruptions during active fine-motor tasks. Our method combines the eye-based data with Bayesian surprise theory and successfully predicts the user's struggle, with correlations of 35% and 75%, respectively.
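The abstract does not give its formulation, but the standard definition of Bayesian surprise is the KL divergence from the prior belief to the posterior belief after an observation. A minimal sketch under that assumption, modeling belief over pupil width as a Gaussian with known observation noise (all numbers are illustrative):

```python
import math

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between two univariate Gaussians."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def bayesian_surprise(prior_mu, prior_var, obs, obs_var):
    """Surprise of one pupil-width sample: KL divergence between the
    posterior (conjugate Gaussian update with known observation noise)
    and the prior belief over pupil width."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mu = post_var * (prior_mu / prior_var + obs / obs_var)
    return gaussian_kl(post_mu, post_var, prior_mu, prior_var)

# An observation far from the prior mean is more surprising.
small = bayesian_surprise(3.0, 0.1, 3.05, 0.05)
large = bayesian_surprise(3.0, 0.1, 4.0, 0.05)
```

The intuition matches the paper's use: moments when incoming pupil or gaze data sharply revise the running belief are candidate workflow disruptions, even when raw pupil width alone is not predictive.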

Investigating static and sequential models for intervention-free selection using multimodal data of EEG and eye tracking

Multimodal data is increasingly used in cognitive prediction models to better analyze and predict different user cognitive processes. Classifiers based on such data, however, have different performance characteristics. In this paper, we discuss an intervention-free selection task using multimodal EEG and eye tracking data in three different models. We show that a sequential model, an LSTM, is more sensitive but less precise than a static model, an SVM. Moreover, we introduce a confidence-based Competition-Fusion model using both the SVM and the LSTM. The fusion model further improves recall compared to either the SVM or the LSTM alone, without decreasing precision compared to the LSTM. Based on these results, we recommend the SVM for interactive applications that require minimal false positives (high precision), and recommend the LSTM, and especially the Competition-Fusion model, for applications that handle intervention-free selection requests in an additional post-processing step and thus require higher recall than precision.
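The paper's exact fusion rule is not given here; one plausible reading of "confidence-based competition" is that when the two classifiers disagree, the one further from its decision threshold wins. A toy sketch under that assumption (function name, thresholds, and the margin rule are all illustrative, not the authors' scheme):

```python
def competition_fusion(svm_prob, lstm_prob, svm_thresh=0.5, lstm_thresh=0.5):
    """Toy confidence-based competition between two classifiers.
    Each votes 'select' when its score clears its threshold; on
    disagreement, the vote with the larger margin from its threshold wins."""
    svm_vote = svm_prob >= svm_thresh
    lstm_vote = lstm_prob >= lstm_thresh
    if svm_vote == lstm_vote:
        return svm_vote  # agreement: take the shared vote
    svm_margin = abs(svm_prob - svm_thresh)
    lstm_margin = abs(lstm_prob - lstm_thresh)
    return svm_vote if svm_margin >= lstm_margin else lstm_vote
```

A rule of this shape can raise recall over either model alone (a confident positive from either classifier can carry the decision) while a confident negative still vetoes weak positives, which is consistent with the precision/recall trade-off the abstract reports.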

Overlooking: the nature of gaze behavior and anomaly detection in expert dentists

The cognitive processes that underlie expert decision making in medical image interpretation are crucial to understanding what constitutes optimal performance. Often, when an anomaly goes undetected, the exact nature of the false negative is not fully understood. This work examines 24 experts' performance (true positives and false negatives) during an anomaly detection task on 13 images, together with the corresponding gaze behavior. Using a drawing and an eye-tracking experimental paradigm, we compared experts' target anomaly detection in orthopantomographs (OPTs) against their own gaze behavior. We found a relationship between the number of anomalies detected and the anomalies looked at. However, roughly 70% of anomalies that were not explicitly marked in the drawing paradigm were nevertheless looked at. We therefore examined how often each anomaly was glanced at, and found that unmarked target anomalies were most often glanced at only once or twice, whereas marked targets received a higher number of glances. Since this behavior was not consistent across images, we attribute these differences to image complexity.

Rule-based learning for eye movement type detection

Eye movements hold information about human perception, intention, and cognitive state. Various algorithms have been proposed to identify and distinguish eye movements, particularly fixations, saccades, and smooth pursuits. A major drawback of existing algorithms is that they rely on accurate and constant sampling rates and error-free recordings, and they impede straightforward adaptation to new movement types, such as microsaccades, since they are designed to detect specific eye movements. We propose a novel rule-based machine learning approach that creates detectors from annotated or simulated data. It is capable of learning diverse types of eye movements as well as automatically detecting pupil detection errors in the raw gaze data. Additionally, our approach can handle any sampling rate, even a fluctuating one. It learns several interdependent thresholds together with previous type classifications and automatically combines them into sets of detectors. We evaluated our approach against state-of-the-art algorithms on publicly available datasets. Our approach is integrated in the newest version of EyeTrace which can be downloaded at
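The paper learns its interdependent thresholds automatically; to make the idea concrete, here is the shape of a single hand-set threshold rule of the kind such a system would learn (the 30 deg/s cutoff, the labels, and the `(x, y, t)` format are illustrative):

```python
import math

def detect_events(gaze, saccade_speed=30.0):
    """Toy threshold-rule detector: label each inter-sample step
    'saccade' when its speed exceeds `saccade_speed` (here deg/s),
    else 'fixation'. Dividing by the actual dt per step tolerates
    fluctuating sampling rates; non-positive dt is flagged as noise."""
    labels = []
    for (x0, y0, t0), (x1, y1, t1) in zip(gaze, gaze[1:]):
        dt = t1 - t0
        if dt <= 0:
            labels.append('noise')  # timestamp or pupil-detection error
            continue
        speed = math.hypot(x1 - x0, y1 - y0) / dt
        labels.append('saccade' if speed > saccade_speed else 'fixation')
    return labels

# A slow step followed by a fast jump.
events = detect_events([(0.0, 0.0, 0.00), (0.1, 0.0, 0.01), (5.0, 0.0, 0.02)])
```

The paper's contribution is that such thresholds, their interdependencies, and the use of previous classifications are learned from annotated or simulated data rather than fixed by hand as above.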

SESSION: Neural and cognitive modeling

Integrating non-invasive neuroimaging and computer log data to improve understanding of cognitive processes

As non-invasive neuroimaging techniques become less expensive and more portable, we can monitor brain activity during various computer activities. This provides an opportunity to integrate brain data with computer log data to develop models of cognitive processes. Such models can be used to continually assess an individual's changing cognitive state and to develop adaptive human-computer interfaces. As a step in this direction, we conducted a study using functional near-infrared spectroscopy (fNIRS) during the Sustained Attention to Response Task (SART) paradigm, which prior work has used to elicit mind wandering and explore response inhibition. The goal is to determine whether fNIRS data can predict errors on the task, which would have implications for detecting similar cognitive processes in more realistic tasks, such as using a personal learning environment. Additionally, this study examines individual differences by correlating objective behavioral data and subjective self-reports with activity in the medial prefrontal cortex (mPFC), a region associated with the brain's default mode network (DMN). We observed significant differences in mPFC activity between periods prior to a task error and periods prior to a correct response. These differences were particularly apparent among individuals who performed poorly on the SART and those who reported drowsiness. In line with previous work, these findings indicate an opportunity to detect and correct attentional shifts in the individuals who need it most.

Multimer: validating multimodal, cognitive data in the city: towards a model of how the urban environment influences streetscape users

Multimer is a new technology that aims to provide a data-driven understanding of how humans cognitively and physically experience spatial environments. By multimodally measuring biosensor data to model how the built environment and its uses influence cognitive processes, Multimer aims to help space professionals like architects, workplace strategists, and urban planners make better design interventions. Multimer is perhaps the first spatial technology that collects biosensor data, like brainwave and heart rate data, and analyzes it with both spatiotemporal and neurophysiological tools.

The Multimer mobile app can record data from several kinds of commonly available, inexpensive wearable sensors, including EEG, ECG, pedometer, accelerometer, and gyroscope modules. The app also records user-entered information via its interface and micro-surveys, and combines all of this data with the user's geolocation using GPS, beacons, and other location tools. Multimer's study platform displays all of this data in real time at the individual and aggregate levels. Multimer also validates the data by comparing the collected sensor and sentiment data in spatiotemporal contexts, and then integrates it with other data sets, such as citizen reports, traffic data, and city amenities, to provide actionable insights for the evaluation and redesign of sites and spaces.

This report presents preliminary results from the data validation process for a Multimer study of 101 subjects in New York City from August to October 2017. Ultimately, the aim of this study is to prototype a replicable, scalable model of how the built environment and the movement of traffic influence the neurophysiological state of pedestrians, cyclists, and drivers.

The role of emotion in problem solving: first results from observing chess

In this paper we present results from recent experiments suggesting that chess players associate emotions with game situations and reactively use these associations to guide search during planning and problem solving. We report on a pilot experiment with multi-modal observation of human experts solving challenging chess problems. Our results confirm that cognitive processes have observable correlates in displays of emotion and in gaze fixations, and that these displays can be used to evaluate models of cognitive processes. They also revealed unexpectedly rapid changes in emotion as players attempt to solve challenging problems. We propose a cognitive model to explain these observations and describe initial results from a second experiment designed to test it.

Discovering digital representations for remembered episodes from lifelog data

Combining self-reports in which individuals reflect on their thoughts and feelings (Experience Samples) with sensor data collected via ubiquitous monitoring can provide researchers and applications with detailed insights about human behavior and psychology. However, meaningfully associating these two sources of data with each other is difficult: while it is natural for human beings to reflect on their experience in terms of remembered episodes, it is an open challenge to retrace this subjective organization in sensor data referencing objective time.

Lifelogging is a specific approach to the ubiquitous monitoring of individuals that can help overcome this recollection gap. It strives to create a comprehensive timeline of semantic annotations that reflect the impressions of the monitored person from his or her own subjective point of view.

In this paper, we describe a novel approach for processing such lifelogs to situate remembered experiences on an objective timeline. It involves computationally modeling individuals' memory processes to estimate segments within a lifelog that act as plausible digital representations of their recollections. We report on an empirical investigation in which we use our approach to discover plausible representations of remembered social interactions between participants in a longitudinal study. In particular, we describe an exploration of the behavior our memory-process model displays in this setting. Finally, we explore the representations discovered in this study and discuss insights that might be gained from them.