IUI '22: 27th International Conference on Intelligent User Interfaces


SESSION: Invited Keynotes

Employing Social Media to Improve Mental Health: Pitfalls, Lessons Learned, and the Next Frontier

Social media data is increasingly used to computationally learn about and infer the mental health states of individuals and populations. Despite being touted as a powerful means to shape interventions and impact mental health recovery, we understand little about the theoretical, domain, and psychometric validity of this novel information source, or its underlying biases, when it is appropriated to augment conventionally gathered data such as surveys and verbal self-reports. This talk presents a critical analytic perspective on the pitfalls of social media signals of mental health, especially when they are derived from “proxy” diagnostic indicators, often removed from the real-world context in which they are likely to be used. To overcome these pitfalls, the talk then presents results from two case studies in which machine learning algorithms to glean mental health insights from social media were developed in a context-sensitive and human-centered way, in collaboration with domain experts and stakeholders. The first case study, a collaboration with a health provider, focuses on the individual perspective and reveals the ability and implications of using social media data from consenting schizophrenia patients to forecast relapse and support clinical decision-making. Scaling up to populations, in collaboration with a federal organization and toward influencing public health policy, the second case study seeks to forecast nationwide rates of suicide fatalities using social media signals in conjunction with health services data. The talk concludes with a discussion of the path forward, emphasizing the need for a collaborative, multi-disciplinary research agenda for realizing the potential of social media data and machine learning in mental health – one that incorporates methodological rigor, ethics, and accountability, all at once.

From Social to Prosocial Machines: A New Challenge for AI

Throughout the past few years, artificial intelligence (AI) has become increasingly present in our daily lives. A myriad of settings have become the stage for AI applications, such as factories, roads, houses, hospitals, and even schools. Given these new contexts, AI-powered machines must now place the human at the centre and be designed to interact with humans in a natural way: AI is becoming social.

But such a diverse use of AI also fosters change, especially in the way we behave and how we cooperate with each other and with machines. It is therefore important to reflect upon the impact that AI may have on human societies, and to consider its effects on supporting more collaboration, social action, and prosocial behavior. Prosocial behavior occurs when people and agents perform costly actions that benefit others. Acts such as helping others voluntarily, donating to charity, providing information, or sharing resources are all forms of prosocial behavior. Humans are inherently prosocial, and attributes such as altruism or empathy, which affect decision-making and cooperation, are essential ingredients of more just and positive societies. However, the view of human decision-making prevalent in the design of AI is based on the homo economicus principle, where utility maximization and selfishness form the backbone for modeling autonomous behavior.

In this talk I will challenge this view and explore how to create AI agents that place prosocial behavior at their core and, while engaged in human settings, cultivate cooperation and encourage people to contribute to the social good.

I will imagine a future where prosocial machines can be a reality and present three case studies from different areas: prosocial robotics, prosocial games, and social simulation. These simple examples illustrate how AI can play an active role in contributing to a kinder society.

Provably Beneficial Artificial Intelligence

As AI advances in capabilities and moves into the real world, its potential to benefit humanity seems limitless. Yet we see serious problems including racial and gender bias, manipulation by social media, and an arms race in lethal autonomous weapons. Looking further ahead, Alan Turing predicted the eventual loss of human control over machines that exceed human capabilities. I will argue that Turing was right to express concern but wrong to think that doom is inevitable. Instead, we need to develop a new kind of AI that is provably beneficial to humans.

SESSION: Session 1: Health, Well-being and Accessibility

Towards Efficient Annotations for a Human-AI Collaborative, Clinical Decision Support System: A Case Study on Physical Stroke Rehabilitation Assessment

Artificial intelligence (AI) and machine learning (ML) algorithms are increasingly being explored to support various decision-making tasks in health (e.g. rehabilitation assessment). However, the development of such AI/ML-based decision support systems is challenging due to the expensive process of collecting an annotated dataset. In this paper, we describe the development process of a human-AI collaborative, clinical decision support system that augments an ML model with a rule-based (RB) model from domain experts. We conducted its empirical evaluation in the context of assessing physical stroke rehabilitation, using a dataset of three exercises from 15 post-stroke survivors and therapists. Our results bring new insights into the efficient development and annotation of a decision support system: when an annotated dataset is not initially available, the RB model can be used to assess a post-stroke survivor’s quality of motion and to identify samples with low confidence scores, supporting efficient annotation for training an ML model. Specifically, our system requires only 22 - 33% of annotations from therapists to train an ML model that performs as well as an ML model trained with all annotations from a therapist. Our work discusses the value of a human-AI collaborative approach for effectively collecting an annotated dataset and supporting a complex decision-making task.
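For illustration, here is a minimal sketch (with hypothetical names and scores, not the paper's implementation) of the annotation-selection idea described above: the rule-based model's confidence scores are used to decide which samples a therapist labels first.

```python
# Hypothetical sketch: rank unlabeled exercise samples by the rule-based
# model's confidence and send the least confident ones to the therapist.
import numpy as np

def select_for_annotation(rb_confidences, budget):
    """Return indices of the `budget` samples with the lowest RB confidence."""
    return np.argsort(np.asarray(rb_confidences))[:budget]

# Example: 10 unlabeled samples, annotate the 3 least confident ones.
conf = [0.91, 0.40, 0.77, 0.35, 0.88, 0.52, 0.96, 0.61, 0.30, 0.83]
print(select_for_annotation(conf, budget=3))  # -> [8 3 1]
```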

Colorbo: Envisioned Mandala Coloring through Human-AI Collaboration

Mandala coloring is popular among many people, from children to adults, and many studies have revealed its benefits for mental well-being. However, our preliminary study results reveal difficulties in mandala coloring tasks, such as selecting harmonious colors/areas and envisioning how each selection affects the final output. This paper presents Colorbo, an interactive system based on human-AI collaboration for envisioning mandala coloring. Colorbo and its user each colorize a mandala while watching each other work. The user shows a colored mandala in progress, and Colorbo fills in the remaining areas by analyzing the patterns and color combinations of the user’s image. Colorbo then projects the complete mandala onto the paper the user is colorizing, and the user continues coloring by envisioning the outcome based on images from Colorbo. We conducted a within-subject study to investigate the effectiveness of Colorbo. Our quantitative and qualitative analysis results show participants’ positive experiences, their concerns regarding coloring behavior with Colorbo, and their preferred projection method for envisioning a mandala. Finally, based on these findings, we discuss the design implications for human-AI collaboration in the area of art.

The “Artificial” Colleague: Evaluation of Work Satisfaction in Collaboration with Non-human Coworkers

The advance of “artificial intelligence” (AI)-based technologies has the potential to transform work tremendously. Work is a major part of life and, thus, its meaningfulness, or lack thereof, will impact overall well-being. Previous research has investigated human-AI collaboration at work mostly with a focus on performance; however, little attention has been given to how collaboration with AI influences the meaningfulness of work and job satisfaction. In this paper, we present an online experiment comparing the perceived meaningfulness of work and the relationship to the collaborator across different task distributions and collaborators (human/AI). Our results show that working with a human is more motivating and meaningful than working with an AI, independent of the task. Moreover, the AI is more often viewed as a subordinate, while the human is perceived as a teammate. These results provide preliminary implications for designing collaboration with AI in ways that consider job satisfaction.

Crowdsourcing Thumbnail Captions via Time-Constrained Methods

Speech interfaces, such as personal assistants and screen readers, employ captions to allow users to consume images; however, there is typically only one caption available per image, which may not be adequate for all settings (e.g., browsing large quantities of images). Longer captions require more time to consume, whereas shorter captions may hinder a user’s ability to fully understand the image’s content. We explore how to effectively collect both thumbnail captions—succinct image descriptions meant to be consumed quickly—and comprehensive captions, which allow individuals to understand visual content in greater detail. We consider text-based and time-constrained methods to collect descriptions at these two levels of detail, and find that a time-constrained method is most effective for collecting thumbnail captions while preserving caption accuracy. We evaluate our collected captions along three human-rated axes—correctness, fluency, and level of detail—and discuss the potential for model-based metrics to perform automatic evaluation.

InSupport: Proxy Interface for Enabling Efficient Non-Visual Interaction with Web Data Records

Interaction with web data records typically involves accessing auxiliary webpage segments such as filters, sort options, search forms, and multi-page links. As these segments are usually scattered across the screen, accessing them is arduous and tedious for blind users who rely on screen readers, given that content navigation with screen readers is predominantly one-dimensional, despite available support for skipping content via special keyboard shortcuts or selective navigation. Extant techniques for overcoming inefficient web screen reader interaction have mostly focused on general web content navigation, and as such provide little to no support for data record-specific interaction activities such as filtering and sorting – activities that are equally important for enabling quick and easy access to the desired data records. To fill this void, we present InSupport, a browser extension that: (i) employs custom-built machine learning models to automatically extract auxiliary segments on any webpage containing data records, and (ii) provides an instantly accessible one-stop proxy interface for easily navigating the extracted segments using basic screen reader shortcuts. An evaluation study with 14 blind participants showed significant usability improvements with InSupport, driven by reduced interaction time and fewer key presses compared to state-of-the-art solutions.

Opportunities for Human-AI Collaboration in Remote Sighted Assistance

Remote sighted assistance (RSA) has emerged as a conversational assistive technology for people with visual impairments (VI), in which remote sighted agents provide realtime navigational assistance to users with VI via video-chat-like communication. In this paper, we conducted a literature review and interviewed 12 RSA users to comprehensively understand the technical and navigational challenges in RSA for both agents and users. Technical challenges are organized into four categories: agents’ difficulties in orienting and localizing the users; acquiring the users’ surroundings and detecting obstacles; delivering information and understanding user-specific situations; and coping with a poor network connection. Navigational challenges are presented in 15 real-world scenarios (8 outdoor, 7 indoor) for the users. Prior work indicates that computer vision (CV) technologies, especially interactive 3D maps and realtime localization, can address a subset of these challenges. However, we argue that addressing the full spectrum of these challenges warrants new development in Human-CV collaboration, which we formalize as five emerging problems: making object recognition and obstacle avoidance algorithms blind-aware; localizing users under poor networks; recognizing digital content on LCD screens; recognizing text on irregular surfaces; and predicting the trajectory of out-of-frame pedestrians or objects. Addressing these problems can advance computer vision research and usher in the next generation of RSA services.

FitNibble: A Field Study to Evaluate the Utility and Usability of Automatic Diet Monitoring in Food Journaling Using an Eyeglasses-based Wearable

The ultimate goal of automatic diet monitoring (ADM) systems is to make food journaling as easy as counting steps with a smartwatch. To achieve this goal, it is essential to understand the utility and usability of ADM systems in real-world settings. However, this has been challenging since many ADM systems perform poorly outside research labs. Therefore, one of the main focuses of ADM research has been on improving ecological validity. This paper presents an evaluation of ADM’s utility and usability using an end-to-end system, FitNibble. FitNibble is robust to many challenges that real-world settings pose and provides just-in-time notifications to remind users to journal as soon as they start eating. In this evaluation, we conducted a long-term field study to compare traditional self-report journaling with ADM-assisted journaling. We recruited 13 participants from various backgrounds and asked them to try each journaling method for nine days. Our results showed that FitNibble improved adherence by significantly reducing the number of missed events (19.6% improvement, p = .0132). Results also showed that participants were highly dependent on FitNibble in maintaining their journals. Participants also reported increased awareness of their dietary patterns, especially with snacking. All these results highlight the potential of ADM for improving the food journaling experience.

SESSION: Session 2: Recommender Systems and Decision-Making

Explaining Recommendations in E-Learning: Effects on Adolescents’ Trust

In the scope of explainable artificial intelligence, explanation techniques are heavily studied to increase trust in recommender systems. However, studies on explaining recommendations typically target adults in e-commerce or media contexts; e-learning has received less research attention. To address this gap, we investigated how explanations affect adolescents’ initial trust in an e-learning platform that recommends mathematics exercises with collaborative filtering. In a randomized controlled experiment with 37 adolescents, we compared real explanations with placebo and no explanations. Our results show that real explanations significantly increased initial trust when trust was measured as a multidimensional construct of competence, benevolence, integrity, intention to return, and perceived transparency. Yet, this result did not hold when trust was measured one-dimensionally. Furthermore, not all adolescents attached equal importance to explanations, and trust scores were high overall. These findings underline the need to tailor explanations and suggest that dynamically learned factors may be more important than explanations for building initial trust. We conclude by reflecting upon the need for explanations and recommendations in e-learning in low-stakes and high-stakes situations.

Recommendations as Challenges: Estimating Required Effort and User Ability for Health Behavior Change Recommendations

Recommender systems use implicit and explicit user feedback to recommend desired products or items online. When the recommended item is a task or behavior change activity, several variables, such as the difficulty of the task and the user’s ability to achieve it, in addition to user preferences and needs, determine the suitability of the recommendations. This paper focuses on how the concepts of user ability and task difficulty can be integrated into the recommendation process to personalize health activity recommendations. To this end, we compare five approaches, some borrowed from the sports and gaming world, and explore their application, advantages, and drawbacks. Through a two-week study, we obtained a suitable dataset to investigate how these algorithms can be used in a health recommender system (HRS) and which one is the most appropriate choice for an online HRS in terms of the characteristics and flexibility required for behavior change related tailoring. We compared this choice with a baseline algorithm as part of a fully functional HRS to assess the feasibility and impact of integrating the user ability and required effort concepts on user engagement with the recommendations in a two-week online longitudinal study. The results overall suggest that such integration is effective and that, in addition to realizing health behavior change requirements, it improves user engagement with the recommendations.

TastePaths: Enabling Deeper Exploration and Understanding of Personal Preferences in Recommender Systems

Recommender systems are ubiquitous and influence the information we consume daily by helping us navigate vast catalogs of information like music databases. However, their linear approach of surfacing content in ranked lists limits their ability to help us grow and understand our personal preferences. In this paper, we study how we can better support users in exploring a novel space, specifically focusing on music genres. Informed by interviews with expert music listeners, we developed TastePaths: an interactive web tool that helps users explore an overview of the genre-space via a graph of connected artists. We conducted a comparative user study with 16 participants where each of them used a personalized version of TastePaths (built with a set of artists the user listens to frequently) and a non-personalized one (based on a set of the most popular artists in a genre). We find that participants employed various strategies to explore the space. Overall, they greatly preferred the personalized version as it helped anchor their exploration and provided recommendations that were more compatible with their personal taste. In addition, TastePaths helped participants specify and articulate their interest in the genre and gave them a better understanding of the system’s organization of music. Based on our findings, we discuss opportunities and challenges for incorporating more control and expressive feedback in recommendation systems to help users explore spaces beyond their immediate interests and improve these systems’ underlying algorithms.

SESSION: Session 3: Explainable AI (XAI) 1

Do Humans Prefer Debiased AI Algorithms? A Case Study in Career Recommendation

Currently, there is a surge of interest in fair Artificial Intelligence (AI) and Machine Learning (ML) research that aims to mitigate discriminatory bias in AI algorithms, e.g. along lines of gender, age, and race. While most research in this domain focuses on developing fair AI algorithms, in this work we examine the challenges that arise when humans and fair AI interact. Our results show that, due to an apparent conflict between human preferences and fairness, a fair AI algorithm on its own may be insufficient to achieve its intended results in the real world. Using college major recommendation as a case study, we build a fair AI recommender by employing gender-debiasing machine learning techniques. Our offline evaluation showed that the debiased recommender makes fairer and more accurate college major recommendations. Nevertheless, an online user study of more than 200 college students revealed that participants on average prefer the original biased system over the debiased system. Specifically, we found that the perceived gender disparity associated with a college major is a determining factor in the acceptance of a recommendation. In other words, our results demonstrate that we cannot fully address the gender bias issue in AI recommendations without addressing the gender bias in humans. They also highlight the urgent need to extend the current scope of fair AI research from narrowly focusing on debiasing AI algorithms to including new persuasion and bias explanation technologies in order to achieve the intended societal impacts.

Exploring the Effects of Machine Learning Literacy Interventions on Laypeople’s Reliance on Machine Learning Models

Today, machine learning (ML) technologies have penetrated almost every aspect of people’s lives, yet public understanding of these technologies is often limited. This highlights the urgent need to design effective methods for increasing people’s machine learning literacy, as a lack of relevant knowledge may result in people’s inappropriate usage of machine learning technologies. In this paper, we focus on an ML-assisted decision-making setting and conduct a human-subject randomized experiment to explore how providing different types of user tutorials as machine learning literacy interventions can influence laypeople’s reliance on ML models, on both in-distribution and out-of-distribution examples. We vary the existence, interactivity, and scope of the user tutorial across different treatments in our experiment. Our results show that user tutorials, when presented in appropriate forms, can help some people rely on ML models more appropriately. For example, for individuals who are relatively skilled at solving the decision-making task themselves, receiving a user tutorial that is interactive and addresses the specific ML model to be used allows them to reduce their over-reliance on the ML model when they could outperform it. In contrast, low-performing individuals’ reliance on the ML model is not affected by the presence or type of user tutorial. Finally, we also find that people perceive the interactive tutorial to be more understandable and slightly more useful. We conclude by discussing the design implications of our study.

Explaining Call Recommendations in Nursing Homes: a User-Centered Design Approach for Interacting with Knowledge-Based Health Decision Support Systems

Recommender systems are increasingly used in high-risk application domains, including healthcare. It has been shown that explanations are crucial in this context to support decision-making. This paper explores how to explain call recommendations to nursing home staff, providing insights into call priority, notifications, and resident information. We present the design and implementation of a recommender engine and a mobile application designed to support call recommendations and to explain these recommendations, which may contribute to residents’ safety and quality of care. More specifically, we report on the results of a user-centered design approach with residents (N=12) and healthcare professionals (N=4), and on a final evaluation (N=12) after four months of deployment. The results show that our design approach provides a valuable tool for more accurate and efficient decision-making. The overall system encourages nursing home staff to provide feedback and annotations, resulting in more confidence in the system. We discuss usability issues, challenges, and reflections to be considered in future health recommender systems.

Deep Learning Uncertainty in Machine Teaching

Machine Learning models can output confident but incorrect predictions. To address this problem, ML researchers use various techniques to reliably estimate ML uncertainty, usually on controlled benchmarks once the model has been trained. We explore how the two types of uncertainty—aleatoric and epistemic—can help non-expert users understand the strengths and weaknesses of a classifier in an interactive setting. We are interested in users’ perception of the difference between aleatoric and epistemic uncertainty and their use of it to teach and understand the classifier. We conducted an experiment where non-experts train a classifier to recognize card images and are tested on their ability to predict classifier outcomes. Participants who used either larger or more varied training sets significantly improved their understanding of uncertainty, both epistemic and aleatoric. However, participants who relied on the uncertainty measure to guide their choice of training data did not significantly improve classifier training, nor were they better able to guess the classifier outcome. We identified three specific situations where participants successfully identified the difference between aleatoric and epistemic uncertainty: placing a card in the exact same position as a training card; placing different cards next to each other; and placing a non-card, such as their hand, next to or on top of a card. We discuss our methodology for estimating uncertainty for Interactive Machine Learning systems and question the need for two-level uncertainty in Machine Teaching.
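As context for the aleatoric/epistemic distinction above, one common estimation technique (not necessarily the authors') is a small ensemble of classifiers: aleatoric uncertainty is the average per-member entropy, and epistemic uncertainty is the extra entropy caused by disagreement between members. A minimal sketch:

```python
# Illustrative sketch (not the paper's implementation): decompose predictive
# uncertainty for one input using an ensemble of softmax classifiers.
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)

def decompose_uncertainty(member_probs):
    """member_probs: array (n_members, n_classes) of softmax outputs for one input.
    Total = entropy of the mean prediction, aleatoric = mean per-member entropy,
    epistemic = the difference (mutual information)."""
    total = entropy(member_probs.mean(axis=0))
    aleatoric = entropy(member_probs, axis=-1).mean()
    return total, aleatoric, total - aleatoric

# Members that disagree -> mostly epistemic uncertainty.
probs = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]])
print(decompose_uncertainty(probs))
```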

How Do People Rank Multiple Mutant Agents?

Faced with several AI-powered sequential decision-making systems, how might someone choose on which to rely? For example, imagine car buyer Blair shopping for a self-driving car, or developer Dillon trying to choose an appropriate ML model to use in their application. Their first choice might be infeasible (i.e., too expensive in money or execution time), so they may need to select their second or third choice. To address this question, this paper presents: 1) Explanation Resolution, a quantifiable direct measurement concept; 2) a new XAI empirical task to measure explanations: “the Ranking Task”; and 3) a new strategy for inducing controllable agent variations—Mutant Agent Generation. In support of those main contributions, it also presents 4) novel explanations for sequential decision-making agents; 5) an adaptation to the AAR/AI assessment process; and 6) a qualitative study around these devices with 10 participants to investigate how they performed the Ranking Task on our mutant agents, using our explanations, and structured by AAR/AI. From an XAI researcher’s perspective, just as mutation testing can be applied to any code, mutant agent generation can be applied to essentially any neural network for which one wants to evaluate an assessment process or explanation type. From an XAI user’s perspective, the participants ranked the agents well overall, but showed the importance of high explanation resolution for close differences between agents. The participants also revealed the importance of supporting a wide diversity of explanation diets and agent “test selection” strategies.

Investigating Explainability of Generative AI for Code through Scenario-based Design

What does it mean for a generative AI model to be explainable? The emergent discipline of explainable AI (XAI) has made great strides in helping people understand discriminative models. Less attention has been paid to generative models that produce artifacts, rather than decisions, as output. Meanwhile, generative AI (GenAI) technologies are maturing and being applied to application domains such as software engineering. Using scenario-based design and question-driven XAI design approaches, we explore users’ explainability needs for GenAI in three software engineering use cases: natural language to code, code translation, and code auto-completion. We conducted 9 workshops with 43 software engineers in which real examples from state-of-the-art generative AI models were used to elicit users’ explainability needs. Drawing from prior work, we also propose 4 types of XAI features for GenAI for code and gathered additional design ideas from participants. Our work explores explainability needs for GenAI for code and demonstrates how human-centered approaches can drive the technical development of XAI in novel domains.

SESSION: Session 4: Alternative Input Modes

Emotion Recognition in Conversations Using Brain and Physiological Signals

Emotions are complicated psycho-physiological processes that are related to numerous external and internal changes in the body. They play an essential role in human-human interaction and can be important for human-machine interfaces. Automatically recognizing emotions in conversation could be applied in many application domains like healthcare, education, social interactions, entertainment, and more. Facial expressions, speech, and body gestures are primary cues that have been widely used for recognizing emotions in conversation. However, these cues can be ineffective as they cannot reveal underlying emotions when people involuntarily or deliberately conceal their emotions. Researchers have shown that analyzing brain activity and physiological signals can lead to more reliable emotion recognition since they generally cannot be controlled. However, these body responses in emotional situations have rarely been explored in interactive tasks like conversations. This paper explores and discusses the performance and challenges of using brain activity and other physiological signals in recognizing emotions in a face-to-face conversation. We present an experimental setup for stimulating spontaneous emotions using a face-to-face conversation and creating a dataset of brain and physiological activity. We then describe our analysis strategies for recognizing emotions using Electroencephalography (EEG), Photoplethysmography (PPG), and Galvanic Skin Response (GSR) signals in subject-dependent and subject-independent approaches. Finally, we describe new directions for future research in conversational emotion recognition, as well as the limitations and challenges of our approach.

Differentiating Endogenous and Exogenous Attention Shifts Based on Fixation-Related Potentials

Attentional shifts can occur voluntarily (endogenous control) or reflexively (exogenous control). Previous studies have shown that the neural mechanisms underlying these shifts produce different activity patterns in the brain. Changes in visual-spatial attention are usually accompanied by eye movements and a fixation on the new center of attention. In this study, we analyze the fixation-related potentials in electroencephalographic recordings of 10 participants during computer screen-based viewing tasks. During task performance, we presented salient visual distractors to evoke reflexive attention shifts. Surrounding each fixation, 0.7-second data windows were extracted and labeled as “endogenous” or “exogenous”. Averaged over all participants, the balanced classification accuracy using a person-dependent Linear Discriminant Analysis reached 59.84%. In a leave-one-participant-out approach, the average classification accuracy reached 58.48%. Differentiating attention shifts, based on fixation-related potentials, could be used to deepen the understanding of human viewing behavior or as a Brain-Computer Interface for attention-aware user interface adaptations.
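A hedged sketch of this style of analysis, with assumed data shapes rather than the authors' actual pipeline: flattened fixation-locked epochs are classified as endogenous vs. exogenous with shrinkage LDA under leave-one-participant-out cross-validation.

```python
# Illustrative only: random placeholder data stands in for real EEG epochs.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
n_epochs, n_features = 600, 64 * 20            # e.g. 64 channels x 20 time points
X = rng.normal(size=(n_epochs, n_features))    # flattened 0.7 s fixation-locked epochs
y = rng.integers(0, 2, n_epochs)               # 0 = endogenous, 1 = exogenous
groups = rng.integers(0, 10, n_epochs)         # participant IDs

scores = []
for train, test in LeaveOneGroupOut().split(X, y, groups):
    clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
    clf.fit(X[train], y[train])
    scores.append(balanced_accuracy_score(y[test], clf.predict(X[test])))
print(f"leave-one-participant-out balanced accuracy: {np.mean(scores):.2f}")
```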

Brainwave-Augmented Eye Tracker: High-Frequency SSVEPs Improve Camera-Based Eye Tracking Accuracy

In this work, we leverage neural mechanisms of visual attention to improve the accuracy of a commercial eye tracker through the analysis of electroencephalography (EEG) waves. Gaze targets were rendered on a computer screen with imperceptible flickering stimuli (≥ 40Hz) that elicited attention-modulated steady-state visual evoked potentials (SSVEPs). Our hybrid system combines EEG and eye-tracking modalities to overcome accuracy limitations of the gaze tracker alone. We integrate EEG and gaze data to efficiently exploit their complementary strengths, driving a Bayesian probabilistic decoder that estimates the target gazed at by the user. Our system’s performance was analyzed across the screen with varying target sizes, spacings, and dataset epoch lengths, using data from 10 subjects. Overall, our hybrid approach improves the classification accuracy of the eye tracker alone for all target parameters and dataset epoch lengths by 11 units on average. The system shows a larger impact at peripheral screen regions, where performance enhancement is maximal, reaching improvements of over 45 units. The findings of this work demonstrate that the intrinsic accuracy limitations of camera-based eye trackers can be corrected with the integration of EEG data, and open opportunities for gaze tracking applications with higher target granularity.
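The Bayesian fusion step described above can be illustrated with a toy example (hypothetical numbers, not the authors' decoder): per-target likelihoods from the eye tracker's error model and from SSVEP decoding are multiplied with a prior and normalized.

```python
# Toy illustration of Bayesian fusion over four candidate targets.
import numpy as np

gaze_likelihood = np.array([0.30, 0.35, 0.20, 0.15])  # from the gaze error model
eeg_likelihood = np.array([0.10, 0.70, 0.10, 0.10])   # from SSVEP frequency decoding
prior = np.full(4, 0.25)                              # uniform prior over targets

posterior = prior * gaze_likelihood * eeg_likelihood
posterior /= posterior.sum()
print("decoded target:", int(posterior.argmax()), posterior.round(3))
```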

Robust and Deployable Gesture Recognition for Smartwatches

Gesture recognition on smartwatches is challenging not only due to resource constraints but also due to the dynamically changing conditions of users. It is currently an open problem how to engineer gesture recognisers that are robust and yet deployable on smartwatches. Recent research has found that common everyday events, such as a user removing and wearing their smartwatch again, can deteriorate recognition accuracy significantly. In this paper, we suggest that prior understanding of causes behind everyday variability and false positives should be exploited in the development of recognisers. To this end, first, we present a data collection method that aims at diversifying gesture data in a representative way, in which users are taken through experimental conditions that resemble known causes of variability (e.g., walking while gesturing) and are asked to produce deliberately varied, but realistic gestures. Secondly, we review known approaches in machine learning for recogniser design on constrained hardware. We propose convolution-based network variations for classifying raw sensor data, achieving greater than 98% accuracy reliably under both individual and situational variations where previous approaches have reported significant performance deterioration. This performance is achieved with a model that is two orders of magnitude less complex than previous state-of-the-art models. Our work suggests that deployable and robust recognition is feasible but requires systematic efforts in data collection and network design to address known causes of gesture variability.
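As a rough illustration of a convolution-based classifier for raw sensor windows, here is a deliberately tiny 1D CNN sketch in PyTorch; the channel count, kernel sizes, and window length are assumptions, not the architecture reported in the paper.

```python
# Minimal sketch of a small convolutional classifier for raw IMU windows
# (channels x time); all hyper-parameters here are illustrative assumptions.
import torch
import torch.nn as nn

class TinyGestureNet(nn.Module):
    def __init__(self, n_channels=6, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=7, padding=3),
            nn.BatchNorm1d(16), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.BatchNorm1d(32), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # global pooling keeps the model tiny
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                     # x: (batch, channels, time)
        return self.classifier(self.features(x).squeeze(-1))

model = TinyGestureNet()
dummy = torch.randn(4, 6, 200)                # four windows of 200 samples
print(model(dummy).shape)                     # torch.Size([4, 8])
```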

Show of Hands: Leveraging Hand Gestural Cues in Virtual Meetings for Intelligent Impromptu Polling Interactions

Increased virtual meeting software usage has allowed people to meet remotely in a more seamless fashion. However, compared to in-person meetings, valuable interaction cues such as impromptu group polling are harder to execute: gauging remote participants is more difficult, and built-in polling tools require prior meeting setup for automated counting. We propose a novel intelligent user interface approach for virtual meeting software that supports impromptu polling interactions by leveraging real-time hand gesture recognition and video filter feedback. We conducted studies to design and evaluate this intuitive gesture-based polling system with visual feedback. Our results demonstrate that our system was able to recognize attendees’ gestures and poll responses with reasonable accuracy, and showed improvements in hosts’ task workload performance. Our interface thus informs hosts of valuable results while maintaining organic gestural interaction cues with attendees, similar to in-person meetings.

Hazard Notifications for Cyclists: Comparison of Awareness Message Modalities in a Mixed Reality Study

Cycling is an environmentally friendly means of transport with growing popularity. However, there is still potential for increased road safety in the future. We argue that making assistance systems available to cyclists could prevent accidents. In this paper, we focus on potential accidents caused by vehicle doors opening in a cyclist’s path of travel, which can lead to serious injuries to the cyclist. Using a mixed-methods approach, we explored how messages informing about a potentially opening door ahead are perceived and understood regarding usability and intuitiveness in a bicycle simulator study (N=24). We investigated how visual messages, visual messages with auditory icons, and visual messages with voice output on a head-mounted device are subjectively perceived. We also assessed our participants’ attitudes toward using such systems and mixed reality simulations for bicycle safety research in general. Our results show that participants preferred visual messages combined with auditory cues and found these types of notifications more enjoyable than visual messages alone. Furthermore, the results suggest that such a system would be used while cycling. Participants agreed that mixed reality simulation is suitable for testing and evaluating novel support systems and gaining initial insights as a first step, but confirmed that real-world testing on the road is nonetheless mandatory.

Developing Persona Analytics Towards Persona Science

Much of the reported work on personas suffers from a lack of empirical evidence. To address this issue, we introduce Persona Analytics (PA), a system that tracks how users interact with data-driven personas. PA captures users’ mouse and gaze behavior to measure their interaction with algorithmically generated personas and their use of system features in an interactive persona system. Measuring these activities provides an understanding of persona users’ behaviors, which is required for the quantitative measurement of persona use and for obtaining scientifically valid evidence. In a study with 144 participants, we demonstrate how PA can be deployed for remote user studies during exceptional times when physical user studies are difficult, if not impossible.

SESSION: Session 5: Tools for AI Developers

GridBook: Natural Language Formulas for the Spreadsheet Grid

Writing formulas on the spreadsheet grid is arguably the most widely practiced form of programming. Still, studies highlight the difficulties experienced by end-user programmers when learning and using traditional formulas, especially for slightly complex tasks. The purpose of GridBook is to ease these difficulties by supporting formulas expressed in natural language within the grid; it is the first system to do so.

GridBook builds on a parser utilizing deep learning to understand analysis intents from the natural language input within a spreadsheet cell. GridBook also leverages the spatial context between cells to infer the analysis parameters underspecified in the natural language input. Natural language enables users to analyze data easily and flexibly, to build queries on the results of previous analyses, and to view results intelligibly within the grid—thus taking spreadsheets one step closer to computational notebooks.

We evaluated GridBook via two comparative lab studies with 20 data analysts who were new only to GridBook. In our studies, there were no significant differences in time or cognitive load between participants’ data analysis with GridBook and with spreadsheets; however, data analysis with GridBook was significantly faster than with computational notebooks. Our study uncovers insights into the application of natural language as a special-purpose programming language for end-user programming in spreadsheets.

Better Together? An Evaluation of AI-Supported Code Translation

Generative machine learning models have recently been applied to source code, for use cases including translating code between programming languages, creating documentation from code, and auto-completing methods. Yet, state-of-the-art models often produce code that is erroneous or incomplete. In a controlled study with 32 software engineers, we examined whether such imperfect outputs are helpful in the context of Java-to-Python code translation. When aided by the outputs of a code translation model, participants produced code with fewer errors than when working alone. We also examined how the quality and quantity of AI translations affected the work process and quality of outcomes, and observed that providing multiple translations had a larger impact on the translation process than varying the quality of provided translations. Our results tell a complex, nuanced story about the benefits of generative code models and the challenges software engineers face when working with their outputs. Our work motivates the need for intelligent user interfaces that help software engineers effectively work with generative code models in order to understand and evaluate their outputs and achieve superior outcomes to working alone.

ODEN: Live Programming for Neural Network Architecture Editing

In deep learning application development, programmers tend to try different architectures and hyper-parameters until they are satisfied with the model’s performance. Nevertheless, program crashes due to tensor shape mismatches prevent programmers, especially novices, from smoothly going back and forth between neural network (NN) architecture editing and experimentation. We propose to leverage live programming techniques in NN architecture editing with an always-on visualization. When the user edits the program, the visualization synchronously displays tensor states and provides warning messages by continuously executing the program, preventing crashes during experimentation. We implement the live visualization and integrate it into an IDE called ODEN that seamlessly supports the “edit→experiment→edit→···” cycle. With ODEN, the user can construct a neural network with the live visualization and transition into experimentation to instantly train and test the NN architecture. An exploratory user study was conducted to evaluate the usability, limitations, and potential of live visualization in ODEN.

Expressive Communication: Evaluating Developments in Generative Models and Steering Interfaces for Music Creation

There is increasing interest from the ML and HCI communities in empowering creators with better generative models and more intuitive interfaces with which to control them. In music, ML researchers have focused on training models capable of generating pieces with increasingly long-range structure and musical coherence, while HCI researchers have separately focused on designing steering interfaces that support user control and ownership. In this study, we investigate how developments in both models and user interfaces are important for empowering co-creation where the goal is to create music that communicates particular imagery or ideas (e.g., as is common for other purposeful tasks in music creation like establishing mood or creating accompanying music for another medium). Our study is distinguished in that it measures communication both through composers’ self-reported experiences and through how listeners evaluate this communication in the music. In an evaluation study with 26 composers creating 100+ pieces of music and listeners providing 1000+ head-to-head comparisons, we find that more expressive models and more steerable interfaces are important and complementary ways to support composers in communicating through music and in their creative empowerment.

Emblaze: Illuminating Machine Learning Representations through Interactive Comparison of Embedding Spaces

Modern machine learning techniques commonly rely on complex, high-dimensional embedding representations to capture underlying structure in the data and improve performance. In order to characterize model flaws and choose a desirable representation, model builders often need to compare across multiple embedding spaces, a challenging analytical task supported by few existing tools. We first interviewed nine embedding experts in a variety of fields to characterize the diverse challenges they face and techniques they use when analyzing embedding spaces. Informed by these perspectives, we developed a novel system called Emblaze that integrates embedding space comparison within a computational notebook environment. Emblaze uses an animated, interactive scatter plot with a novel Star Trail augmentation to enable visual comparison. It also employs novel neighborhood analysis and clustering procedures to dynamically suggest groups of points with interesting changes between spaces. Through a series of case studies with ML experts, we demonstrate how interactive comparison with Emblaze can help gain new insights into embedding space structure.
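One building block of such comparisons, sketched here under our own assumptions rather than as Emblaze's implementation, is scoring how much each point's nearest-neighbour set changes between two embedding spaces; points with the largest change are natural candidates to surface.

```python
# Illustrative sketch: score per-point neighbourhood change between two spaces.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbor_change(emb_a, emb_b, k=10):
    """Fraction of each point's k nearest neighbours that differ between spaces."""
    idx_a = NearestNeighbors(n_neighbors=k + 1).fit(emb_a) \
        .kneighbors(emb_a, return_distance=False)[:, 1:]  # drop the point itself
    idx_b = NearestNeighbors(n_neighbors=k + 1).fit(emb_b) \
        .kneighbors(emb_b, return_distance=False)[:, 1:]
    return np.array([1 - len(set(a) & set(b)) / k for a, b in zip(idx_a, idx_b)])

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 32))
B = A + 0.5 * rng.normal(size=(200, 32))   # a perturbed second embedding space
scores = neighbor_change(A, B)
print("points with most-changed neighbourhoods:", np.argsort(-scores)[:5])
```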

Learning User Interface Semantics from Heterogeneous Networks with Multimodal and Positional Attributes

User interfaces (UIs) of desktop, web, and mobile applications involve a hierarchy of objects (e.g. applications, screens, view classes, and other types of design objects) with multimodal (e.g. textual, visual) and positional (e.g. spatial location, sequence order, and hierarchy level) attributes. We can therefore represent a set of application UIs as a heterogeneous network with multimodal and positional attributes. Such a network not only represents how users understand the visual layout of UIs, but also influences how users would interact with applications through these UIs. To model UI semantics well for different UI annotation, search, and evaluation tasks, this paper proposes the novel Heterogeneous Attention-based Multimodal Positional (HAMP) graph neural network model. HAMP combines graph neural networks with the scaled dot-product attention used in transformers to learn the embeddings of heterogeneous nodes and their associated multimodal and positional attributes in a unified manner. HAMP is evaluated with classification and regression tasks conducted on three distinct real-world datasets. Our experiments demonstrate that HAMP significantly outperforms other state-of-the-art models on such tasks. We also report our ablation study results on HAMP.
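For readers unfamiliar with the attention ingredient named above, here is a minimal sketch of scaled dot-product attention aggregating a UI node's neighbour embeddings. It shows only that single step, not the full HAMP model.

```python
# Minimal sketch: one UI node attends over its neighbours' embeddings.
import numpy as np

def scaled_dot_product_attention(query, keys, values):
    """query: (d,), keys/values: (n_neighbors, d)."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)            # (n_neighbors,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over neighbours
    return weights @ values                       # (d,) aggregated message

rng = np.random.default_rng(1)
node = rng.normal(size=8)                 # embedding of the target UI object
neighbors = rng.normal(size=(5, 8))       # e.g. text, image, and position nodes
print(scaled_dot_product_attention(node, neighbors, neighbors).shape)  # (8,)
```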

Understanding Screen Relationships from Screenshots of Smartphone Applications

All graphical user interfaces are composed of one or more screens that may be shown to the user depending on their interactions. Identifying the different screens of an app and understanding the types of changes that happen on them is a challenging task with applications in many areas, including automatic app crawling, playback of app automation macros, and large-scale app dataset analysis. For example, an automated app crawler needs to understand whether the screen it is currently viewing is the same as any previous screen it has encountered, so it can focus its efforts on portions of the app that it has not yet explored. Moreover, identifying the type of change on the screen, such as whether any dialogs or keyboards have opened or closed, helps an automatic crawler handle such events while crawling. Understanding screen relationships is difficult because instances of the same screen may exhibit visual and structural variation, for example due to different content in a database-backed application, scrolling, dialog boxes opening or closing, or content loading delays. At the same time, instances of different screens from the same app may share similarities in design, structure, and content. This paper uses a dataset of screenshots from more than 1K iPhone applications to train two ML models that understand similarity in different ways: (1) a screen similarity model that combines a UI object detector with a transformer model architecture to recognize instances of the same screen from a collection of screenshots of a single app, and (2) a screen transition model that uses a siamese network architecture to identify both similarity and three types of events that appear in an interaction trace: the keyboard or a dialog box appearing or disappearing, and scrolling. Our models achieve an F1 score of 0.83 on the screen similarity task, improving on comparable baselines, and an average F1 score of 0.71 across all events in the transition task.

SESSION: Session 6: Mobiles and Wearables

Estimating 3D Finger Pose via 2D-3D Fingerprint Matching

Touchscreens have become the primary input devices for smartphones, tablet computers, and other intelligent devices over the past decades. However, most pervasive commercial devices use only 2D touch positions on the screen as interaction inputs. To extend the richness of the input vocabulary, researchers have proposed several innovative interaction techniques, e.g. finger pose. However, due to the low resolution and limited information of capacitive images, only two angles, pitch and yaw, are considered in most finger pose estimation algorithms, and the accuracy is not sufficiently high for large-scale applications on smartphones. With the rapid development of under-screen fingerprint sensing technology, a new input modality for 3D finger pose estimation, the fingerprint image, is available from these fingerprint sensors. In this paper, we propose a finger-specific algorithm for estimating 3D finger pose, including roll, pitch, and yaw, from fingerprint images. The 3D finger surface is first reconstructed from sequential fingerprint images captured during enrollment; given this 3D surface model, the 3D finger pose of a test fingerprint is estimated by matching keypoints between the 2D image and the 3D point cloud and minimizing the projection error. The proposed approach is a non-learning algorithm with good generalization ability and robustness in real applications. To evaluate the performance of our method, we collected a dataset of fingerprint images with their corresponding ground-truth 3D angles. Experimental results on this dataset demonstrate the effectiveness of introducing the reconstructed 3D finger surface shape in 3D finger pose estimation. The average absolute errors of the three angles are 10.74 for roll, 8.25 for pitch, and 7.38 for yaw. Extensive experiments are also conducted to explore the impact of touching area size and gallery size on performance.
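Once 2D fingerprint keypoints are matched to points on the reconstructed 3D surface, estimating pose by minimizing projection error is, in its standard form, a PnP problem. The sketch below uses OpenCV's generic solver with random stand-in data and assumed intrinsics for illustration; the paper's own solver and sensor model may differ.

```python
# Illustrative only: random stand-in correspondences and assumed intrinsics.
import cv2
import numpy as np

object_pts = np.random.rand(12, 3).astype(np.float32)       # matched 3D surface points
image_pts = (np.random.rand(12, 2) * 160).astype(np.float32) # matched 2D keypoints (px)
K = np.array([[500.0, 0.0, 80.0],
              [0.0, 500.0, 80.0],
              [0.0, 0.0, 1.0]])                               # assumed sensor intrinsics

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None)
if ok:
    R, _ = cv2.Rodrigues(rvec)
    # One common ZYX (yaw-pitch-roll) decomposition of the rotation matrix.
    pitch = np.degrees(np.arcsin(-R[2, 0]))
    roll = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    yaw = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    print(f"roll={roll:.1f}, pitch={pitch:.1f}, yaw={yaw:.1f}")
```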

EyeSayCorrect: Eye Gaze and Voice Based Hands-free Text Correction for Mobile Devices

Text correction on mobile devices usually requires precise and repetitive manual control. In this paper, we present EyeSayCorrect, an eye gaze and voice based hands-free text correction method for mobile devices. To correct text with EyeSayCorrect, the user first uses their gaze location on the screen to select a word, then speaks the new phrase. EyeSayCorrect then infers the user’s correction intention based on these inputs and the text context. We use a Bayesian approach to determine the selected word given an eye-gaze trajectory: for each sampling point in the trajectory, the posterior probability of selecting a word is calculated and accumulated, and the target word is selected when its accumulated interest exceeds a threshold. Misspelt words are given higher priors. Our user studies showed that using priors for misspelt words reduced task completion time by up to 23.79% and text selection time by up to 40.35%, and that EyeSayCorrect is a feasible hands-free text correction method for mobile devices.
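A toy sketch of the accumulation idea described above, with hypothetical coordinates and thresholds: each gaze sample adds normalized posterior mass to nearby words, a word flagged as misspelt receives a higher prior, and a word is selected once its accumulated interest crosses a threshold.

```python
# Illustrative sketch of posterior accumulation for gaze-based word selection.
import numpy as np

def word_likelihood(gaze_xy, word_centers, sigma=40.0):
    """2D Gaussian likelihood of each word given one gaze sample (pixels)."""
    d2 = np.sum((word_centers - gaze_xy) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

word_centers = np.array([[100.0, 50.0], [220.0, 50.0], [340.0, 50.0]])
priors = np.array([1.0, 3.0, 1.0])        # middle word flagged as misspelt
interest = np.zeros(3)
threshold = 2.0

for gaze in [(215, 55), (230, 48), (210, 60), (225, 52)]:
    post = priors * word_likelihood(np.array(gaze, dtype=float), word_centers)
    interest += post / post.sum()          # accumulate normalized posterior
    if interest.max() > threshold:
        print("selected word index:", int(interest.argmax()))
        break
```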

Multimodal Driver Referencing: A Comparison of Pointing to Objects Inside and Outside the Vehicle

Advanced in-cabin sensing technologies, especially vision-based approaches, have tremendously advanced user interaction inside the vehicle, paving the way for new applications of natural user interaction. Just as humans use multiple modes to communicate with each other, we follow an approach characterized by simultaneously using multiple modalities to achieve natural human-machine interaction for a specific task: pointing to or glancing towards objects inside as well as outside the vehicle for deictic references. By tracking the movements of eye-gaze, head, and finger, we design a multimodal fusion architecture using a deep neural network to precisely identify the driver’s referencing intent. Additionally, we use a speech command as a trigger to separate each referencing event. We observe differences in driver behavior between the two pointing use cases (i.e. for inside and outside objects), especially when analyzing the preciseness of the three modalities: eye, head, and finger. We conclude that there is no single modality that is solely optimal for all cases, as each modality reveals certain limitations. Fusion of multiple modalities exploits the relevant characteristics of each modality, hence overcoming the case-dependent limitations of each individual modality. Ultimately, we propose a method to identify whether the driver’s referenced object lies inside or outside the vehicle, based on the predicted pointing direction.

Multimodal Error Correction for Speech-to-Text in a Mobile Office Automated Vehicle: Results From a Remote Study

Future users of automated vehicles will demand the ability to perform diverse and extensive non-driving-related tasks. However, prevailing restrictions in the car require new interaction concepts to enable productive office work. Intelligent voice-based interfaces may be a solution to facilitate productivity while at the same time keeping the “driver in the loop” and thereby maintaining safety. In this work, we investigated the repair problem of productive speech-to-text input in a highly automated vehicle. We examined the user experience of selecting/navigating to an incorrectly recognized word using only speech, pointing and clicking on a touchpad, and mid-air hand gestures. Results indicate that hand gestures (condition VaG) have high hedonic quality but are not considered viable for error correction in productive text input. On the other hand, the unimodal (Voice-only; baseline) and touchpad-based point-and-click (VaT) approaches to error correction were rated equally well in the hypothesized “mobile office” automated vehicle. The remote study methodology proved to be a useful intermediary between pure online surveys and on-site studies for qualitative research during a pandemic, but suffered from a lack of fidelity and limited options for objective usability and safety evaluation.

Hand Gesture Recognition for an Off-the-Shelf Radar by Electromagnetic Modeling and Inversion

Microwave radar sensors in human-computer interaction have several advantages compared to wearable and image-based sensors, such as privacy preservation, high reliability regardless of ambient and lighting conditions, and a larger field of view. However, the raw signals produced by such radars are high-dimensional and relatively complex to interpret. Advanced data processing, including machine learning techniques, is therefore necessary for gesture recognition. While these approaches can reach high gesture recognition accuracy, using artificial neural networks requires a significant number of gesture templates for training, and calibration is radar-specific. To address these challenges, we present a novel data processing pipeline for hand gesture recognition that combines advanced full-wave electromagnetic modelling and inversion with machine learning. In particular, the physical model accounts for the radar source, radar antennas, radar-target interactions, and the target itself, i.e., the hand in our case. To make this processing feasible, the hand is emulated by an equivalent infinite planar reflector, for which analytical Green’s functions exist. The apparent dielectric permittivity, which depends on the hand size, electric properties, and orientation, determines the wave reflection amplitude based on the distance from the hand to the radar. Through full-wave inversion of the radar data, the physical distance as well as this apparent permittivity are retrieved, thereby reducing the dimension of the radar dataset by several orders of magnitude while keeping the essential information. Finally, the estimated distance and apparent permittivity as a function of gesture time are used to train the machine learning algorithm for gesture recognition. This physically-based dimension reduction enables the use of simple gesture recognition algorithms, such as template-matching recognizers, that can be trained in real time and provide competitive accuracy with only a few samples. We evaluate significant stages of our pipeline on a dataset of 16 gesture classes, with 5 templates per class, recorded with the Walabot, a lightweight, off-the-shelf array radar. We also compare these results with an ultra wideband radar made of a single horn antenna and a lightweight vector network analyzer, and with a Leap Motion Controller.
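After inversion, each gesture reduces to a short two-channel time series (distance and apparent permittivity). A generic example of the kind of template matcher this enables, not the paper's exact recognizer, is a 1-nearest-neighbour classifier under dynamic time warping:

```python
# Generic template-matching sketch over (distance, apparent permittivity) series.
import numpy as np

def dtw_distance(a, b):
    """a, b: arrays of shape (time, 2). Classic O(len(a)*len(b)) DTW."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def classify(sample, templates):
    """templates: list of (label, array) pairs, a few per gesture class."""
    return min(templates, key=lambda t: dtw_distance(sample, t[1]))[0]

# Toy templates: a "push" gets closer (distance drops), a "pull" moves away.
push = np.column_stack([np.linspace(0.5, 0.1, 20), np.full(20, 3.0)])
pull = np.column_stack([np.linspace(0.1, 0.5, 20), np.full(20, 3.0)])
print(classify(push + 0.01, [("push", push), ("pull", pull)]))  # -> push
```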

Mind-proofing Your Phone: Navigating the Digital Minefield with GreaseTerminator

Digital harms are widespread in the mobile ecosystem. As these devices gain ever more prominence in our daily lives, the potential for malicious attacks against individuals grows with them. The last line of defense against a range of digital harms – including digital distraction, political polarisation through hate speech, and children being exposed to damaging material – is the user interface. This work introduces GreaseTerminator to enable researchers to develop, deploy, and test interventions against these harms with end-users. We demonstrate the ease of intervention development and deployment, as well as the broad range of harms potentially covered with GreaseTerminator, in five in-depth case studies.

SESSION: Session 7: Interacting with Machine Learning

Building Trust in Interactive Machine Learning via User Contributed Interpretable Rules

Machine learning technologies are increasingly being applied in many different domains in the real world. As autonomous machines and black-box algorithms begin making decisions previously entrusted to humans, great academic and public interest has been spurred to provide explanations that allow users to understand the decision-making process of the machine learning model. Besides explanations, Interactive Machine Learning (IML) seeks to leverage user feedback to iterate on an ML solution to correct errors and align decisions with those of the users. Despite the rise in explainable AI (XAI) and IML research, the links between interactivity, explanations, and trust have not been comprehensively studied in the machine learning literature. Thus, in this study, we develop and evaluate an explanation-driven interactive machine learning (XIML) system with the Tic-Tac-Toe game as a use case to understand how an XIML mechanism improves users’ satisfaction with the machine learning system. We explore different modalities to support user feedback through visual or rule-based corrections. Our online user study (n = 199) supports the hypothesis that allowing interactivity within this XIML system causes participants to be more satisfied with the system, while visual explanations play a less prominent (and somewhat unexpected) role. Finally, we leverage a user-centric evaluation framework to create a comprehensive structural model to clarify how subjective system aspects, which represent participants’ perceptions of the implemented interaction and visualization mechanisms, mediate the influence of these mechanisms on the system’s user experience.

HINT: Integration Testing for AI-based features with Humans in the Loop

The dynamic nature of AI technologies makes testing human-AI interaction and collaboration challenging – especially before such features are deployed in the wild. This presents a challenge for designers and AI practitioners as early feedback for iteration is often unavailable in the development phase. In this paper, we take inspiration from integration testing concepts in software development and present HINT (Human-AI INtegration Testing), a crowd-based framework for testing AI-based experiences through a human-in-the-loop workflow. HINT supports early testing of AI-based features within the context of realistic user tasks and makes use of successive sessions to simulate AI experiences that evolve over time. Finally, it provides practitioners with reports to evaluate and compare aspects of these experiences.

Through a crowd-based study, we demonstrate the need for over-time testing where user behaviors evolve as they interact with an AI system. We also show that HINT is able to capture and reveal these distinct user behavior patterns across a variety of common AI performance modalities using two AI-based feature prototypes. We further evaluated HINT’s potential to support practitioners’ evaluation of human-AI interaction experiences pre-deployment through semi-structured interviews with 13 practitioners.

Trade-offs in Sampling and Search for Early-stage Interactive Text Classification

For many automated classification tasks, collecting labeled data is the key barrier to training a useful supervised model. Interfaces for interactive labeling tighten the loop of labeled data collection and model development, enabling a subject-matter expert to quickly establish the feasibility of a classifier to address a problem of interest. These interactive machine learning (IML) interfaces iteratively sample unlabeled data for annotation, train a new model, and display feedback on the model’s estimated performance. Different sampling strategies affect both the rate at which the model improves and the bias of performance estimates. We compare the performance of three sampling strategies in the “early-stage” of label collection, starting from zero labeled data. By simulating a user’s interactions with an IML labeling interface, we demonstrate a trade-off between improving a text classifier’s performance and computing unbiased estimates of that performance. We show that supplementing early-stage sampling with user-guided text search can effectively “seed” a classifier with positive documents without compromising generalization performance—particularly for imbalanced tasks where positive documents are rare. We argue for the benefits of incorporating search alongside active learning in IML interfaces and identify design trade-offs around the use of non-random sampling strategies.
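
The following sketch illustrates the kind of simulation described above: an early-stage loop that starts from zero labels, optionally seeded by a user-issued keyword search, and that acquires labels either uniformly at random or by uncertainty sampling. The dataset, vectorizer, classifier, and batch sizes are placeholders, not the paper’s setup; the point is that the non-random strategies that speed up learning are also the ones that bias performance estimates computed on the labeled pool.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def seed_by_search(texts, query, k):
    """User-guided search: label documents matching a keyword query first, to 'seed'
    the classifier with positives in imbalanced tasks."""
    return [i for i, t in enumerate(texts) if query.lower() in t.lower()][:k]

def simulate(texts, labels, strategy="uncertainty", rounds=20, batch=10, seed_query=None):
    """Simulated early-stage labeling loop starting from zero labeled data."""
    X = TfidfVectorizer(min_df=2).fit_transform(texts)
    rng = np.random.default_rng(0)
    labeled = set(seed_by_search(texts, seed_query, batch)) if seed_query else set()
    for _ in range(rounds):
        pool = np.array([i for i in range(len(texts)) if i not in labeled])
        if len(pool) == 0:
            break
        idx = sorted(labeled)
        if strategy == "uncertainty" and len(idx) >= 2 and len({labels[i] for i in idx}) == 2:
            clf = LogisticRegression(max_iter=1000).fit(X[idx], [labels[i] for i in idx])
            p = clf.predict_proba(X[pool])[:, 1]
            picks = pool[np.argsort(np.abs(p - 0.5))[:batch]]  # most uncertain documents
        else:
            picks = rng.choice(pool, size=min(batch, len(pool)), replace=False)  # random sampling
        labeled.update(int(i) for i in picks)                  # simulated oracle supplies labels
    return sorted(labeled)
```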

Efficiently correcting machine learning: considering the role of example ordering in human-in-the-loop training of image classification models

Arguably the most popular application task in artificial intelligence is image classification using transfer learning. Transfer learning enables models pre-trained on general classes of images, available in large numbers, to be refined for a specific application. This enables domain experts with their own—generally, substantially smaller—collections of images to build deep learning models. The good performance of such models raises the question of whether it is possible to further reduce the effort required to label training data by adopting a human-in-the-loop interface that presents the expert with the current predictions of the model on a new batch of data and only requires correction of these predictions—rather than de novo labelling by the expert—before retraining the model on the extended data. This paper looks at how to order the data in this iterative training scheme to achieve the highest model performance while minimising the effort needed to correct misclassified examples. Experiments are conducted involving five methods of ordering, four image classification datasets, and three popular pre-trained models. Two of the methods we consider order the examples a priori whereas the other three employ an active learning approach where the ordering is updated iteratively after each new batch of data and retraining of the model. The main finding is that it is important to consider the accuracy of the model in relation to the number of corrections that are required: using accuracy in relation to the number of labelled training examples—as is common practice in the literature—can be misleading. More specifically, active methods require more cumulative corrections than a priori methods for a given level of accuracy. Within their groups, active and a priori methods perform similarly. Preliminary evidence is provided that suggests that for “simple” problems, i.e., those involving fewer examples and classes, no method improves upon random selection of examples. For more complex problems, an a priori strategy based on a greedy sample selection method known as “kernel herding” performs best.
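
Kernel herding, the a priori ordering that performed best on the more complex problems, greedily selects examples so that the running mean of the selected feature vectors tracks the mean of the whole dataset. The sketch below is a generic linear-kernel version over features from a pre-trained model; it illustrates the selection rule only and is not the paper’s implementation.

```python
import numpy as np

def kernel_herding_order(features, n_select=None):
    """Greedy kernel-herding ordering with a linear kernel: repeatedly pick the example
    whose feature vector best matches what is still missing from the selected set."""
    X = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    mu = X.mean(axis=0)                      # kernel mean embedding of the full dataset
    n = len(X) if n_select is None else n_select
    selected, w = [], mu.copy()
    remaining = set(range(len(X)))
    for _ in range(n):
        idx = max(remaining, key=lambda i: X[i] @ w)
        selected.append(idx)
        remaining.remove(idx)
        w = w + mu - X[idx]                  # herding update: steer towards under-represented regions
    return selected
```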

SESSION: Session 8: Learning and Playing

Robinhood’s Forest: A Persuasive Idle Game to Improve Investing Behavior

Smartphone-based trading apps such as Robinhood and Webull have risen in popularity over the past few years. These apps allow investors with little prior investing experience easy and inexpensive (often commission-free) access to trading stocks, options, and other securities. However, non-expert investors using these apps often make poor investing decisions due to behavioral factors. In particular, such investors 1) trade more frequently, leading to short-term speculation rather than reaping the long-term benefits of their investments, 2) make investing decisions based on emotion rather than economic or financial considerations, and 3) under-diversify their portfolio, leading to unnecessarily large risks. Together, these actions reduce their returns as investors and sometimes lead to devastating losses that could have been avoided. This paper introduces Robinhood’s Forest, an idle game that helps non-expert investors improve investing behavior. Unlike conventional digital games that emphasize interactivity, idle games are designed for interpassivity. Idle games are based on the premise that “waiting is playing” and players can derive pleasure by repeating simple actions or automating them. As such, Robinhood’s Forest 1) provides recurring gratification from limiting investing actions and encouraging long-term investing, 2) abstracts representations of investments to reduce overreaction to market news and social pressure, and 3) encourages diversification by using strong visual metaphors. We conducted two small-scale lab studies that demonstrate that Robinhood’s Forest reduced participants’ desire for frequent trading and encouraged them to diversify their portfolios. At the same time, participants in our studies still desired functionality like performance visualizations and market news updates that allowed them to keep up with the market. Based on these findings, we also discuss design implications for other interactive systems that emphasize non-interaction.

“Rather Solve the Problem from Scratch”: Gamesploring Human-Machine Collaboration for Optimizing the Debris Collection Problem

Optimizing operations on critical infrastructure networks is key to reducing the impact of disruptive events. In this paper, we explore the potential of having humans and algorithms work together to address this difficult task. For this purpose, we use a gamified experiment to build and assess this potential in the context of the debris collection problem (i.e., “gamesploring”). We developed a digital game where players can request the help of the computer while facing a multi-objective problem of assigning contractors to road segments for clearing debris in a disaster area. Through a within-subjects experimental study, we assessed how players optimized under various circumstances (e.g., starting from an initial solution vs. from scratch) compared to the computer on its own. The results are both surprising and insightful: they suggest that human-machine collaboration is indeed beneficial but also that more work is needed on how to appropriately guide this form of collaboration.

Agenda- and Activity-Based Triggers for Microlearning

The ubiquity of mobile devices has fueled the popularity of microlearning, namely informal self-directed learning during brief personal downtime. However, learner engagement is challenging to maintain, and microlearning habits are hard to establish. Scheduled reminders are ineffective as they do not match the users’ variable schedules and their intention or capacity to engage. In this paper, we propose a schedule-based and an activity-based trigger for microlearning. The first trigger is sensitive to the learners’ agenda and device status and includes a snooze mechanism. A four-week study (n=10) showed slightly lower response times when compared to triggers scheduled at a fixed time but did not improve learner engagement. The second trigger initiates audio-based microlearning when plugging in headphones. Thus, we minimize the access to personal data and capture a moment where learners engage with their device for a listening activity. In an exploratory user study (n=10), the plugin trigger achieved high compliance rates and was less likely to induce annoyance in users than lock screen notifications. We conclude that intelligent reminders with simple interaction options can contribute to learner engagement.

An Intelligent Pedagogical Agent to Foster Computational Thinking in Open-Ended Game Design Activities

Free-form Game-Design (GD) environments show promise in fostering Computational Thinking (CT) skills at a young age. However, such environments can be challenging for some students due to their highly open-ended nature. Our long-term goal is to alleviate this difficulty via pedagogical agents that can monitor the student interaction with the environment, detect when the student needs help and provide personalized support accordingly. In this paper, we present a preliminary evaluation of one such agent deployed in Unity-CT, a real-world free-form GD learning environment for fostering CT in early K-12 education. We focus on the effect of repetition by comparing student behaviors across no-intervention, 1-shot, and repeated-intervention groups for two different errors that are known to be challenging in the online lessons of the Unity-CT environment. Our findings showed that the agent was perceived very positively by the students and the repeated intervention showed promising results in terms of helping students make fewer errors and exhibit more correct behaviors, albeit only for one of the two target errors. Based on these results, we provide insights on how to improve the delivery of the agent’s interventions in free-form GD environments.

SoftVideo: Improving the Learning Experience of Software Tutorial Videos with Collective Interaction Data

Many people rely on tutorial videos when learning to perform tasks using complex software. Watching the video for instructions and applying them to the target software requires frequently going back and forth between the two, which incurs cognitive overhead. Furthermore, users need to constantly compare the two to see if they are following correctly, as they are prone to missing out on subtle differences. We propose SoftVideo, a prototype system that helps users plan ahead before watching each step in tutorial videos and provides feedback and help to users on their progress. SoftVideo is powered by collective interaction data, as experiences of previous learners with the same goal can provide insights into how they learned from the tutorial. By identifying the difficulty and relatedness of each step from the interaction logs, SoftVideo provides information on each step such as its estimated difficulty, lets users know if they completed or missed a step, and suggests tips such as relevant steps when it detects users struggling. To enable such a data-driven system, we collected and analyzed video interaction logs and the associated Photoshop usage logs for two tutorial videos from 120 users. We then defined six metrics that portray the difficulty of each step, including the time taken to complete a step and the number of pauses in a step, which were also used to detect users’ struggling moments by comparing their progress to the collected data. To investigate the feasibility and usefulness of SoftVideo, we ran a user study with 30 participants where they performed a Photoshop task by following along with a tutorial video with SoftVideo. Results show that participants could proactively and effectively plan their pauses and playback speed, and adjust their concentration level. They were also able to identify and recover from errors with the help SoftVideo provides.
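
As an illustration of how two of the six difficulty metrics (time taken per step and number of pauses per step) might be aggregated from collected logs, here is a minimal sketch; the event schema, field names, and step boundaries are hypothetical and not SoftVideo’s actual log format.

```python
from collections import defaultdict

def group_by_user(events):
    """Bucket raw log events by user id."""
    by_user = defaultdict(list)
    for e in events:
        by_user[e["user"]].append(e)
    return by_user

def step_difficulty(events, step_bounds):
    """Aggregate two illustrative difficulty signals per tutorial step: average wall-clock
    time spent in the step and average number of pauses within it.
    `events`: dicts like {"user": ..., "type": "pause", "video_time": 12.3, "wall_time": 105.0}
    `step_bounds`: step id -> (start_sec, end_sec) in the tutorial video (hypothetical schema)."""
    stats = defaultdict(lambda: {"time": [], "pauses": []})
    for user_events in group_by_user(events).values():
        for step, (start, end) in step_bounds.items():
            in_step = [e for e in user_events if start <= e["video_time"] < end]
            if not in_step:
                continue
            wall = [e["wall_time"] for e in in_step]
            stats[step]["time"].append(max(wall) - min(wall))
            stats[step]["pauses"].append(sum(e["type"] == "pause" for e in in_step))
    return {s: {k: sum(v) / len(v) for k, v in d.items()} for s, d in stats.items()}
```

A learner’s live progress can then be compared against these per-step averages to flag struggling moments, for example when their time in a step exceeds the average by some margin.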

SESSION: Session 9: Applications and Tools

Interpretable Aesthetic Analysis Model for Intelligent Photography Guidance Systems

An aesthetics evaluation model is at the heart of predicting users’ aesthetic experience and developing higher-quality user interfaces. However, previous methods for aesthetic evaluation largely ignore the interpretability of the model and are consequently not suitable for many human-computer interaction tasks. We solve this problem by using a hyper-network to learn the overall aesthetic rating as a combination of individual aesthetic attribute scores. We further introduce a specially designed attentional mechanism in the attribute score estimators to enable users to know exactly which parts/elements of visual inputs lead to the estimated score. We demonstrate our idea by designing an intelligent photography guidance system. Computational results and user studies demonstrate the interpretability and effectiveness of our method.
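
The following PyTorch sketch shows one way the two ideas above could be wired together: per-attribute score estimators with spatial attention (so the contributing regions are inspectable) and a hyper-network that predicts the weights used to combine attribute scores into the overall rating. The backbone features, number of attributes, and layer sizes are assumptions; the paper’s actual architecture may differ.

```python
import torch
import torch.nn as nn

class InterpretableAesthetics(nn.Module):
    """Illustrative sketch: attribute-wise attention scorers plus a hyper-network that
    predicts per-image combination weights, keeping the final rating inspectable."""
    def __init__(self, feat_dim=512, n_attrs=5):
        super().__init__()
        self.attn = nn.ModuleList([nn.Conv2d(feat_dim, 1, 1) for _ in range(n_attrs)])
        self.score = nn.ModuleList([nn.Linear(feat_dim, 1) for _ in range(n_attrs)])
        self.hyper = nn.Sequential(nn.Linear(feat_dim, n_attrs), nn.Softmax(dim=-1))

    def forward(self, feats):                        # feats: (B, C, H, W) backbone features
        pooled = feats.mean(dim=(2, 3))              # (B, C) global context for the hyper-network
        weights = self.hyper(pooled)                 # (B, n_attrs) per-image combination weights
        attr_scores, attn_maps = [], []
        for attn, score in zip(self.attn, self.score):
            a = torch.softmax(attn(feats).flatten(1), dim=1)       # (B, H*W): where this score looks
            v = (feats.flatten(2) * a.unsqueeze(1)).sum(-1)        # (B, C) attended features
            attr_scores.append(score(v))                           # (B, 1) attribute score
            attn_maps.append(a)
        attrs = torch.cat(attr_scores, dim=1)                      # (B, n_attrs)
        overall = (weights * attrs).sum(dim=1, keepdim=True)       # interpretable combination
        return overall, attrs, weights, attn_maps
```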

VideoSticker: A Tool for Active Viewing and Visual Note-taking from Videos

Video is an effective medium for knowledge communication and learning. Yet active viewing and note-taking from videos remain a challenge. Specifically, during note-taking, viewers find it difficult to extract essential information such as representation, composition, motion, and interactions of graphical objects and narration. Current approaches rely on creating static screenshots, manual clipping, manual annotation and transcription. This is often done by repeatedly pausing and rewinding the video, thus disrupting the viewing experience. We propose VideoSticker, a tool designed to support visual note-taking by extracting expressive content and narratives from videos as ‘object stickers.’ VideoSticker implements automated object detection and tracking, linking objects to the transcript, and supporting rapid extraction of stickers across space, time, and events of interest. VideoSticker’s two-pass approach allows viewers to capture high-level information uninterrupted and later extract specific details. We demonstrate the usability of VideoSticker for a variety of videos and note-taking needs.

NewsPod: Automatic and Interactive News Podcasts

News podcasts are a popular medium to stay informed and dive deep into news topics. Today, most podcasts are handcrafted by professionals. In this work, we advance the state-of-the-art in automatically generated podcasts, making use of recent advances in natural language processing and text-to-speech technology. We present NewsPod, an automatically generated, interactive news podcast. The podcast is divided into segments, each centered on a news event, with each segment structured as a Question and Answer conversation, whose goal is to engage the listener. A key aspect of the design is the use of distinct voices for each role (questioner, responder), to better simulate a conversation. Another novel aspect of NewsPod allows listeners to interact with the podcast by asking their own questions and receiving automatically generated answers. We validate the soundness of this system design through two usability studies, focused on evaluating the narrative style and interactions with the podcast, respectively. We find that NewsPod is preferred over a baseline by participants, with 80% claiming they would use the system in the future.

CiteRead: Integrating Localized Citation Contexts into Scientific Paper Reading

When reading a scholarly paper, scientists oftentimes wish to understand how follow-on work has built on or engages with what they are reading. While a paper itself can only discuss prior work, some scientific search engines can provide a list of all subsequent citing papers; unfortunately, they are undifferentiated and disconnected from the contents of the original reference paper. In this work, we introduce a novel paper reading experience that integrates relevant information about follow-on work directly into a paper, allowing readers to learn about newer papers and see how a paper is discussed by its citing papers in the context of the reference paper. We built a tool, called CiteRead, that implements the following three contributions: 1) automated techniques for selecting important citing papers, building on results from a formative study we conducted, 2) an automated process for localizing commentary provided by citing papers to a place in the reference paper, and 3) an interactive experience that allows readers to seamlessly alternate between the reference paper and information from citing papers (e.g., citation sentences), placed in the margins. Based on a user study with 12 scientists, we found that in comparison to having just a list of citing papers and their citation sentences, the use of CiteRead while reading allows for better comprehension and retention of information about follow-on work.
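
As a rough illustration of the localization step, the toy function below places each citing sentence next to the most lexically similar paragraph of the reference paper using TF-IDF cosine similarity; CiteRead’s actual localization technique is more sophisticated, and the threshold here is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def localize_citations(reference_paragraphs, citation_sentences, min_sim=0.2):
    """Toy localizer (not CiteRead's model): map each citing sentence to the reference-paper
    paragraph it is most lexically similar to, or leave it unplaced."""
    vec = TfidfVectorizer(stop_words="english")
    P = vec.fit_transform(reference_paragraphs)
    C = vec.transform(citation_sentences)
    sims = cosine_similarity(C, P)                   # (n_citations, n_paragraphs)
    placements = {}
    for i, row in enumerate(sims):
        j = row.argmax()
        placements[i] = int(j) if row[j] >= min_sim else None   # None -> show in a general margin
    return placements
```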

AQX: Explaining Air Quality Forecast for Verifying Domain Knowledge using Feature Importance Visualization

Air pollution forecasting has become critical because of pollution’s direct impact on human health and its increased production caused by rapid industrialization. Machine learning (ML) solutions are being extensively explored in this domain because they can potentially produce highly accurate results with access to historical data. However, experts in the environmental area are skeptical about adopting ML solutions in real-world applications and policy making due to their black-box nature. In contrast, despite sometimes having lower accuracy, existing traditional simulation models (e.g., CMAQ) are widely used and follow well-defined and transparent equations. Therefore, presenting the knowledge learned by the ML model can make it transparent as well as comprehensible. In addition, validating the ML model’s learning against existing domain knowledge might aid in addressing this skepticism, building appropriate trust, and better utilizing ML models. In collaboration with three experts with an average of five years of research experience in the air pollution domain, we identified the contribution of features (meteorological features such as wind) towards the final forecast as the major information to be verified against domain knowledge. In addition, the accuracy of ML models compared with traditional simulation models, together with raw wind trajectories, is essential for domain experts to validate the feature contributions. Based on the identified information, we designed and developed AQX, a visual analytics system to help experts validate and verify the ML model’s learning with their domain knowledge. The system includes multiple coordinated views to present the contributions of input features at different levels of aggregation in both temporal and spatial dimensions. It also provides a performance comparison of ML and traditional models in terms of accuracy and a spatial map, along with an animation of raw wind trajectories for the input period. We further demonstrated two case studies and conducted expert interviews with two domain experts to show the effectiveness and usefulness of AQX.

Utilizing Core-Query for Context-Sensitive Ad Generation Based on Dialogue

In this work, we present a system that sequentially generates advertisements within the context of a dialogue. Advertisements tailored to the user have long been displayed on digital signage in stores, on web pages, and in smartphone applications. Advertisements will work more effectively if they are aware of the context of the dialogue between the users. Creating an advertising sentence as a query and searching the web by using that query is one way to present a variety of advertisements, but there is currently no method to create an appropriate search query in accordance with the dialogue context. Therefore, we developed a method called the Conversational Context-sensitive Advertisement generator (CoCoA). The novelty of CoCoA is that advertisers simply need to prepare a few abstract phrases, called Core-Queries, and then CoCoA dynamically transforms the Core-Queries into complete search queries in accordance with the dialogue context. Here, “transforms” means to add words related to the context in the dialogue to the prepared Core-Queries. The transformation is enabled by a masked word prediction technique that predicts a word that is hidden in a sentence. Our attempt is the first to apply masked word prediction to a web information retrieval framework that takes into account the dialogue context. We asked users to evaluate the search query presented by CoCoA against the dialogue text of multiple domains prepared in advance and found that CoCoA could present more contextual and effective advertisements than Google Suggest or a method without the query transformation. In addition, we found that CoCoA generated high-quality advertisements that advertisers had not expected when they created the Core-Queries.
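
A minimal sketch of the Core-Query transformation idea, using a standard masked-language-model fill-in: the Core-Query contains a mask token, the recent dialogue is prepended as context, and the model’s predictions yield concrete search queries. The model choice, prompt template, and ranking are assumptions rather than CoCoA’s exact configuration.

```python
from transformers import pipeline

# Illustrative Core-Query expansion via masked word prediction. The model choice and
# prompt template below are assumptions, not CoCoA's exact configuration.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def expand_core_query(core_query, dialogue_context, top_k=3):
    """Turn an abstract Core-Query containing the model's mask token into concrete
    search queries conditioned on the recent dialogue."""
    prompt = f"{dialogue_context} {core_query}"
    candidates = fill_mask(prompt, top_k=top_k)
    mask = fill_mask.tokenizer.mask_token
    return [core_query.replace(mask, c["token_str"].strip()) for c in candidates]

# Example (hypothetical dialogue):
# expand_core_query("restaurants serving [MASK] food",
#                   "User: I really feel like something spicy tonight.")
```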

SESSION: Session 10: Explainable AI (XAI) 2

Embedding Comparator: Visualizing Differences in Global Structure and Local Neighborhoods via Small Multiples

Embeddings mapping high-dimensional discrete input to lower-dimensional continuous vector spaces have been widely adopted in machine learning applications as a way to capture domain semantics. Interviewing 13 embedding users across disciplines, we find comparing embeddings is a key task for deployment or downstream analysis but unfolds in a tedious fashion that poorly supports systematic exploration. In response, we present the Embedding Comparator, an interactive system that presents a global comparison of embedding spaces alongside fine-grained inspection of local neighborhoods. It systematically surfaces points of comparison by computing the similarity of the k-nearest neighbors of every embedded object between a pair of spaces. Through case studies across multiple modalities, we demonstrate our system rapidly reveals insights, such as semantic changes following fine-tuning, language changes over time, and differences between seemingly similar models. In evaluations with 15 participants, we find our system accelerates comparisons by shifting from laborious manual specification to browsing and manipulating visualizations.
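
The core comparison the system computes can be sketched as follows: for every object present in both spaces (rows aligned across the two embedding matrices), measure the overlap of its k-nearest-neighbor sets; low-overlap objects are the ones whose local neighborhoods changed most, e.g. after fine-tuning. The cosine metric and Jaccard overlap below are illustrative choices, not necessarily the system’s exact similarity measure.

```python
import numpy as np

def knn(emb, k):
    """Indices of the k nearest neighbors (cosine similarity) of every row of an embedding matrix."""
    X = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)
    return np.argsort(-sims, axis=1)[:, :k]

def neighborhood_overlap(emb_a, emb_b, k=10):
    """Per-object Jaccard overlap of kNN sets between two aligned embedding spaces.
    Low scores surface the objects whose local neighborhoods differ most."""
    A, B = knn(emb_a, k), knn(emb_b, k)
    return np.array([len(set(a) & set(b)) / len(set(a) | set(b)) for a, b in zip(A, B)])
```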

Intuitively Assessing ML Model Reliability through Example-Based Explanations and Editing Model Inputs

Interpretability methods aim to help users build trust in and understand the capabilities of machine learning models. However, existing approaches often rely on abstract, complex visualizations that poorly map to the task at hand or require non-trivial ML expertise to interpret. Here, we present two interface modules that facilitate intuitively assessing model reliability. To help users better characterize and reason about a model’s uncertainty, we visualize raw and aggregate information about a given input’s nearest neighbors. Using an interactive editor, users can manipulate this input in semantically-meaningful ways, determine the effect on the output, and compare against their prior expectations. We evaluate our approach using an electrocardiogram beat classification case study. Compared to a baseline feature importance interface, we find that 14 physicians are better able to align the model’s uncertainty with domain-relevant factors and build intuition about its capabilities and limitations.

Similarity-Based Explanations meet Matrix Factorization via Structure-Preserving Embeddings

Embeddings are core components of modern model-based Collaborative Filtering (CF) methods, such as Matrix Factorization (MF) and Deep Learning variations. In essence, embeddings are mappings of the original sparse representation of categorical features (e.g., user and items) to dense low-dimensional representations. A well-known limitation of such methods is that the learned embeddings are opaque and hard to explain to the users. On the other hand, a key feature of simpler KNN-based CF models (aka user/item-based CF) is that they naturally yield similarity-based explanations, i.e., similar users/items as evidence to support model recommendations. Unlike related works that try to attribute explicit meaning (via metadata) to the learned embeddings, in this paper, we propose to equip the learned embeddings of MF with meaningful similarity-based explanations. First, we show that the learned user/item embeddings of MF do not preserve the distances between users (or items) in the original rating matrix. Next, we propose a novel approach that initializes Stochastic Gradient Descent (SGD) with user/item embeddings that preserve the structural properties of the original input data. We conduct a broad set of experiments and show that our method enables explanations, very similar to the ones provided by KNN-based approaches, without harming the prediction performance. Moreover, we show that fine-tuning the structure-preserving embeddings may unlock better local minima in the optimization space, leading simple vanilla MF to reach competitive performances with the best-known models for the rating prediction task.
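
One simple way to realize the proposed initialization, sketched below under the assumption of a dense rating matrix with NaNs for missing entries, is to take a truncated SVD of the ratings so that inner products among users and among items are approximately preserved, and then run vanilla SGD matrix factorization from that starting point. This is an illustrative reconstruction, not the authors’ exact procedure.

```python
import numpy as np

def structure_preserving_init(R, d):
    """Illustrative initialization: rank-d SVD of the (zero-filled) rating matrix, so that
    inner products among users and among items are approximately preserved before SGD."""
    U, s, Vt = np.linalg.svd(np.nan_to_num(R), full_matrices=False)
    root = np.sqrt(s[:d])
    return U[:, :d] * root, Vt[:d, :].T * root        # user factors P, item factors Q

def sgd_mf(R, P, Q, lr=0.01, reg=0.05, epochs=20):
    """Vanilla MF trained only on observed ratings, starting from the given factors."""
    users, items = np.where(~np.isnan(R))
    for _ in range(epochs):
        for u, i in zip(users, items):
            err = R[u, i] - P[u] @ Q[i]
            pu = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q
```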

Do People Engage Cognitively with AI? Impact of AI Assistance on Incidental Learning

When people receive advice while making difficult decisions, they often make better decisions in the moment and also increase their knowledge in the process. However, such incidental learning can only occur when people cognitively engage with the information they receive and process this information thoughtfully. How do people process the information and advice they receive from AI, and do they engage with it deeply enough to enable learning? To answer these questions, we conducted three experiments in which individuals were asked to make nutritional decisions and received simulated AI recommendations and explanations. In the first experiment, we found that when people were presented with both a recommendation and an explanation before making their choice, they made better decisions than they did when they received no such help, but they did not learn. In the second experiment, participants first made their own choice, and only then saw a recommendation and an explanation from AI; this condition also resulted in improved decisions, but no learning. However, in our third experiment, participants were presented with just an AI explanation but no recommendation and had to arrive at their own decision. This condition led to both more accurate decisions and learning gains. We hypothesize that learning gains in this condition were due to deeper engagement with explanations needed to arrive at the decisions. This work provides some of the most direct evidence to date that it may not be sufficient to include explanations together with AI-generated recommendations to ensure that people engage carefully with the AI-provided information. This work also presents one technique that enables incidental learning and, by implication, can help people process AI recommendations and explanations more carefully.

Contextualization and Exploration of Local Feature Importance Explanations to Improve Understanding and Satisfaction of Non-Expert Users

The increasing usage of complex Machine Learning models for decision-making has raised interest in explainable artificial intelligence (XAI). In this work, we focus on the effects of providing accessible and useful explanations to non-expert users. More specifically, we propose generic XAI design principles for contextualizing and allowing the exploration of explanations based on local feature importance. To evaluate the effectiveness of these principles for improving users’ objective understanding and satisfaction, we conduct a controlled user study with 80 participants using 4 different versions of our XAI system, in the context of an insurance scenario. Our results show that the contextualization principles we propose significantly improve users’ satisfaction and come close to having a significant impact on users’ objective understanding. They also show that the exploration principles we propose improve users’ satisfaction. On the other hand, the interaction of these principles does not appear to bring improvement on either dimension.

SESSION: Session 11: Natural Language

A Dialogue-Based Interface for Active Learning of Activities of Daily Living

While Human Activity Recognition (HAR) systems may benefit from Active Learning (AL) by allowing users to self-annotate their Activities of Daily Living (ADLs), many proposed methods for collecting such annotations are for short-term data collection campaigns for specific datasets. We present a reusable dialogue-based approach to user interaction for active learning in HAR systems, which utilises a dataset of natural language descriptions of common activities (which we make publicly available) and semantic similarity measures. Our approach involves system-initiated dialogue, including follow-up questions to reduce ambiguity in user responses where appropriate. We apply our work to an existing CASAS dataset in an active learning scenario, to demonstrate our work in context, in which a natural language interface provides knowledge that can help interpret other multi-modal sensor data. We provide results highlighting the potential of our dialogue- and semantic similarity-based approach. We evaluate our work: (i) technically, as an effective way to seek users’ input for active learning of ADLs; and (ii) qualitatively, through a user study in which users were asked to use our approach and an established method, and to subsequently compare the two. Results show the potential of our approach as a user-friendly mechanism for annotation of sensor data as part of an active learning system.
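
A minimal sketch of the semantic-similarity step only: a user’s free-text answer is mapped onto the closest known ADL label, and a follow-up question is generated when the match is too ambiguous. The sentence-encoder model, label set, and threshold are assumptions, not the paper’s configuration.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical label set and model choice for illustration only.
model = SentenceTransformer("all-MiniLM-L6-v2")
ADL_LABELS = ["preparing a meal", "washing dishes", "sleeping", "watching TV", "taking medication"]

def interpret_answer(user_text, threshold=0.5):
    """Map a free-text activity description to the most similar ADL label, or ask a follow-up."""
    sims = util.cos_sim(model.encode(user_text, convert_to_tensor=True),
                        model.encode(ADL_LABELS, convert_to_tensor=True))[0]
    best = int(sims.argmax())
    if float(sims[best]) < threshold:
        return None, f"Did you mean '{ADL_LABELS[best]}', or something else?"  # follow-up question
    return ADL_LABELS[best], None
```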

BeParrot: Efficient Interface for Transcribing Unclear Speech via Respeaking

Transcribing speech from audio files to text is an important task not only for exploring the audio content in text form but also for utilizing the transcribed data as a source to train speech models, such as automated speech recognition (ASR) models. A post-correction approach, in which users edit errors in the recognition results of ASR models, has frequently been employed to reduce the time cost of transcription. However, this approach assumes clear speech and is not designed for unclear speech (such as speech with high levels of noise or reverberation), which severely degrades the accuracy of ASR and requires many manual corrections. To construct an alternative approach to transcribe unclear speech, we introduce the idea of respeaking, which has primarily been used to create captions for television programs in real time. In respeaking, a proficient human respeaker repeats the speech they hear, as in shadowing, and their utterances are recognized by an ASR model. While this approach can be effective for transcribing unclear speech, one problem is that respeaking is a highly cognitively demanding task and extensive training is often required to become a respeaker. We address this point with BeParrot, the first interface designed for respeaking that allows novice users to benefit from respeaking without extensive training through two key features: parameter adjustment and pronunciation feedback. Our user study involving 60 crowd workers demonstrated that they could transcribe different types of unclear speech 32.2% faster with BeParrot than with a conventional approach without losing the accuracy of transcriptions. In addition, comments from the workers supported the design of the adjustment and feedback features, exhibiting a willingness to continue using BeParrot for transcription tasks. Our work demonstrates how recent advances in machine learning can be leveraged, with the help of a human-in-the-loop approach, to address tasks that remain challenging for computers on their own.

Wordcraft: Story Writing With Large Language Models

The latest generation of large neural language models such as GPT-3 has achieved new levels of performance on benchmarks for language understanding and generation. These models have even demonstrated an ability to perform arbitrary tasks without explicit training. In this work, we sought to learn how people might use such models in the process of creative writing. We built Wordcraft, a text editor in which users collaborate with a generative language model to write a story. We evaluated Wordcraft with a user study in which participants wrote short stories with and without the tool. Our results show that large language models enable novel co-writing experiences. For example, the language model is able to engage in open-ended conversation about the story, respond to writers’ custom requests expressed in natural language (such as “rewrite this text to be more Dickensian”), and generate suggestions that serve to unblock writers in the creative process. Based on these results, we discuss design implications for future human-AI co-writing systems.

KWickChat: A Multi-Turn Dialogue System for AAC Using Context-Aware Sentence Generation by Bag-of-Keywords

We present KWickChat (Keyword Quick Chat): a multi-turn augmentative and alternative communication (AAC) dialogue system for nonspeaking individuals with motor disabilities. The central objective of KWickChat is to reduce the communication gap between nonspeaking and speaking partners by exploring a sentence-based text entry system that automatically generates suitable sentences for the nonspeaking partner based on keyword entry. The system is underpinned by a GPT-2 language model and leverages context information, including dialogue history and persona tags, to improve the quality of the generated responses. We evaluate the system by analyzing the functional design and decomposing it into key functions and parameters that are systematically investigated using envelope analysis. We pursue this methodology as a necessary precursor to evaluation with AAC users. Our results show that with word prediction and with a threshold word error rate of 0.65, the keystroke savings of the KWickChat system is around 71%. To complement the envelope analysis, we also recruited two human judges to evaluate the semantic consistency between 400 sentences generated by KWickChat and reference sentences. Both judges reported a median rating of 4 on a scale from 1 (very bad) to 5 (very good) for the best generated sentence in each exchange and achieved an inter-rater reliability of 0.92 across all 400 sentences judged.
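
For reference, keystroke savings is conventionally computed as the fraction of characters the user did not have to type; the small sketch below shows the arithmetic behind a figure like the reported ~71% (the exact counting rules for keyword entry and selections are assumed, not taken from the paper).

```python
def keystroke_savings(full_text_chars, actual_keystrokes):
    """Conventional keystroke-savings metric: fraction of characters not typed."""
    return 1.0 - actual_keystrokes / full_text_chars

# e.g. entering a few keywords plus a selection (29 keystrokes) instead of a
# 100-character sentence gives keystroke_savings(100, 29) == 0.71, i.e. ~71% savings.
```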

Does Using Voice Authentication in Multimodal Systems Correlate With Increased Speech Interaction During Non-critical Routine Tasks?

Multimodal systems offer their functionalities through multiple communication channels. A messenger application may take either keyboard or voice input, and present incoming messages as text or audio output. This allows the users to communicate with their devices using the modality that best suits their context and personal preference. Authentication is often the first interaction with an application. The users’ login behavior can thus be used to immediately adapt the communication channel to their preferences. Yet given the sensitive nature of authentication, this interaction may not be representative of the user’s inclination to use speech input in non-critical routine tasks. In this paper, we test whether the interactions during authentication differ from non-critical routine tasks in a smart home application. Our findings indicate that, even in such a private space, the authentication behavior does not correlate with the use of, nor with the perceived usability of, speech input during non-critical tasks. We further find that short interactions with the system are not indicative of the user’s attitude towards audio output, independent of whether authentication or non-critical tasks are performed. Since security concerns are minimized in the secure environment of private spaces, our findings can be generalized to other contexts where security threats are even more apparent.

iSEA: An Interactive Pipeline for Semantic Error Analysis of NLP Models

Error analysis in NLP models is essential to successful model development and deployment. One common approach for diagnosing errors is to identify subpopulations in the dataset where the model produces the most errors. However, existing approaches typically define subpopulations based on pre-defined features, which requires users to form hypotheses of errors in advance. To complement these approaches, we propose iSEA, an Interactive Pipeline for Semantic Error Analysis in NLP Models, which automatically discovers semantically-grounded subpopulations with high error rates in the context of a human-in-the-loop interactive system. iSEA enables model developers to learn more about their model errors through discovered subpopulations, validate the sources of errors through interactive analysis on the discovered subpopulations, and test hypotheses about model errors by defining custom subpopulations. The tool supports semantic descriptions of error-prone subpopulations at the token and concept level, as well as pre-defined higher-level features. Through use cases and expert interviews, we demonstrate how iSEA can assist error understanding and analysis.
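
To ground the idea of token-level subpopulations, here is a toy discovery routine that flags tokens whose presence is associated with a much higher error rate than the overall baseline; iSEA itself learns interpretable rules and also supports concept-level and higher-level features, so this is only an illustration, and the input schema and thresholds are assumptions.

```python
from collections import defaultdict

def token_level_subpopulations(examples, min_support=30):
    """Toy token-level subpopulation discovery (not iSEA's actual rule miner): report tokens
    whose presence coincides with a much higher error rate than the baseline.
    `examples`: dicts like {"tokens": [...], "is_error": bool} (hypothetical schema)."""
    overall = sum(e["is_error"] for e in examples) / len(examples)
    counts, errors = defaultdict(int), defaultdict(int)
    for e in examples:
        for tok in set(e["tokens"]):
            counts[tok] += 1
            errors[tok] += e["is_error"]
    report = []
    for tok, n in counts.items():
        if n >= min_support:
            rate = errors[tok] / n
            if rate > 2 * overall:                   # crude threshold; iSEA learns rules instead
                report.append((tok, n, rate))
    return sorted(report, key=lambda r: -r[2])
```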