It has been nearly 20 years since the U.S. Supreme Court ruled on the use of race/ethnicity, gender, and national origin in university admissions in the University of Michigan cases. It is now 2021 and universities are still struggling with how to diversify their admissions offers within the bounds of the law. In response to this ongoing issue, I created Applications Quest, an equitable AI tool that adheres to the legal use of race, gender, national origin, etc. in admissions and hiring decisions. In this keynote address, I will explain how AI, specifically, Applications Quest, can be used to create equitable recommendations for admissions decisions. I will give a demonstration of how Applications Quest increases diversity compared to admissions committees while achieving the same academic achievement levels as the committee. Given Applications Quest is an unsupervised AI, it has the advantage of ignorance of race/ethnicity, gender, national origin, etc. Therefore, when used in admissions, it provides unbiased recommendations that can be interrogated by human evaluators. Applications Quest is a human-centered AI tool for achieving equity in admissions.
Visualization transforms large quantities of data into pictures in which relations, patterns, or trends of interest in the data reveal themselves to effectively guide the user in the data reasoning and discovery process. Visualization has become an essential tool in many areas of study that use a data-driven approach to problem solving and decision making. However, when the data is large, relational, or high-dimensional, it can take both novices and experts substantial effort to derive and interpret visualization results from the data. Following the resurgence of AI and machine learning technology in recent years, there is also growing interest and opportunity in the field of visualization in applying AI and machine learning to perform data transformation and to assist in the generation and interpretation of visualization, aiming to strike a balance between cost and performance. In this talk, I will present designs from my group that make effective use of machine learning for general data visualization and analytics tasks [1, 2, 3, 4, 5, 6], resulting in better visualization interfaces into the data.
The World Health Organization estimates that over one billion people worldwide are disabled. Innovations at the intersection of AI and HCI have the potential to increase the accessibility of the digital and physical worlds for people experiencing long-term, temporary, and/or situational disabilities. Considering accessibility scenarios can illuminate opportunities and challenges for designers of intelligent user interfaces. In this keynote, I will use two scenarios to illustrate this concept: automatic alt text generation for images and augmentative and alternative communication technologies.
Alternative text (“alt text”) descriptions can be read aloud by screen reader software to increase image accessibility to people who are blind or have low vision. Many content authors fail to provide alt text metadata, leaving billions of digital images inaccessible to screen reader users. Advances in vision-to-language technologies offer promise for scaling the accessibility of digital imagery, but also present challenges such as user-understandable error metrics and the selection of relevant details.
Augmentative and alternative communication (“AAC”) technologies facilitate communication for people with speech disabilities. Many users with extremely limited mobility rely on eye gaze input to control AAC, typically resulting in limited communication bandwidth of 10 – 20 words per minute (compared with nearly 200 words per minute for spoken English). Advances in predictive language technologies have the potential to enhance the speed and expressivity of AAC communications, but present challenges around preserving user autonomy and authenticity.
For both scenarios (automatic alt text and predictive AAC), I will share research on end-user preferences that can inform technology design, as well as present novel prototypes that combine human and machine intelligence to support these user needs. I will close by identifying opportunities for future work at the intersection of intelligent user interfaces and accessibility.
Digital health research—the investigation of how technology can be designed to support wellbeing—has exploded in recent years. Much of this innovation has stemmed from advances in the fields of human-computer interaction and artificial intelligence. A growing segment of this work is examining how information and communication technologies (ICTs) can be used to achieve health equity, that is, fair opportunities for all people to live a healthy life. Such advances are sorely needed, as there exist large disparities in morbidity and mortality across population groups. These disparities are due in large part to social determinants of health, that is, social, physical, and economic conditions that disproportionately inhibit wellbeing in populations such as low-socioeconomic status and racial and ethnic minority groups.
Despite years of digital health research and commercial innovation, profound health disparities persist. In this talk, I will argue that to reduce health disparities, ICTs must address social determinants of health. Intelligent interfaces have much to offer in this regard, and yet their affordances—such as the ability to deliver personalized health interventions—can also act as pitfalls. For example, a focus on personalized health interventions has led to the design of various interfaces focused on individual-level behavior change. While such innovations are important, to achieve health equity there is also a need for complementary systems that address social relationships. Social ties are a crucial point of focus for digital health research as they can provide meaningful supports for positive health, especially in populations that disproportionately experience health barriers. I will offer a vision for health equity research in which interactive and intelligent systems are designed to help people build social relationships that support wellbeing. By conceptualizing the purview of digital health research as encompassing not only individual but also social change, there is tremendous opportunity to create disruptive health interventions that help achieve health equity.
Within the ongoing process of defining autonomous driving solutions, experience design may represent an important interface between humans and the autonomous vehicle. This paper presents an empirical study that uses different ways of unimodal communication in autonomous driving to communicate awareness and intent of autonomous vehicles. The goal is to provide recommendations for feedback solutions within holistic autonomous driving experiences. Twenty-two test subjects took part in four autonomous, simulated virtual reality shuttle rides and were presented with different unimodal feedback in the form of light, sound, visualisation, text and vibration. The empirical study showed that, compared to a no-feedback baseline ride, light and visualisation were able to create a positive user experience.
Co-speech gestures, gestures that accompany speech, play an important role in human communication. Automatic co-speech gesture generation is thus a key enabling technology for embodied conversational agents (ECAs), since humans expect ECAs to be capable of multi-modal communication. Research into gesture generation is rapidly gravitating towards data-driven methods. Unfortunately, individual research efforts in the field are difficult to compare: there are no established benchmarks, and each study tends to use its own dataset, motion visualisation, and evaluation methodology. To address this situation, we launched the GENEA Challenge, a gesture-generation challenge wherein participating teams built automatic gesture-generation systems on a common dataset, and the resulting systems were evaluated in parallel in a large, crowdsourced user study using the same motion-rendering pipeline. Since differences in evaluation outcomes between systems now are solely attributable to differences between the motion-generation methods, this enables benchmarking recent approaches against one another in order to get a better impression of the state of the art in the field. This paper reports on the purpose, design, results, and implications of our challenge.
Research in interpretable machine learning proposes different computational and human subject approaches to evaluate model saliency explanations. These approaches measure different qualities of explanations to achieve diverse goals in designing interpretable machine learning systems. In this paper, we propose a benchmark for image and text domains using multi-layer human attention masks aggregated from multiple human annotators. We then present an evaluation study to compare model saliency explanations obtained using Grad-CAM and LIME techniques to human understanding and acceptance. We demonstrate our benchmark’s utility for quantitative evaluation of model explanations by comparing it with human subjective ratings and with evaluations against ground-truth single-layer segmentation masks. Our study results show that our threshold-agnostic evaluation method using the human attention baseline is more effective than evaluation against single-layer object segmentation masks as ground truth. Our experiments also reveal user biases in the subjective rating of model saliency explanations.
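A minimal sketch, under assumed data shapes and not the paper's benchmark code, of a threshold-agnostic comparison of a model saliency map against an aggregated human attention mask: scoring pixels with ROC AUC avoids having to pick a single binarization threshold for the saliency map.

```python
# Hedged sketch: compare a continuous saliency map to a human attention mask
# without committing to one threshold, by treating the human mask as pixel
# labels and the saliency values as scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def saliency_vs_human_auc(saliency: np.ndarray, human_mask: np.ndarray) -> float:
    """saliency: continuous map in [0, 1]; human_mask: aggregated annotator counts."""
    human_binary = (human_mask > 0).astype(int).ravel()  # any annotator marked the pixel
    return roc_auc_score(human_binary, saliency.ravel())
```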
Work in AI-based explanation systems has uncovered an interesting contradiction: people prefer and learn best from why explanations but expert esports commentators primarily answer what questions when explaining complex behavior in real-time strategy games. Three possible explanations for this contradiction are: 1.) broadcast audiences are well-informed and do not need why explanations; 2.) consuming why explanations in real-time is too cognitively demanding for audiences; or 3.) producing live why explanations is too difficult for commentators. We answer this open question by investigating the effects of explanation types and presentation modalities on audience recall and cognitive load in the context of an esports broadcast. We recruit 111 Dota 2 players and split them into three groups: the first group views a Dota 2 broadcast, the second group has the addition of an interactive map that provides what explanations, and the final group receives the interactive map with detailed why explanations. We find that participants who receive short interactive text prompts that provide what explanations outperform the no explanation group on a multiple-choice recall task. We also find that participants who receive detailed why explanations submit reports of cognitive load that are higher than the no explanation group. Our evidence supports the conclusion that informed audiences benefit from explanations but do not have the cognitive resources to process why answers in real-time. It also supports the conclusion that stacked explanation interventions across different modalities, like audio, interactivity, and text, can aid real-time comprehension when attention resources are limited. Together, our results indicate that interactive multimedia interfaces can be leveraged to quickly guide attention and provide low-cost explanations to improve intelligibility when time is too scarce for cognitively demanding why explanations.
Recent content creation systems allow users to generate various high-quality content (e.g., images, 3D models, and melodies) by just specifying a parameter set (e.g., a latent vector of a deep generative model). The task here is to search for an appropriate parameter set that produces the desired content. To facilitate this task execution, researchers have investigated user-in-the-loop optimization, where the system samples candidate solutions, asks the user to provide preferential feedback on them, and iterates this procedure until finding the desired solution. In this work, we investigate a novel approach to enhance this interactive process: allowing users to control the sampling behavior. More specifically, we allow users to adjust the balance between exploration (i.e., favoring diverse samples) and exploitation (i.e., favoring focused samples) in each iteration. To evaluate how this approach affects the user experience and optimization behavior, we implement it into a melody composition system that combines a deep generative model with Bayesian optimization. Our experiments suggest that this approach could improve the user’s engagement and optimization performance.
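A minimal sketch (not the authors' system) of how a user-controlled exploration/exploitation balance can be wired into a Bayesian-optimization-style sampling loop: a single slider value `kappa` scales the uncertainty term of an upper-confidence-bound score, so larger values favor diverse samples and smaller values favor focused samples. All names below are illustrative.

```python
# Rank candidate latent vectors by an upper-confidence-bound acquisition score
# whose exploration weight is set directly by the user.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def propose_candidates(gp: GaussianProcessRegressor, candidates: np.ndarray,
                       kappa: float, n_samples: int = 4) -> np.ndarray:
    mean, std = gp.predict(candidates, return_std=True)
    ucb = mean + kappa * std          # kappa comes from the user's slider
    top = np.argsort(-ucb)[:n_samples]
    return candidates[top]

# Usage: fit the surrogate on (latent vector, preference rating) pairs gathered
# so far, then sample the next batch of melodies/images to show the user.
rng = np.random.default_rng(0)
X_rated = rng.normal(size=(10, 8))             # latent vectors already rated
y_rated = rng.uniform(size=10)                 # the user's preference scores
gp = GaussianProcessRegressor().fit(X_rated, y_rated)
pool = rng.normal(size=(500, 8))               # freshly sampled latent vectors
next_batch = propose_candidates(gp, pool, kappa=2.5)   # an exploratory round
```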
When prototyping AI experiences (AIX), interface designers seek useful and usable ways to support end-user tasks through AI capabilities. However, AI poses challenges to design due to its dynamic behavior in response to training data, end-user data, and feedback. Designers must consider AI’s uncertainties and offer adaptations such as explainability, error recovery, and automation vs. human task control. Unfortunately, current prototyping tools assume a black-box view of AI, forcing designers to work with separate tools to explore machine learning models, understand model performance, and align interface choices with model behavior. This introduces friction to rapid and iterative prototyping. We propose Model-Informed Prototyping (MIP), a workflow for AIX design that combines model exploration with UI prototyping tasks. Our system, ProtoAI, allows designers to directly incorporate model outputs into interface designs, evaluate design choices across different inputs, and iteratively revise designs by analyzing model breakdowns. We demonstrate how ProtoAI can readily operationalize human-AI design guidelines. Our user study finds that designers can effectively engage in MIP to create and evaluate AI-powered interfaces during AIX design.
Approximately 50% of development resources are devoted to UI development tasks [8]. Within this large share of development effort, developing icons can be a time-consuming task, because developers need to consider not only effective implementation methods but also easy-to-understand descriptions. In this study, we define 100 icon classes through iterative open coding of an existing icon design sharing website. Based on a deep learning model and computer vision methods, we propose an approach to automatically convert icon images to fonts with descriptive labels, thereby reducing the laborious manual effort for developers and facilitating UI development. We quantitatively evaluate the quality of our method in a real-world UI development environment and demonstrate that our method offers developers accurate, efficient, readable, and usable code for icon images, saving 65.2% of development time.
During the design of graphical user interfaces (GUIs), one typical objective is to ensure compliance with pertinent style guides, ongoing design practices, and design systems. However, designing compliant layouts is challenging, time-consuming, and can distract creative thinking in design. This paper presents a method for interactive layout transfer, where the layout of a source design – typically an initial rough working draft – is transferred automatically using a selected reference/template layout while complying with relevant guidelines. Our integer programming (IP) method extends previous work in two ways: first, by showing how to transform a rough draft into the final target layout using a reference template and, second, by extending IP-based approaches to adhere to guidelines. We demonstrate how to integrate the method into a real-time interactive GUI sketching tool. Evaluation results are presented from a case study and from an online experiment where the perceived quality of layouts was assessed.
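A toy sketch of the integer-programming idea described above, using pulp rather than the paper's solver or constraint set: each draft element's position is moved as close as possible to the corresponding reference-template position while a guideline-style minimum gap between consecutive elements is enforced. The positions and the gap rule are illustrative assumptions.

```python
# Minimal IP: snap a rough draft's x-positions toward a reference template
# subject to a minimum-spacing guideline, minimizing total deviation.
import pulp

draft_x = [10, 40, 55]          # rough working-draft positions (illustrative)
template_x = [16, 48, 80]       # positions in the selected reference layout
min_gap = 24                    # e.g. a style-guide spacing rule

prob = pulp.LpProblem("layout_transfer", pulp.LpMinimize)
x = [pulp.LpVariable(f"x{i}", lowBound=0, cat="Integer") for i in range(3)]
dev = [pulp.LpVariable(f"d{i}", lowBound=0) for i in range(3)]

for i in range(3):
    # dev[i] >= |x[i] - template_x[i]| via two linear constraints
    prob += x[i] - template_x[i] <= dev[i]
    prob += template_x[i] - x[i] <= dev[i]
for i in range(2):
    prob += x[i + 1] - x[i] >= min_gap        # guideline compliance

prob += pulp.lpSum(dev)                       # objective: stay near the template
prob.solve(pulp.PULP_CBC_CMD(msg=False))
final_x = [int(v.value()) for v in x]
```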
We propose a new method for explanations in Artificial Intelligence (AI) and a tool to test its expressive power within a user interface. In order to bridge the gap between philosophy and human-computer interfaces, we show a new approach for the generation of interactive explanations based on a sophisticated pipeline of AI algorithms for structuring natural language documents into knowledge graphs and answering questions effectively and satisfactorily. Among the mainstream philosophical theories of explanation, we identified one that in our view is most easily applicable as a practical model for user-centric tools: Achinstein’s Theory of Explanation. With this work we aim to show that Achinstein’s theory can actually be adapted and implemented in a concrete software application, as an interactive process answering questions. To this end, we found a way to handle the generic (archetypal) questions that implicitly characterise an explanatory process as preliminary overviews rather than as answers to explicit questions, as commonly understood. To show the expressive power of this approach, we designed and implemented a pipeline of AI algorithms for the generation of interactive explanations in the form of overviews, focusing on this aspect of explanations rather than on existing interfaces and presentation-logic layers for question answering. Accordingly, through the identification of a minimal set of archetypal questions, it is possible to create a generator of explanatory overviews that is generic enough to significantly ease the acquisition of knowledge by humans, regardless of the specificities of the users beyond a minimum set of very broad requirements (e.g. the ability to read and understand English and to perform basic common-sense reasoning). We tested our hypothesis on a well-known XAI-powered credit approval system by IBM, comparing CEM, a static explanatory tool for post-hoc explanations, with an extension we developed that adds interactive explanations based on our model. The results of the user study, involving more than 100 participants, showed that our proposed solution produced a statistically significant improvement in effectiveness (U=931.0, p=0.036) over the baseline, thus giving evidence in favour of our theory.
One challenge in highly automated driving is the safe transfer of control (ToC). A safe ToC requires estimating the take-over time depending on the driver’s state in different environmental conditions, to adapt the timing and design of the ToC request. We introduce environmental complexity as one factor that affects the ToC time. In a driving simulator experiment (N=12), the participants drove in five scenes having different environmental complexities (i.e. density and height of the background objects) with and without a secondary task. The results revealed that the ToC time is proportional to the environmental complexity. Thus, in the same driving task and the same traffic, an increasing environmental complexity yields higher ToC times in both conditions, with and without a secondary task. Our model of environmental complexity is a first step towards measuring the complexity of the real world, for a better prediction of ToC times in highly automated driving.
Watching TV has become a side event rather than a deliberate pastime. Movie directors thus struggle to find new ways to sustain the attention of their audience. Interactive movies usually require the viewer to actively decide how the plot progresses, creating an experience more akin to video games than film. In this paper, we propose a system that analyses gaze data to personalize the plot of a video without the viewer’s active intervention. User preferences are inferred from their gaze allocation to different elements in a scene. The subsequent scene is then dynamically tailored towards the user’s predicted preference. In a user study (N = 175), we evaluate the effectiveness of the system with regard to user engagement. Our findings show that personalized videos have a positive effect on focused attention and involvement, whereas novelty perception is not significantly affected.
As the use of AI algorithms keeps rising, so does the need for their transparency and accountability. However, the literature often adopts a one-size-fits-all approach to developing explanations, when in practice the type of explanation needed depends on the type of end-user. This research looks at user expertise as a variable to see how different levels of expertise influence the understanding of explanations. The first iteration consists of developing two common types of explanations (visual and textual explanations) that explain predictions made by a general class of predictive model learners. These explanations are then evaluated by users of different expertise backgrounds to compare the understanding and ease-of-use of each type of explanation with respect to the different expertise groups. Results show strong differences between experts and lay users when using visual and textual explanations, as well as a preference among lay users for visual explanations, with which they perform significantly worse. To address this problem, the second iteration of this research focuses on the shortcomings of the first two explanation types and tries to minimize the difference in understanding between both expertise groups. This is done by developing and testing a candidate solution in the form of hybrid explanations, which combine both visual and textual explanations. This hybrid form of explanation shows a significant improvement in correct understanding (for lay users in particular) compared to visual explanations, without compromising ease-of-use.
The opaque nature of many intelligent systems violates established usability principles and thus presents a challenge for human-computer interaction. Research in the field therefore highlights the need for transparency, scrutability, intelligibility, interpretability and explainability, among others. While all of these terms carry a vision of supporting users in understanding intelligent systems, the underlying notions and assumptions about users and their interaction with the system often remain unclear.
We review the literature in HCI through the lens of implied user questions to synthesise a conceptual framework integrating user mindsets, user involvement, and knowledge outcomes to reveal, differentiate and classify current notions in prior work. This framework aims to resolve conceptual ambiguity in the field and enables researchers to clarify their assumptions and become aware of those made in prior work. We thus hope to advance and structure the dialogue in the HCI research community on supporting users in understanding intelligent systems.
Keyboard interaction patterns on a smartphone are the input for many intelligent emotion-aware applications, such as adaptive interfaces, optimized keyboard layouts, and automatic emoji recommendation in IM applications. The simplest approach, called the Experience Sampling Method (ESM), is to systematically gather self-reported emotion labels from users, which act as the ground truth labels, and build a supervised prediction model for emotion inference. However, as manual self-reporting is fatigue-inducing and attention-demanding, the self-report requests have to be scheduled at favorable moments to ensure high-fidelity responses. In this paper, we perform fine-grained keyboard interaction analysis to determine suitable probing moments. Keyboard interaction patterns, both cadence and latency between strokes, translate nicely to frequency- and time-domain analysis. We perform a 3-week in-the-wild study (N = 22) to log keyboard interaction patterns and self-report details indicating (in)opportune probing moments. Analysis of the dataset reveals that time-domain features (e.g., session length, session duration) and frequency-domain features (e.g., number of peak amplitudes, value of peak amplitude) vary significantly between opportune and inopportune probing moments. Driven by these analyses, we develop a generalized (all-user) Random Forest based model, which can identify the opportune probing moments with an average F-score of 93%. We also carry out an explainability analysis of the model using SHAP (SHapley Additive exPlanations), which reveals that the session length and peak amplitude have the strongest influence in determining the probing moments.
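A hedged sketch, not the paper's code, of the modeling pipeline the abstract describes: a generalized Random Forest classifier over time- and frequency-domain keyboard features, followed by a SHAP analysis of which features drive the opportune-moment prediction. The feature names and synthetic data below are illustrative assumptions.

```python
# Random Forest for opportune-vs-inopportune probing moments + SHAP attributions.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "session_length": rng.integers(5, 200, 500),     # keystrokes per session
    "session_duration": rng.uniform(2, 300, 500),    # seconds
    "n_peak_amplitudes": rng.integers(1, 20, 500),   # frequency-domain peaks
    "peak_amplitude": rng.uniform(0.1, 5.0, 500),
})
y = rng.integers(0, 2, 500)                          # 1 = opportune moment (placeholder labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)            # per-feature attributions per example
```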
People spend an enormous amount of time and effort looking for lost objects. To help remind people of the location of lost objects, various computational systems that provide information on their locations have been developed. However, prior systems for assisting people in finding objects require users to register the target objects in advance. This requirement imposes a cumbersome burden on the users, and the system cannot help remind them of unexpectedly lost objects. We propose GO-Finder (“Generic Object Finder”), a registration-free wearable-camera-based system for assisting people in finding an arbitrary number of objects, based on two key features: automatic discovery of hand-held objects and image-based candidate selection. Given a video taken from a wearable camera, GO-Finder automatically detects and groups hand-held objects to form a visual timeline of the objects. Users can retrieve the last appearance of an object by browsing the timeline through a smartphone app. We conducted a user study to investigate how users benefit from using GO-Finder and confirmed improved accuracy and reduced mental load in the object search task, owing to the clear visual cues on object locations.
When designing a smart glove for gesture recognition, the set of sensors available and their layout on the glove are crucial. However, once a computational model reaches acceptable recognition accuracy, it is often not clear which sensors are more important for the task, nor whether some sensors can be strategically removed while retaining similar performance in order to save cost. Furthermore, when aiming for a personalised setup, there can be minor deviations in how gestures are performed by each participant, so the importance of a sensor may vary between participants. In this paper, we use feature selection to explore whether a personalised glove can be produced and whether the set of significant sensors persists between users. We present a deep learning algorithm which utilises a layer of weights to estimate the importance of each sensor in relation to the others. Besides estimating importance in relation to recognition accuracy, we demonstrate how the importance estimates can be extended to take into account factors external to the computational model, such as cost. This allows for a cost-effective elimination of sensors to reduce hardware redundancy whilst having a controlled impact on performance. We provide two methods: generic and specific. The generic method exploits the importance estimates from all participants to select a set of sensors for removal, whereas the specific method estimates importance and removes sensors based on individual participants to provide a personalised setup.
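A minimal sketch, under assumptions and not the authors' architecture, of the "layer of weights" idea: each sensor channel is scaled by a trainable weight before the classifier, and the magnitude of the learned weights serves as a per-sensor importance estimate that could later be traded off against sensor cost.

```python
# Trainable per-sensor importance weights in front of a small gesture classifier.
import torch
import torch.nn as nn

class SensorWeightedClassifier(nn.Module):
    def __init__(self, n_sensors: int, n_gestures: int):
        super().__init__()
        self.sensor_weights = nn.Parameter(torch.ones(n_sensors))  # one weight per sensor
        self.net = nn.Sequential(
            nn.Linear(n_sensors, 64), nn.ReLU(), nn.Linear(64, n_gestures)
        )

    def forward(self, x):                      # x: (batch, n_sensors)
        return self.net(x * self.sensor_weights)

    def importance(self):
        # Larger |weight| -> the sensor contributes more; this estimate could be
        # combined with an external per-sensor cost before deciding which sensors to drop.
        return self.sensor_weights.abs().detach()
```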
Teeth gestures are becoming an alternative input modality for different situations and accessibility purposes. In this paper, we present TeethTap, a novel eyes-free and hands-free input technique which can recognize up to 13 discrete teeth tapping gestures. TeethTap adopts a wearable 3D-printed earpiece with an IMU sensor and a contact microphone behind each ear, which work in tandem to detect jaw movement and sound data, respectively. TeethTap uses a support vector machine to distinguish gestures from noise by fusing acoustic and motion data, and implements K-Nearest-Neighbor (KNN) classification with a Dynamic Time Warping (DTW) distance measure over the motion data for gesture classification. A user study with 11 participants demonstrated that TeethTap could recognize 13 gestures with a real-time classification accuracy of 90.9% in a laboratory environment. We further uncovered accuracy differences across teeth gestures when placing sensors on a single side versus both sides. Moreover, we explored the activation gesture under real-world conditions, including eating, speaking, walking and jumping. Based on our findings, we further discuss potential applications and practical challenges of integrating TeethTap into future devices.
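A rough sketch of the KNN-with-DTW classification stage described above, under assumed data shapes and not TeethTap's implementation: a query IMU motion sequence is classified by its dynamic-time-warping distance to labeled template sequences.

```python
# 1-nearest-neighbour-style gesture classification with a DTW distance.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic DTW between two (time, channels) sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def knn_dtw_predict(query, templates, labels, k=3):
    dists = np.array([dtw_distance(query, t) for t in templates])
    nearest = np.argsort(dists)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)    # majority vote among the k nearest
```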
Automated Machine Learning (AutoML) is a rapidly growing set of technologies that automate the model development pipeline by searching model space and generating candidate models. A critical, final step of AutoML is human selection of a final model from dozens of candidates. In current AutoML systems, selection is supported only by performance metrics. Prior work has shown that in practice, people evaluate ML models based on additional criteria, such as the way a model makes predictions. Comparison may happen at multiple levels, from types of errors, to feature importance, to how the model makes predictions of specific instances. We developed Model LineUpper to support interactive model comparison for AutoML by integrating multiple Explainable AI (XAI) and visualization techniques. We conducted a user study in which we both evaluated the system and used it as a technology probe to understand how users perform model comparison in an AutoML system. We discuss design implications for utilizing XAI techniques for model comparison and supporting the unique needs of data scientists in comparing AutoML models.
Recommender systems can be used to help users discover novel items and explore new tastes, for example in music genre exploration. However, little work has studied how to improve users’ understanding and acceptance of novel items or how to support users in exploring a new domain. In this paper, we investigate how two different visualizations and mood control influence the perceived control, informativeness and understandability of a music genre exploration tool, and how they improve its helpfulness for new music genre exploration. Specifically, we compare a bar chart visualization used in earlier work to a contour plot which allows users to compare their musical preferences with both the recommended tracks and the new genre. Mood control is implemented with two sliders to set a preferred mood on energy and valence features (which correlate with psychological mood dimensions). In the online user study, mood control was manipulated between subjects, and the visualizations were compared within subjects. During the study (N=102), we measured users’ subjective perceptions, experiences and interactions with the system. Our results show that the contour plot visualization is perceived as more helpful for exploring new genres than the bar chart visualization, as the contour plot is perceived to be more informative and understandable. Users spent significantly more time and used the mood control more with the contour plot than with the bar chart visualization. Overall, our results show that the contour plot visualization combined with mood control is the most helpful configuration for new music genre exploration, because the mood control is easier to understand and use when made transparent via an informative visualization.
Texting relies on screen-centric prompts designed for sighted users, still posing significant barriers to people who are blind and visually impaired (BVI). Can we re-imagine texting untethered from a visual display? In an interview study, 20 BVI adults shared situations surrounding their texting practices, recurrent topics of conversations, and challenges. Informed by these insights, we introduce TextFlow: a mixed-initiative context-aware system that generates entirely auditory message options relevant to the users’ location, activity, and time of the day. Users can browse and select suggested aural messages using finger-taps supported by an off-the-shelf finger-worn device, without having to hold or attend to a mobile screen. In an evaluative study, 10 BVI participants successfully interacted with TextFlow to browse and send messages in screen-free mode. The experiential response of the users shed light on the importance of bypassing the phone and accessing rapidly controllable messages at their fingertips while preserving privacy and accuracy with respect to speech or screen-based input. We discuss how non-visual access to proactive, contextual messaging can support the blind in a variety of daily scenarios.
In this paper, we design novel interactive deep learning methods to improve semantic interactions in visual analytics applications. The ability of semantic interaction to infer analysts’ precise intents during sensemaking is dependent on the quality of the underlying data representation. We propose the DeepSIfinetune framework that integrates deep learning into the human-in-the-loop interactive sensemaking pipeline, with two important properties. First, deep learning extracts meaningful representations from raw data, which improves semantic interaction inference. Second, semantic interactions are exploited to fine-tune the deep learning representations, which then further improves semantic interaction inference. This feedback loop between human interaction and deep learning enables efficient learning of user- and task-specific representations. To evaluate the advantage of embedding the deep learning within the semantic interaction loop, we compare DeepSIfinetune against a state-of-the-art but more basic use of deep learning as only a feature extractor pre-processed outside of the interactive loop. Results of two complementary studies, a human-centered qualitative case study and an algorithm-centered simulation-based quantitative experiment, show that DeepSIfinetune more accurately captures users’ complex mental models with fewer interactions.
Organizing fingerings, i.e., choosing which fingers to press on which positions and strings, is a crucial step in playing the violin. As violin fingering comprises several components, the mapping of a musical phrase to a fingering arrangement is not unique, and organizing adequate fingerings requires comprehensive musical knowledge. In this paper, we study a human-machine cooperative approach to the generation of violin fingering, aiming to build an intelligent system which can provide multiple generation paths and yield adaptable fingering arrangements. To this end, we compile a new dataset with fingering annotations from multiple versions of performance, propose a deep neural network conditioned on left-hand movement for fingering generation, and conduct an in-depth user study to collect detailed responses. Results show that the proposed system can yield various fingering arrangements according to different performance requirements, though a single generation may not satisfy all the requirements at once. This highlights the importance of a multi-path and human-in-the-loop architecture for violin fingering generation.
Synthetic data generation to improve classification performance (data augmentation) is a well-studied problem. Recently, generative adversarial networks (GAN) have shown superior image data augmentation performance, but their suitability in gesture synthesis has received inadequate attention. Further, GANs prohibitively require simultaneous generator and discriminator network training. We tackle both issues in this work. We first discuss a novel, device-agnostic GAN model for gesture synthesis called DeepGAN. Thereafter, we formulate DeepNAG by introducing a new differentiable loss function based on dynamic time warping and the average Hausdorff distance, which allows us to train DeepGAN’s generator without requiring a discriminator. Through evaluations, we compare the utility of DeepGAN and DeepNAG against two alternative techniques for training five recognizers using data augmentation over six datasets. We further investigate the perceived quality of synthesized samples via an Amazon Mechanical Turk user study based on the HYPE∞ benchmark. We find that DeepNAG outperforms DeepGAN in accuracy, training time (up to 17 × faster), and realism, thereby opening the door to a new line of research in generator network design and training for gesture synthesis. Our source code is available at https://www.deepnag.com.
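A hedged sketch of one ingredient of the loss described above: a differentiable average Hausdorff distance between a generated gesture and a real gesture, each treated as a sequence of points. In DeepNAG this term is combined with a soft-DTW term, which is omitted here; shapes and usage are illustrative.

```python
# Differentiable average Hausdorff distance between two gesture point sequences.
import torch

def average_hausdorff(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """x: (n, d), y: (m, d) gesture point sequences."""
    d = torch.cdist(x, y)                      # pairwise distances, shape (n, m)
    # Mean of closest-point distances in both directions; fully differentiable,
    # so it can drive a generator without a discriminator network.
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```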
Engineering students need practical, open-ended problems to help them build their problem-solving skills and design abilities. However, large class sizes create a grading challenge for instructors, as there is simply not enough time nor support to provide adequate feedback on many design problems. In this work, we describe an intelligent user interface that provides automated real-time feedback on hand-drawn free body diagrams and is capable of analyzing the internal forces of a sketched truss to evaluate open-ended design problems. The system is driven by sketch recognition algorithms developed for recognizing trusses and a robust linear algebra approach for analyzing trusses. Students in an introductory statics course were assigned a truss design problem as a homework assignment using either paper or our software. We used conventional content analysis on four focus groups totaling 16 students to identify key aspects of their experiences with the design problem and our software. We found that the software correctly analyzed all student submissions, students enjoyed the problem compared to typical homework assignments, and students found the problem to be good practice. Additionally, students using our software reported less difficulty understanding the problem, and the majority of all students said they would prefer the software approach over pencil and paper. We also evaluated the recognition performance on a set of 3000 sketches, resulting in an F-score of 0.997. We manually reviewed the submitted student work, which showed that the handful of student complaints about recognition were largely due to user error.
Steady-State Visually Evoked Potential (SSVEP) Brain-Computer Interfaces (BCIs) make use of flickering stimuli to determine the target a user is looking at and select commands accordingly. These types of BCI can be operated with little to no training, achieve high classification accuracies and are robust in application. A drawback of this approach is the reduced user comfort due to the constant flickering of the stimuli, which can be annoying and tiring to look at. Existing studies addressing this issue try to make use of motion to disguise the oscillating patterns. However, this makes them look abstract and restricts the design of those applications, as such patterns do not blend into conventional user interfaces. In this work we introduce the concept of spinning icons to evoke SSVEPs. The icons rotate at a certain frequency around their vertical axis and are intended to appear more natural and be less straining for the human eye. Furthermore, this concept is not bound to any kind of abstract motion-based pattern but is expected to work with any type of icon or image. The newly designed stimuli were evaluated in an application-oriented scenario and compared to standard and state-of-the-art movement-based SSVEP stimuli regarding classification accuracy and experienced visual fatigue. The results show that the newly created stimuli performed equally well, and partially even better, in terms of classification accuracy and were consistently rated better concerning visual fatigue by the study participants. This work therefore lays the foundation for more comfortable SSVEP-BCIs which can be used with essentially any icon or UI element spinning around its vertical axis.
Human-in-the-loop machine learning is widely used in artificial intelligence (AI) to elicit labels for data points from experts or to obtain feedback on how close the predicted results are to the target. However, this simplifies away all the details of the expert’s decision-making process. In this work, we allow the experts to additionally produce decision rules describing their decision-making; the rules are expected to be imperfect but to give additional information. In particular, the rules can extend to new distributions and hence enable significantly improved performance in cases where the training and testing distributions differ, such as in domain adaptation. We apply the proposed method to lifelong learning and domain adaptation problems and discuss applications in other branches of AI, such as knowledge acquisition problems in expert systems. In simulated and real-user studies, we show that decision rule elicitation improves domain adaptation of the algorithm and helps to propagate the expert’s knowledge to the AI model.
As videos progressively take a central role in conveying information on the Web, current linear consumption methods, which involve spending time proportional to the duration of the video, need to be revisited. In this work, we present NoVoExp, a method that enables a Non-linear Video Consumption Experience by generating a sequence of multimodal fragments that represents the content of different segments of a video in a succinct fashion. These fragments aid understanding of the content of the video without watching it in its entirety and serve as pointers to different segments of the video, enabling a new mechanism to consume videos. We design several baselines by building on top of video captioning and video summarization work to understand the relative advantages and disadvantages of NoVoExp, and compare performance across video durations (short, medium, long) and categories (entertainment, lectures, tutorials). We observe that the sequences of multimodal fragments generated by NoVoExp have higher relevance to the video and are more diverse yet coherent. Our extensive evaluation using automated metrics and human studies shows that our fragments are not only good at representing the contents of the video but also align well with targeted viewer preferences.
Human Computer Interaction can be impeded by various interaction obstacles, impacting a user’s perception or cognition. In this work, we detect and discriminate such interaction obstacles from different data modalities to compensate for them through User Interface (UI) adaptation. For example, we detect memory-based obstacles from brain activity and compensate through repetition of information in the UI; we detect visual obstacles from user behavior and compensate by complementing visual with auditory information in the UI. Online cognitive adaptive systems should be able to decide the most suitable UI adaptation given inputs from several obstacles detectors. In this paper, we employ a Bayesian fusion approach upon different underlying obstacles detectors over multiple consecutive interaction sessions. Experimental results show that the model promisingly outperforms the baseline in the first interaction with an average accuracy of 72.5% and further improves drastically in subsequent interactions with additional information, with an average accuracy of 98%.
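A simplified sketch of the Bayesian fusion idea described above, not the paper's exact model: the belief over which interaction obstacle is present is updated after each session by multiplying in each detector's likelihood, and the resulting posterior is carried into the next session. The obstacle set and likelihood values are illustrative assumptions.

```python
# Bayesian fusion of multiple obstacle detectors across interaction sessions.
import numpy as np

OBSTACLES = ["memory", "visual", "none"]

def fuse(prior: np.ndarray, detector_likelihoods: list) -> np.ndarray:
    """prior: P(obstacle); each likelihood: P(detector output | obstacle)."""
    posterior = prior.copy()
    for lik in detector_likelihoods:
        posterior *= lik
    return posterior / posterior.sum()

belief = np.array([1 / 3, 1 / 3, 1 / 3])            # flat prior before session 1
session_evidence = [np.array([0.7, 0.2, 0.1]),      # e.g. EEG-based memory-obstacle detector
                    np.array([0.4, 0.5, 0.1])]      # e.g. behaviour-based visual-obstacle detector
belief = fuse(belief, session_evidence)             # posterior carried into the next session
```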
As the volume of content and the connectivity of social media have grown, snackable content has increasingly become an enjoyable and engaging way to share content. Snackable content is a shortened form of original content focusing on a single theme or motif for entertainment and quick understanding of a video moment. For content owners with a large library of long-form content (movies, television series, documentaries, etc.), one challenge in producing snackable content for social media use is the correct identification and cutting of interesting regions. Related problems have been studied for algorithmic discovery of content for movie trailers, short-duration meme content, and medium-duration news stories, but none of these approaches included user preferences as explicit drivers for cuts. This paper analyzes both human and automatic methods for creating snackable clips across different categories of content with two comprehensive user studies. Contrary to initial expectations, findings amongst the surveyed population indicate a preference for slightly longer snackable clips (60-90 seconds) and those that began or ended with a human character.
The aim of this study was to determine whether reinforcement learning could increase user engagement in interactive art installations. Building on The Plants [1], a physical interactive art installation by Playable Streets, we integrated reinforcement learning into a web application adapted from the physical piece. The original installation consisted of real plants that visitors could touch to produce sounds. A digital model and interface were developed to simulate the physical installation. A user study was conducted with 178 participants. Three modes were examined: the original settings of the installation as designed by the artist; a predetermined fixed schedule of consistently changing sound banks; and a reinforcement learning mode, where an agent changes the interactive behaviours to maximise user engagement. User engagement was estimated by comparing the number of touches by an individual over successive time intervals.
From the trial, it was found that reinforcement learning was able to improve average engagement levels of users by nearly 27%. However, reinforcement learning was not able to increase the average duration users interacted with the installation.
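An illustrative sketch, not the study's agent, of one simple way to realize the reinforcement-learning mode described above: an epsilon-greedy bandit picks a sound-bank configuration each interval and uses the change in the visitor's touch count as the engagement reward. All names and the reward definition are assumptions for illustration.

```python
# Epsilon-greedy bandit over sound-bank configurations, rewarded by touch-count change.
import random

class EngagementBandit:
    def __init__(self, n_configs: int, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_configs
        self.values = [0.0] * n_configs        # running mean reward per configuration

    def choose(self) -> int:
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))   # explore a random configuration
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, config: int, touches_now: int, touches_prev: int):
        reward = touches_now - touches_prev    # engagement proxy: change in touches
        self.counts[config] += 1
        self.values[config] += (reward - self.values[config]) / self.counts[config]
```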
Creating a high-quality layout design from scratch is difficult for novices. Therefore, novices often consult the works of other skilled designers for ideas regarding layout designs. Researchers have previously investigated methods to support the layout design process; these works mainly focused on retrieval methods for similar layout designs or refinement of existing layouts. To enhance user creativity in designing layouts, assistance is needed for exploring various designs. Herein, we propose a novel deep generative model that enables the generation of various layout designs and guarantees continuous and gradual changes in layouts, for effectively exploring graphic designs. Accordingly, we present an adversarial training method with dual critic networks and train our model on a public graphic design dataset. We also developed an interaction method that allows the user to change graphic designs between two different layout styles and categories parametrically. We demonstrate the efficacy of the proposed method in generating rich layout variations and a meaningful latent-space representation by comparing the layout designs generated by our model with those generated by an existing model.
As music tempo can influence the pace of human actions, we use tempo variations to vary the difficulty in gaming situations based on the synchronization of events, like actions on moving objects in arcade games or waves of enemies in first-person shooters. In this work, musical tempi are exploited to hinder or help Tetris players. Over the first phase of this work, involving 44 players and more than 230 Tetris games, we discovered surprising interactions between different tempo characteristics influencing the player’s performance. The positive or negative effects of the specific tempo settings we discovered were validated in a second phase involving 19 players and 50 Tetris games. Once an effect was chosen, it was dynamically triggered according to conditions that were validated during the first phase of this work. Results show that a transition from a staircase increase to a more gradual increase in tempo significantly hinders Tetris players when both tempi are synchronous with the gameplay, whereas the same transition helps players when both tempi are not synchronized with game actions. Our approach provides new and valuable insight into varying video-game difficulty when gaming situations ask the player to increasingly synchronize with the pace of the game.
Software tools for media production have largely been adapted from physical media paradigms, offering blank canvases upon which to import, combine, and process content. In music production, this increasingly involves meticulous manual assembly of audio clips often carefully curated from diverse sources. As collections of audio content scale upwards in sample size, diversity, and number, creative projects require exponentially more time, effort, and attention to effectively shape them. New tools must find new ways to contend with this abundance of content. We propose the Sound Sketchpad, an algorithm-in-the-loop audio-graphical system and interface for combining sounds from a database into new music. It allows a user to sketch broad musical ideas by making sound, and then interactively modify and refine the resulting composition by drawing visual paths. We discuss the design, implementation, and advantages of this approach.
Fairness is an important aspect of group recommender systems (GRSs). They must ensure that the potentially diverse preferences of all group members are taken into consideration when providing recommendations. Previous work has proposed a number of conflict elicitation and merging techniques to produce preferable recommendations for group members. However, we have yet to understand the influence of user personality on the perception of fairness in GRSs. To examine this gap, we use music recommendation as an example domain. We developed a web-based group music recommender system using the Spotify API and two simple ranking algorithms: one based on the time the songs were voted for by users (time-based) and the other based on a dissimilarity score (dissimilarity-based). A within-subjects experiment was conducted with 45 participants divided into groups of 3 (15 groups). Results showed that the openness personality trait has a negative correlation with the perception that fairness is important in groups.
Unintended consequences of deployed AI systems fueled the call for more interpretability in AI systems. Often explainable AI (XAI) systems provide users with simplifying local explanations for individual predictions but leave it up to them to construct a global understanding of the model behavior. In this work, we examine if non-technical users of XAI fall for an illusion of explanatory depth when interpreting additive local explanations. We applied a mixed methods approach consisting of a moderated study with 40 participants and an unmoderated study with 107 crowd workers using a spreadsheet-like explanation interface based on the SHAP framework. We observed what non-technical users do to form their mental models of global AI model behavior from local explanations and how their perception of understanding decreases when it is examined.
This paper contributes to the growing literature on empirical evaluation of explainable AI (XAI) methods by presenting a comparison of the effects of a set of established XAI methods on AI-assisted decision making. Specifically, based on our review of previous literature, we highlight three desirable properties that ideal AI explanations should satisfy—improve people’s understanding of the AI model, help people recognize the model uncertainty, and support people’s calibrated trust in the model. Through randomized controlled experiments, we evaluate whether four types of common model-agnostic explainable AI methods satisfy these properties on two types of decision making tasks in which people perceive themselves as having different levels of domain expertise (i.e., recidivism prediction and forest cover prediction). Our results show that the effects of AI explanations differ largely across decision making tasks in which people have varying levels of domain expertise, and many AI explanations do not satisfy any of the desirable properties for tasks in which people have little domain expertise. Further, for decision making tasks about which people are more knowledgeable, feature contribution explanations satisfy more desiderata of AI explanations, while the explanation that is considered to resemble how humans explain decisions (i.e., counterfactual explanation) does not seem to improve calibrated trust. We conclude by discussing the implications of our study for improving the design of XAI methods to better support human decision making.
Algorithms often appear as ‘black boxes’ to non-expert users. While prior work focuses on explainable representations and expert-oriented exploration, we propose and study an interactive approach using question answering to explain deterministic algorithms to non-expert users who need to understand the algorithms’ internal states (students learning algorithms, operators monitoring robots, admins troubleshooting network routing). We construct XAlgo—a formal model that first classifies the type of question based on a taxonomy and generates an answer based on a set of rules that extract information from representations of an algorithm’s internal states, the pseudocode. A design probe based on an algorithm learning scenario with 18 participants (9 for a Wizard-of-Oz XAlgo and 9 as a control group) reports findings and design implications based on what kinds of questions people ask, how well XAlgo responds, and what remain as challenges to bridge users’ gulf of algorithm understanding.
EXplainable Artificial Intelligence (XAI) approaches are used to bring transparency to machine learning and artificial intelligence models, and hence, improve the decision-making process for their end-users. While these methods aim to improve human understanding and their mental models, cognitive biases can still influence a user’s mental model and decision-making in ways that system designers do not anticipate. This paper presents research on cognitive biases due to ordering effects in intelligent systems. We conducted a controlled user study to understand how the order of observing system weaknesses and strengths can affect the user’s mental model, task performance, and reliance on the intelligent system, and we investigate the role of explanations in addressing this bias. Using an explainable video activity recognition tool in the cooking domain, we asked participants to verify whether a set of kitchen policies are being followed, with each policy focusing on a weakness or a strength. We controlled the order of the policies and the presence of explanations to test our hypotheses. Our main finding shows that those who observed system strengths early-on were more prone to automation bias and made significantly more errors due to positive first impressions of the system, while they built a more accurate mental model of the system competencies. On the other hand, those who encountered weaknesses earlier made significantly fewer errors since they tended to rely more on themselves, while they also underestimated model competencies due to having a more negative first impression of the model. Our work presents strong findings that aim to make intelligent system designers aware of such biases when designing such tools.
In this paper, we investigate whether information related to the touches and rotations imparted to an object can be effectively used to classify the emotion of the agent manipulating it. We specifically focus on sequences of basic actions (e.g., grasping, rotating), which are constituents of daily interactions. We use the iCube, a 5 cm cube covered with tactile sensors and embedded with an accelerometer, to collect a new dataset in which 11 persons perform action sequences associated with 4 emotions: anger, sadness, excitement and gratitude. Next, we propose 17 high-level hand-crafted features based on the tactile and kinematic data derived from the iCube. Twelve of these features vary significantly as a function of the emotional context in which the action sequence was performed. In particular, a larger surface of the object is engaged in physical contact for anger and excitement than for sadness. Furthermore, the average duration of interactions labeled as sad is longer than for the remaining 3 emotions. More rotations are performed for anger and excitement than for sadness and gratitude. The accuracy of a classification experiment on the four emotions reaches 0.75. This result shows that emotion recognition during hand-object interactions is possible and may foster the development of new intelligent user interfaces.
This work proposes a new method for guiding a user’s attention towards objects of interest in a cyber-physical environment (CPE). CPEs are environments that contain several computing systems that interact with each other and with the physical world. These environments contain several sensors (cameras, eye trackers, etc.) and output devices (lamps, screens, speakers, etc.). These devices can be used first to track the user’s position, orientation, and focus of attention, and then to find the most suitable output device to guide the user’s attention towards a target object. We argue that the most suitable device in this context is the one that attracts attention closest to the target and is salient enough to capture the user’s attention. The method is implemented as a function which estimates the “closeness” and “salience” of each visual and auditory output device in the environment. Some parameters of this method are then evaluated through a user study in the context of a virtual reality supermarket. The results show that multi-modal guidance can lead to better guiding performance; however, this depends on the chosen parameters.
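A minimal sketch, assuming a simple weighted score rather than the paper's actual scoring function, of ranking output devices by how close to the target they can attract attention and how salient they are; the weighting, the visibility term, and the device model are illustrative.

```python
# Score each output device by closeness to the target and salience, then pick the best one.
import numpy as np

def device_score(device_pos, target_pos, salience, user_gaze_dir, device_dir,
                 w_close=0.6, w_salient=0.4):
    closeness = 1.0 / (1.0 + np.linalg.norm(np.asarray(device_pos) - np.asarray(target_pos)))
    # Devices outside the user's field of view contribute less (assumed visibility term).
    visibility = max(0.0, float(np.dot(user_gaze_dir, device_dir)))
    return w_close * closeness + w_salient * salience * visibility

def pick_device(devices, target_pos, user_gaze_dir):
    return max(devices, key=lambda d: device_score(
        d["pos"], target_pos, d["salience"], user_gaze_dir, d["dir"]))
```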
Rich multi-modal information – text, code, images, categorical and numerical data – co-exists in the user interface (UI) design of mobile applications. UI designs are composed of UI entities supporting different functions which together enable the application. To support effective search and recommendation applications over mobile UIs, we need to be able to learn UI representations that integrate latent semantics. In this paper, we propose a novel unsupervised model – the Multi-modal Attention-based Attributed Network Embedding (MAAN) model. MAAN is designed to capture both multi-modal and structural network information. Based on the encoder-decoder framework, MAAN aims to learn UI representations that allow UI design reconstruction. The generated embeddings can be applied to a variety of tasks: predicting UI elements associated with UI screens, inferring missing UI screen and element attributes, predicting UI user ratings, and retrieving UIs. Extensive experiments, including user evaluations, conducted on two datasets from RICO, a rich real-world mobile UI repository, demonstrate that MAAN outperforms other state-of-the-art models.
Distracted driving is a leading cause of accidents worldwide. The tasks of distraction detection and recognition have traditionally been addressed as computer vision problems. However, distracted behaviors are not always expressed in a visually observable way. In this work, we introduce a novel multimodal dataset of distracted driver behaviors, consisting of data collected through twelve information channels spanning visual, acoustic, near-infrared, thermal, physiological, and linguistic modalities. The data were collected from 45 subjects while they were exposed to four different distractions (three cognitive and one physical). For the purposes of this paper, we experiment with visual and physiological information and explore the potential of multimodal modeling for distraction recognition. In addition, we analyze the value of different modalities by identifying the specific visual and physiological groups of features that contribute the most to distraction characterization. Our results highlight the advantage of multimodal representations and reveal valuable insights into the role played by the two modalities in identifying different types of driving distractions.
Reliable, efficient shared autonomy requires balancing human operation and robot automation on complex tasks, such as dexterous manipulation. Adding to the difficulty of shared autonomy is a robot’s limited ability to perceive the 6 degree-of-freedom pose of objects, which is essential for performing the manipulations those objects afford. Inspired by Monte Carlo Localization, we propose a generative human-in-the-loop approach to estimating object pose. We characterize the performance of our mixed-initiative 3D registration approach using 2D pointing devices via a user study. Seeking an analog to Fitts’s Law for 3D registration, we introduce a new evaluation framework that takes the entire registration process into account instead of only the outcome. When combined with estimates of registration confidence, we posit that mixed-initiative registration will reduce the human workload while maintaining or even improving final pose estimation accuracy.
Labeling data is an important step in the supervised machine learning lifecycle. It is a laborious human activity comprising repeated decisions: the human labeler decides which of several potential labels to apply to each example. Prior work has shown that providing AI assistance can improve the accuracy of binary decision tasks. However, the role of AI assistance in more complex data-labeling scenarios with a larger set of labels has not yet been explored. We designed an AI labeling assistant that uses a semi-supervised learning algorithm to predict the most probable labels for each example. We leverage these predictions to provide assistance in two ways: (i) providing a label recommendation and (ii) reducing the labeler’s decision space by focusing their attention on only the most probable labels. We conducted a user study (n=54) to evaluate an AI-assisted interface for data labeling in this context. Our results highlight that the AI assistance improves both labeler accuracy and speed, especially when the labeler finds the correct label in the reduced label space. We discuss findings related to the presentation of AI assistance and design implications for intelligent labeling interfaces.
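A minimal sketch of this kind of assistance, using scikit-learn's LabelSpreading as a stand-in semi-supervised learner on synthetic data; the dataset, kernel settings, and value of k are assumptions, not the study's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

# Hypothetical pool: 500 examples, 8 classes, only 60 labelled (-1 marks unlabelled).
X, y_true = make_classification(n_samples=500, n_features=20, n_informative=10,
                                n_classes=8, n_clusters_per_class=1, random_state=0)
y = np.full(500, -1)
labelled = np.random.default_rng(0).choice(500, size=60, replace=False)
y[labelled] = y_true[labelled]

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
proba = model.predict_proba(X)

def assist(example_idx, k=3):
    """Return a single recommendation plus a reduced, k-label decision space."""
    order = np.argsort(proba[example_idx])[::-1][:k]
    top_k = model.classes_[order]
    return {"recommended": int(top_k[0]), "reduced_space": [int(c) for c in top_k]}

print(assist(123))
```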
Generative models have become adept at producing artifacts such as images, videos, and prose at human-like levels of proficiency. New generative techniques, such as unsupervised neural machine translation (NMT), have recently been applied to the task of generating source code, translating it from one programming language to another. The artifacts produced in this way may contain imperfections, such as compilation or logical errors. We examine the extent to which software engineers would tolerate such imperfections and explore ways to aid the detection and correction of those errors. Using a design scenario approach, we interviewed 11 software engineers to understand their reactions to the use of an NMT model in the context of application modernization, focusing on the task of translating source code from one language to another. Our three-stage scenario sparked discussions about the utility and desirability of working with an imperfect AI system, how acceptance of that system’s outputs would be established, and future opportunities for generative AI in application modernization. Our study highlights how UI features such as confidence highlighting and alternate translations help software engineers work with and better understand generative NMT models.
Affinity diagramming is a crucial yet time-consuming part of user research in human-centered design. In short, building affinity diagrams involves the hierarchical bottom-up clustering of user statements and observations, which later allows design teams to derive insights and inspire design ideas. To support designers in this process, as a first contribution, we explored seven text-mining models for pre-clustering affinity notes and suggest fastText as the most appropriate. Since affinity diagrams are not deterministic, there is no established measure to assess their quality. Our second contribution is, therefore, a thorough examination of the potential of fastText clusters for design teams regarding technical, psychological, and performance-related measures. Compared to reference ‘human-built’ affinity diagrams, the fastText clusters resulted in an overlap index of M = .694 (SD = .034). Surprisingly, a study with four design teams clustering small sets (112 notes) of pre-clustered or randomized affinity notes indicated an increased discussion overhead caused by the algorithmic support, which led to a decrease in both efficiency and quality. As a third contribution, we report qualitative data from the instances where the algorithmic support failed designers’ expectations. We conclude that more research on the appropriate time and manner of presenting pre-clustered data is required to harness the full potential of algorithmic support while preserving the spirit of affinity diagramming.
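A hypothetical pre-clustering pipeline along these lines: embed each affinity note with pretrained fastText vectors, then group the notes with agglomerative clustering. The model file name, distance threshold, and linkage are illustrative assumptions (and the pretrained vectors are assumed to be downloaded), not the configuration evaluated in the paper.

```python
import fasttext
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Pretrained English fastText vectors, assumed to be downloaded locally.
ft = fasttext.load_model("cc.en.300.bin")

notes = [
    "I never know which settings menu to open",
    "The settings are hidden too deep",
    "Export to PDF takes forever",
    "Exporting large files is slow",
]
vectors = np.vstack([ft.get_sentence_vector(n) for n in notes])

# Group semantically similar notes; threshold and linkage are illustrative choices.
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0,
                                     metric="cosine", linkage="average")
labels = clustering.fit_predict(vectors)
for label in sorted(set(labels)):
    print(f"cluster {label}:", [n for n, l in zip(notes, labels) if l == label])
```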
AI security researchers have identified a new way crowdsourced data can be intentionally compromised. Backdoor attacks are a process through which an adversary creates a vulnerability in a machine learning model by “poisoning” the training set, selectively mislabelling images containing a backdoor object. The model continues to perform well on standard testing data but misclassifies inputs that contain the backdoor chosen by the adversary. In this paper, we present the design and development of the Backdoor Game, the first game in which users can interact with different poisoned classifiers and upload their own images containing backdoor objects in an engaging way. We conducted semi-structured interviews with eight participants who interacted with a first version of the Backdoor Game and deployed the game to Mechanical Turk users (N=68) to demonstrate how users interacted with the backdoor objects. We present results including novel types of interactions that emerged from game play and design recommendations for improving the system. The combined design, development, and deployment of our system can help AI security researchers study this emerging concept, from determining the effectiveness of different backdoor objects to compiling a collection of diverse and unique backdoor objects from the public, increasing the safety of future AI systems.
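The poisoning mechanism itself can be illustrated in a few lines: stamp a trigger patch onto a small fraction of training images and relabel them with the adversary's target class. The trigger, poisoning rate, and data below are hypothetical stand-ins for the backdoor objects used in the Backdoor Game.

```python
import numpy as np

def add_trigger(image, size=3):
    """Stamp a small white square (a stand-in 'backdoor object') into one corner."""
    poisoned = image.copy()
    poisoned[-size:, -size:] = 1.0
    return poisoned

def poison_dataset(images, labels, target_class, rate=0.05, seed=0):
    """Selectively mislabel a fraction of trigger-stamped images as the target class."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    for i in idx:
        images[i] = add_trigger(images[i])
        labels[i] = target_class
    return images, labels

# Hypothetical 28x28 grayscale data; a model trained on the poisoned set will tend to
# classify trigger-stamped inputs as `target_class` while behaving normally otherwise.
X = np.random.default_rng(1).random((1000, 28, 28))
y = np.random.default_rng(2).integers(0, 10, size=1000)
X_poisoned, y_poisoned = poison_dataset(X, y, target_class=7)
```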
Today, many students learn to speak a foreign language by listening to and repeating pre-recorded materials due to the lack of practice opportunities with human partners. Leveraging recent advancements in AI, speech, and NLP, we developed EnglishBot, a language learning chatbot that converses with students interactively on college-related topics and provides adaptive feedback. We evaluated EnglishBot against a traditional listen-and-repeat interface with 56 Chinese college students through two six-day user studies under both voluntary and fixed-usage conditions. Under the voluntary condition, students’ fluency improved more with EnglishBot, as evaluated by the IELTS grading standard. EnglishBot users also showed higher engagement and voluntarily spent 2.1 times more time interacting with EnglishBot. Our results suggest that conversational interfaces may benefit foreign-language learners’ oral language learning, particularly in casual learning settings.
The paper presents a novel model-based method for intelligent tutoring, with particular emphasis on the problem of selecting teaching interventions in interaction with humans. Whereas previous work has focused on either personalization of teaching or optimization of teaching intervention sequences, the proposed individualized model-based planning approach represents a convergence of these two lines of research. Model-based planning picks the best interventions via interactive learning of a user memory model’s parameters. The approach is novel in its use of a cognitive model that can account for several key individual- and material-specific characteristics related to recall/forgetting, along with a planning technique that considers users’ practice schedules. Taking a rule-based approach as a baseline, the authors evaluated the method’s benefits in a controlled study of artificial teaching in second-language vocabulary learning (N = 53).
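A minimal sketch of model-based intervention selection, assuming a simple exponential forgetting model and a greedy planner; the paper's cognitive model and planning technique are considerably richer, and the item parameters below are placeholders that would normally be learned interactively per user and per item.

```python
import math

def recall_probability(elapsed_s, strength):
    """Exponential forgetting: predicted recall decays with time since last practice."""
    return math.exp(-elapsed_s / max(strength, 1e-6))

def pick_next_item(items, now_s, threshold=0.9):
    """Greedily choose the word whose predicted recall has dropped the most.

    `items` maps a word to its last practice time and a memory strength; in the
    paper these parameters are inferred from the user's responses over time.
    """
    scored = {w: recall_probability(now_s - s["last_practice_s"], s["strength"])
              for w, s in items.items()}
    due = {w: p for w, p in scored.items() if p < threshold}
    pool = due or scored
    return min(pool, key=pool.get)

vocabulary = {
    "katze": {"last_practice_s": 0.0,   "strength": 600.0},
    "hund":  {"last_practice_s": 300.0, "strength": 1800.0},
    "vogel": {"last_practice_s": 500.0, "strength": 240.0},
}
print(pick_next_item(vocabulary, now_s=900.0))  # selects the weakest item ("vogel")
```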
Programming-by-example (PBE) can be a powerful tool for reducing manual work in repetitive data transformation tasks. However, a small number of examples often leaves ambiguity and may cause the system to produce undesirable data transformations. This ambiguity can be resolved by allowing the user to directly edit the synthesized programs; however, this is difficult for non-programmers. Here, we present a novel approach: data-centric disambiguation for data transformation, where users resolve the ambiguity in data transformation by examining and modifying the output rather than the program. The key idea is to focus on the given set of data the user wants to transform instead of pursuing the synthesized program’s generality or completeness. Our system provides visualization and interaction methods that allow users to efficiently examine and fix the transformed outputs, which is much simpler than understanding and modifying the program itself. Our user study suggests that our system can successfully help non-programmers process data more easily and efficiently.
The number of scholarly publications grows steadily every year, and it becomes harder to find, assess, and compare scholarly knowledge effectively. Scholarly knowledge graphs have the potential to address these challenges. However, creating such graphs remains a complex task. We propose a method to crowdsource structured scholarly knowledge from paper authors with a web-based user interface supported by artificial intelligence. The interface enables authors to select key sentences for annotation. It integrates multiple machine learning algorithms to assist authors during the annotation, including class recommendation and key sentence highlighting. We envision that the interface is integrated into paper submission processes, for which we define three main requirements that the task has to fulfill. We evaluated the interface with a user study in which participants were assigned the task of annotating one of their own articles. With the resulting data, we determined whether the participants were able to perform the task successfully. Furthermore, we evaluated the interface’s usability and the participants’ attitudes towards the interface with a survey. The results suggest that sentence annotation is a feasible task for researchers and that they do not object to annotating their articles during the submission process.
Knowledge graphs (KGs) have been widely used in recommender systems to leverage high-order connections between users and items. Typically, KGs are constructed based on semantic information derived from metadata. However, item images are also highly useful, especially in domains where visual factors are influential, such as fashion. In this paper, we propose an approach to augmenting KGs with visual information extracted by widely used image feature extraction methods. Specifically, we introduce visually-augmented KGs in which the extracted information is integrated through visual factor entities and visual relations. Moreover, to leverage the augmented KGs, a user representation learning approach is proposed to learn hybrid user profiles that combine both semantic and visual preferences. The proposed approaches have been applied to top-N recommendation tasks on two real-world datasets. The results show that the augmented KGs and the representation learning approach can improve recommendation performance. They also show that the augmented KGs are applicable to state-of-the-art KG-based recommender systems.
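A small sketch of one way such visual augmentation could look: cluster pretrained image embeddings into “visual factor” entities and link each item to its factor through a visual relation. The embeddings, number of factors, and relation name are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder item image embeddings; in practice these would come from a widely
# used image feature extractor (e.g., a pretrained CNN).
rng = np.random.default_rng(0)
item_ids = [f"item_{i}" for i in range(200)]
embeddings = rng.normal(size=(200, 512))

# Quantize the visual space into K "visual factor" entities.
K = 10
factors = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(embeddings)

# Augment the KG with (item, has_visual_factor, visual_factor_k) triples.
visual_triples = [(item, "has_visual_factor", f"visual_factor_{k}")
                  for item, k in zip(item_ids, factors)]
print(visual_triples[:3])
```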
Dialogue-based conversational recommender systems allow users to give language-based feedback on recommended items, which has great potential for supporting users in exploring the space of recommendations through conversation. In this work, we consider incorporating critiquing techniques into conversational systems to facilitate users’ exploration of music recommendations. We have developed a music chatbot with three system variants, each featuring a different critiquing technique: user-initiated critiquing (UC), progressive system-suggested critiquing (Progressive SC), and cascading system-suggested critiquing (Cascading SC). We conducted a between-subjects study (N=107) to compare these three systems with regard to music exploration in terms of user perception and user interaction. Results show that both UC and SC are useful for music exploration, while users perceive higher diversity of recommendations with the system that offers Cascading SC and more serendipity with the system that offers Progressive SC. In addition, we find that the critiquing techniques significantly moderate the relationships between some interaction metrics (e.g., number of listened songs, number of dialogue turns) and users’ perceived helpfulness and serendipity during music exploration.
The promise of anyone being able to 3D print anywhere relies on both technological advances and incremental shifts in social organizations to trigger changes in human behavior. While much research has focused on how people learn aspects of predefined printing processes, such as expressively utilizing particular design software (e.g., CAD) and fabrication machinery (e.g., 3D printers), this work explores how anyone may gain an understanding of what can be 3D printed through dynamic processes in computationally guided exploration of online resources and 3D printing facilities. Investigations of online printing services reveal accessible 3D printing processes that do not require end-users to have experience with design software or fabrication machinery, only requiring them to specify printable ideas. We present these accessible printing processes alongside associated technologies in a meta-design framework for supporting end-users’ specification of 3D printing ideas. Informed by this framework and a series of formative studies, we designed the website HowDIY to introduce anyone to 3D printing by encouraging and facilitating the intelligent exploration of various online resources. HowDIY was deployed over several weeks with diverse newcomers to 3D printing, validating that intelligent user interfaces can support anyone in participating in the utilization and design of 3D printing tools and processes.
Designing useful human-AI interaction for clinical workflows remains challenging despite the impressive performance of recent AI models. One specific difficulty is a lack of successful examples demonstrating how to achieve safe and efficient workflows while mitigating AI imperfections. In this paper, we present an interactive AI-powered visual search tool that supports pathologists in cancer assessments. Our evaluation with six pathologists demonstrates that it can 1) reduce the time needed while maintaining quality, 2) build user trust progressively, and 3) learn and improve from use. We describe our iterative design process, model development, and key features. Through interviews, we relate our design choices to the overall user experience. Implications for future human-AI interaction design are discussed with respect to trust, explanations, learning from use, and collaboration strategies.
Supervised machine learning approaches commonly require training data that is both plentiful and of high quality. In applications that depend on human-labeled data, especially from experts, or that depend on contextual knowledge for training data sets, the human-in-the-loop presents a serious bottleneck to the scalability of training efforts. Even when human labeling is generally feasible, sustaining human performance and obtaining high-quality labels in larger quantities is challenging. Interactive machine learning can help solve usability problems in traditional machine learning by giving users agency in deciding how systems learn from data. Yet the field lacks clear design guidelines for such interfaces, specifically regarding the scaling of training processes. In this paper, we present results from a pilot study in which participants interacted with several interface variants of a recommender engine and evaluated them on interaction and efficiency parameters. Based on the performance of these implementations, we propose guidelines for the design of such systems and a score for comparative evaluation that combines interaction experience and system learning efficiency into one relative scoring unit.
Open-domain chatbots engage in natural conversations with the user to socialize and establish bonds. However, designing and developing an effective open-domain chatbot is challenging. It is unclear what qualities of such chatbots most correspond to users’ expectations. Even though existing work has considered a wide range of aspects, some key components are still missing. More importantly, the consistency and validity of the combined criteria have not been tested. In this paper, we describe a large-scale survey using a consolidated model to elicit users’ preferences, expectations, and concerns. We apply structural equation modeling methods to further validate the data collected from the user survey. The outcome supports the consistency, validity, and reliability of the model, which we call PEACE (Politeness, Entertainment, Attentive Curiosity, and Empathy). PEACE, therefore, defines the key determinants most predictive of user acceptance. This has allowed us to develop a set of implications useful for the development of compelling open-domain chatbots.
Building a perfect knowledge base in a given domain is practically impossible, so it is effective for dialogue systems to acquire knowledge for enhancing an imperfect knowledge base through natural language dialogues with users. This paper proposes a framework for selecting questions for such knowledge acquisition when a knowledge graph is used as the knowledge base. The framework uses knowledge graph completion (KGC) to predict new links that are likely to be correct and selects questions on the basis of the KGC scores. One problem with this framework is that questions with incorrect content might be selected, which often occurs when link prediction performance is low, and this would reduce users’ willingness to engage in dialogues. To alleviate this problem, this paper presents two modifications to the KGC training: 1) creating pseudo entities having substrings of the names of the entities in the graph, so that entities whose names share substrings are connected, and 2) limiting the range of negative sampling. Cross-validation experiments showed that these modifications improved KGC performance. We also conducted a user study with crowdsourcing to investigate the subjective perception of the correctness of the predicted links. The results suggest that the model trained with the modifications is capable of avoiding questions with incorrect content.
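A minimal sketch of score-based question selection, using a TransE-style scoring function as a stand-in for the trained KGC model; the entities, relations, and embeddings below are hypothetical placeholders, and the paper's two training modifications are not shown.

```python
import numpy as np

# Hypothetical entity and relation embeddings; in practice these would be trained
# on the existing knowledge graph.
rng = np.random.default_rng(0)
entities = {e: rng.normal(size=50)
            for e in ["sushi_bar_kanda", "kanda_station", "tokyo", "ramen_ya"]}
relations = {r: rng.normal(size=50) for r in ["located_near", "located_in"]}

def transe_score(head, relation, tail):
    """Higher is more plausible: negative L2 distance of h + r from t."""
    return -float(np.linalg.norm(entities[head] + relations[relation] - entities[tail]))

def select_questions(candidate_links, top_n=2):
    """Ask the user only about the most plausible predicted links."""
    ranked = sorted(candidate_links, key=lambda link: transe_score(*link), reverse=True)
    return ranked[:top_n]

candidates = [("sushi_bar_kanda", "located_near", "kanda_station"),
              ("sushi_bar_kanda", "located_in", "tokyo"),
              ("ramen_ya", "located_near", "tokyo")]
for h, r, t in select_questions(candidates):
    print(f"Is it correct that {h} {r} {t}?")
```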
Sexual harassment (SH) incidents are increasing and call into question the effectiveness of traditional SH prevention training. In this paper, we introduce a proof-of-concept design of a conversational interface (CI) for understanding SH cases. Key features of the interface include that it engages the learner in a dyadic conversation, prompts the learner for guidance, and tells a story of SH from a first-person perspective. In a mixed-methods study (N=32), learners experiencing an SH vignette through the conversational interface reported feeling less overwhelmed by the content, more engaged with the situation, and more comfortable discussing the topic compared to reading the same vignette online. Participants also reported that the first-person narrative made the vignette feel realistic and relatable. However, there was no difference in empathy between the conditions. We discuss these results and implications for designing effective SH prevention training.
Patients’ understanding of forthcoming dental surgeries is essential to patient-centered care and helps reduce anxiety. Due to the complexity of dental surgeries and the patient-dentist expertise gap, conventional patient education techniques are usually not effective for explaining surgical steps. In this paper, we present OralViewer—the first interactive application that enables dentists to demonstrate dental surgeries in 3D to promote patients’ understanding. OralViewer takes a single 2D panoramic dental X-ray to reconstruct patient-specific 3D teeth structures, which are then assembled with registered gum and jaw bone models for complete oral cavity modeling. During the demonstration, OralViewer enables dentists to show surgery steps with virtual dental instruments that animate effects on the 3D model in real time. A technical evaluation shows that our deep learning model achieves a mean Intersection over Union (IoU) of 0.771 for 3D teeth reconstruction. A patient study with 12 participants shows that OralViewer can improve patients’ understanding of surgeries. A preliminary expert study with 3 board-certified dentists further verifies the clinical validity of our system.
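The reconstruction metric reported above can be illustrated with a short voxel-based IoU computation; the occupancy grids below are placeholders, and the paper's evaluation may compute IoU over a different 3D representation.

```python
import numpy as np

def voxel_iou(pred, target):
    """Intersection over Union between two binary occupancy grids."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

# Placeholder grids standing in for a reconstructed tooth and its ground truth.
rng = np.random.default_rng(0)
pred = rng.random((64, 64, 64)) > 0.5
target = rng.random((64, 64, 64)) > 0.5
print(round(voxel_iou(pred, target), 3))
```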
Coping with stress is critical to mental health. Prolonged mental stress is the psychological and physiological response to a high frequency of, or continuous, stressors, and it has a negative impact on health. This paper presents a virtual stress management training program that uses biofeedback derived from the cardiovascular response of heart rate variability (HRV), with an interactive social agent as the biofeedback trainer. The evaluation includes both a subject-matter expert interview and an experiment with 71 participants. In the experiment, we compared our novel stress management training to a stress management training based on stress diaries. The results indicate that our social agent-based stress management training using biofeedback significantly decreased self-assessed stress levels immediately after the training, as well as in a socially stressful task. Moreover, we found a significant correlation between stress level and the assessment of one’s performance in a socially stressful task. Participants who received our training rated their performance higher than participants who used stress diaries. Taken together, our novel virtual stress management training with an interactive social agent as a trainer can be considered a valid method for learning techniques to cope with stressful situations.
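As an illustration of the biofeedback signal involved, the sketch below computes RMSSD, a common time-domain HRV index, from placeholder RR intervals; the training described above may rely on other HRV measures to drive the agent's feedback.

```python
import numpy as np

def rmssd(rr_intervals_ms):
    """Root mean square of successive RR-interval differences, a standard HRV index."""
    diffs = np.diff(np.asarray(rr_intervals_ms, dtype=float))
    return float(np.sqrt(np.mean(diffs ** 2)))

# Placeholder RR intervals (ms); in the training these would stream from a
# heart-rate sensor and update the agent's feedback in real time.
rr = [812, 845, 790, 860, 838, 805]
print(round(rmssd(rr), 1))
```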
Nearly half of people prescribed medication to treat chronic or short-term conditions do not take their medicine as prescribed. This leads to worse treatment outcomes, higher hospital admission rates, increased healthcare costs, and increased morbidity and mortality rates. While some instances of medication non-adherence result from problems with the treatment plan or barriers caused by the health care provider, many are caused by patient-related factors such as forgetting, running out of medication, and not understanding the required dosages. This presents a clear need for patient-centered systems that can reliably increase medication adherence. To that end, in this work we describe an activity recognition system capable of recognizing when individuals take medication in an unconstrained, real-world environment. Our methodology uses a modified version of the Bagging ensemble method suited to unbalanced data, together with a classifier trained on the prediction probabilities of the Bagging classifier, to identify when individuals took medication during a full-day study. Using this methodology, we are able to recognize when individuals took medication with an F-measure of 0.77. Our system is a first step towards developing personal health interfaces that are capable of providing personalized medication adherence interventions.
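A hedged sketch of this two-stage idea: train each bagged tree on a class-balanced subsample, then fit a second classifier on the ensemble's prediction probabilities. The synthetic data, tree learner, and logistic-regression meta-classifier are stand-ins, not the paper's exact modification of Bagging.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder imbalanced data standing in for sensor features (positive class =
# medication-taking moments, which are rare over a full day).
X, y = make_classification(n_samples=4000, n_features=30, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def fit_balanced_bagging(X, y, n_estimators=25, seed=0):
    """Train each tree on all minority samples plus an equal-size majority subsample."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    trees = []
    for _ in range(n_estimators):
        idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
        trees.append(DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx]))
    return trees

def bagging_proba(trees, X):
    """Average positive-class probability across the ensemble, as a single feature."""
    return np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0).reshape(-1, 1)

trees = fit_balanced_bagging(X_tr, y_tr)
meta = LogisticRegression().fit(bagging_proba(trees, X_tr), y_tr)  # second-stage classifier
print(meta.predict(bagging_proba(trees, X_te))[:10])
```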
Data scientists face a steep learning curve in understanding a new domain for which they want to build machine learning (ML) models. While input from domain experts could offer valuable help, such input is often limited, expensive, and generally not in a form readily consumable by a model development pipeline. In this paper, we propose Ziva, a framework to guide domain experts in sharing essential domain knowledge with data scientists for building NLP models. With Ziva, experts are able to distill and share their domain knowledge using domain concept extractors and five types of label justification over a representative data sample. The design of Ziva is informed by preliminary interviews with data scientists, conducted to understand current practices in the domain-knowledge acquisition process for ML development projects. To assess our design, we ran a mixed-methods case study to evaluate how Ziva can facilitate interaction between domain experts and data scientists. Our results highlight that (1) domain experts are able to use Ziva to provide rich domain knowledge while maintaining low mental load and stress levels, and (2) data scientists find Ziva’s output helpful for learning essential information about the domain, offering scalability of information, and lowering the burden on domain experts to share knowledge. We conclude this work by experimenting with building NLP models using the Ziva output for our case study.
Designing natural language interfaces for querying databases remains an important goal pursued by researchers in natural language processing, databases, and HCI. These systems receive natural language as input, translate it into a formal database query, and execute the query to compute a result. Because the responses from these systems are not always correct, it is important to provide people with mechanisms to assess the correctness of the generated query and the computed result. However, this assessment can be challenging for people who lack expertise in query languages. We present Debug-It-Yourself (DIY), an interactive technique that enables users to assess the responses from a state-of-the-art natural language to SQL (NL2SQL) system for correctness and, if possible, fix errors. DIY provides users with a sandbox where they can interact with (1) the mappings between the question and the generated query, (2) a small-but-relevant subset of the underlying database, and (3) a multi-modal explanation of the generated query. End-users can then employ a back-of-the-envelope calculation debugging strategy to evaluate the system’s response. Through an exploratory study with 12 users, we investigate how DIY helps users assess the correctness of the system’s answers and detect and fix errors. Our observations reveal the benefits of DIY while providing insights about end-user debugging strategies and underscore opportunities for further improving the user experience.
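The small-but-relevant-subset component can be illustrated with a toy sandbox: run the generated SQL against an in-memory slice of the database whose result the user can verify by hand. The table, rows, and query below are hypothetical, not DIY's actual data or interface.

```python
import sqlite3

# Hypothetical slice of the underlying database: a handful of rows the user can
# check with a back-of-the-envelope calculation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 250.0), ("west", 80.0)])

# Query produced by the NL2SQL system for "total sales per region".
generated_sql = "SELECT region, SUM(amount) FROM sales GROUP BY region"
for row in conn.execute(generated_sql):
    print(row)  # the user compares these totals against the three visible rows above
```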
Online research is a frequent and important activity people perform on the Internet, yet current support for this task is basic, fragmented, and not well integrated into web browser experiences. Guided by sensemaking theory, we present ForSense, a browser extension for accelerating people’s online research experience. The two primary sources of novelty in ForSense are the integration of multiple stages of online research and the provision of machine assistance to the user by leveraging recent advances in neural-driven machine reading. We use ForSense as a design probe to explore (1) the benefits of integrating multiple stages of online research, (2) the opportunities to accelerate online research using current advances in machine reading, and (3) the opportunities to support online research tasks in the presence of imprecise machine suggestions. In our study, we observe people performing online research tasks and see that they benefit from ForSense’s integration and machine support for online research. From our study, we derive and share key recommendations for designing and supporting imprecise machine assistance for research tasks.