Detecting Unexpected AI Behavior from Human Cues

Developing multimodal datasets and benchmarking models, including large language models, to detect user responses to unexpected AI behavior.

Modern AI-powered in-vehicle systems, from autonomous driving assistants to conversational agents, can behave in ways that violate user expectations. These mismatches often trigger subtle but important emotional responses—surprise, confusion, or frustration—that directly influence trust, usability, and adoption. This project develops methods and datasets to automatically detect such responses from human cues.

Driving Simulator Study

We conducted a within-participants study (N=30) using a Unity-based driving simulator in which participants rode in a fully autonomous vehicle while engaging in a secondary word puzzle task. Participants experienced three crafted interaction scenarios—music requests, food ordering, and safety alerts—with unexpected system behaviors deliberately designed to elicit surprise, confusion, or frustration. Three synchronized cameras (wheel, rearview mirror, and A-pillar), audio channels, and heart rate sensors recorded multimodal data. A manipulation check confirmed that these scenarios successfully induced the intended emotional responses.

Participants’ facial expressions, voice, and head pose revealed distinct markers of surprise, confusion, and frustration. These multimodal signals can enable adaptive AI systems to respond in real time.

Dataset and Analysis

The study produced a validated multimodal, multi-camera dataset (DRIVE) containing over 1,000 annotated interactions. Analyses revealed clear distinctions in self-reports and facial features:

  • Surprise: brief high unexpectedness, low confusion/frustration.
  • Confusion: sustained unexpectedness with ambiguity, moderate frustration.
  • Frustration: repeated unmet goals, high frustration intensity.
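
To make the structure of one annotated interaction concrete, the sketch below shows a hypothetical record layout in Python; the field names, array shapes, and rating keys are illustrative assumptions, not the released DRIVE schema.

```python
from dataclasses import dataclass
from typing import Dict
import numpy as np

@dataclass
class DriveInteraction:
    """One annotated interaction from the DRIVE dataset (hypothetical schema)."""
    participant_id: str            # e.g. "P07"
    scenario: str                  # "music", "food_ordering", or "safety_alert"
    video_paths: Dict[str, str]    # camera name -> clip path ("wheel", "rear_mirror", "a_pillar")
    audio_path: str                # synchronized cabin audio
    au_intensities: np.ndarray     # (T_frames, n_AUs) facial action unit intensities
    mfcc: np.ndarray               # (T_audio_frames, n_mfcc) audio features
    heart_rate: np.ndarray         # (T_hr,) beats per minute over the interaction
    self_report: Dict[str, float]  # e.g. {"unexpectedness": 6, "confusion": 2, "frustration": 1}
    label: str                     # "neutral", "surprise", "confusion", or "frustration"
```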

Key facial action units (AUs), such as AU25 (Lips Part), AU10 (Upper Lip Raiser), AU23 (Lip Tightener), and AU45 (Blink), reliably distinguished between the three states. Audio features (MFCCs) added complementary signals.
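
As a rough illustration of how such per-interaction features could be assembled, the sketch below aggregates action unit intensities from an OpenFace-style CSV and MFCCs computed with librosa. The choice of OpenFace and librosa, the column names, and the mean/std aggregation are assumptions for illustration, not a description of the study's actual pipeline.

```python
import numpy as np
import pandas as pd
import librosa

# Action units highlighted in the analysis (OpenFace-style intensity columns assumed).
AU_COLS = ["AU10_r", "AU23_r", "AU25_r", "AU45_r"]

def facial_features(openface_csv: str) -> np.ndarray:
    """Aggregate per-frame AU intensities into a fixed-length vector (mean and std per AU)."""
    df = pd.read_csv(openface_csv)
    df.columns = [c.strip() for c in df.columns]  # OpenFace pads column names with spaces
    aus = df[AU_COLS].to_numpy()
    return np.concatenate([aus.mean(axis=0), aus.std(axis=0)])

def audio_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Aggregate MFCCs over the interaction into mean and std per coefficient."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, T)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def interaction_features(openface_csv: str, wav_path: str) -> np.ndarray:
    """Fused facial + audio feature vector for one annotated interaction."""
    return np.concatenate([facial_features(openface_csv), audio_features(wav_path)])
```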

Benchmarking Models

We benchmarked several approaches for automatic classification of user responses:

  • XGBoost (using aggregated AU and audio features) performed best on accuracy, F1, and precision-recall.
  • LSTMs and Transformers (using raw temporal sequences) captured sequential dependencies but lagged behind tree-based methods on tabular-like features.
  • Average detection rate was highest for Transformer models, indicating stronger temporal sensitivity.

Crucially, combining facial and audio features consistently improved performance across methods.
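
The following is a minimal sketch of the tree-based baseline on fused, aggregated features. It assumes feature vectors built as in the earlier sketch, labels encoded as integers 0–3, and illustrative hyperparameters rather than the values used in the benchmark.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

STATES = ["neutral", "surprise", "confusion", "frustration"]

def train_baseline(X: np.ndarray, y: np.ndarray) -> XGBClassifier:
    """Fit an XGBoost classifier on fused facial + audio feature vectors (y encoded 0-3)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    model = XGBClassifier(
        n_estimators=300,   # illustrative hyperparameters, not the benchmarked values
        max_depth=4,
        learning_rate=0.05,
    )
    model.fit(X_tr, y_tr)
    print(classification_report(y_te, model.predict(X_te), target_names=STATES))
    return model
```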

Large Language Models (LLMs)

We also explored multimodal large language models (MLLMs), providing synchronized video frames, facial features, and audio features as input. GPT-5, prompted with few-shot examples, classified user states into neutral, surprise, confusion, and frustration. Beyond raw accuracy, LLMs offered interpretable reasoning about observed behaviors, suggesting future directions for explainable affect detection.
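
The snippet below sketches this prompting setup with the OpenAI Python client; the model identifier, prompt wording, and frame sampling are placeholders, and the few-shot examples used in the study are omitted for brevity.

```python
import base64
from openai import OpenAI

client = OpenAI()
STATES = ["neutral", "surprise", "confusion", "frustration"]

def encode_frame(path: str) -> dict:
    """Pack one sampled video frame as a base64 image content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def classify_state(frame_paths: list[str], au_summary: str, audio_summary: str) -> str:
    """Ask a multimodal LLM to classify the user's state and explain its reasoning."""
    prompt = (
        "You observe a rider reacting to an in-vehicle AI system. "
        f"Facial action unit summary: {au_summary}. Audio feature summary: {audio_summary}. "
        f"Classify the rider's state as one of {STATES} and briefly justify your answer."
    )
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder identifier; any multimodal chat model with image input works
        messages=[{
            "role": "user",
            "content": [{"type": "text", "text": prompt}, *[encode_frame(p) for p in frame_paths]],
        }],
    )
    return response.choices[0].message.content
```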

Broader Implications

The results show that systems can go beyond binary error detection to differentiate fine-grained user states. This opens the door to adaptive strategies (see the sketch after this list):

  • Explaining actions when detecting confusion.
  • Offering alternative options when detecting frustration.
  • Timing interventions based on surprise.
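
As a toy illustration of how a detected state could be routed to one of these strategies, the dispatch below is a hypothetical sketch; the handler names and behaviors are ours for illustration and are not part of the study.

```python
from typing import Callable, Dict

def explain_last_action(ctx: dict) -> str:
    return f"I chose '{ctx['last_action']}' because {ctx['reason']}."

def offer_alternatives(ctx: dict) -> str:
    options = ", ".join(ctx["alternatives"])
    return f"That didn't work as you expected. Would you prefer: {options}?"

def defer_intervention(ctx: dict) -> str:
    return ""  # surprise alone: wait briefly before intervening rather than interrupting

# Detected user state -> adaptation strategy (hypothetical mapping).
POLICY: Dict[str, Callable[[dict], str]] = {
    "confusion": explain_last_action,
    "frustration": offer_alternatives,
    "surprise": defer_intervention,
    "neutral": lambda ctx: "",
}

def adapt(state: str, ctx: dict) -> str:
    return POLICY.get(state, lambda c: "")(ctx)
```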

This project contributes a validated, publicly available dataset, systematic benchmarking of popular models, and first steps in applying multimodal LLMs for affect detection in autonomous vehicles and other interactive AI systems.

Building on these findings, we are developing a real-time event detection system that continuously monitors human behavioral signals such as facial expressions, audio prosody, and physiological indicators to identify moments when users perceive AI behavior as unexpected. To capture the nuanced and temporally extended nature of human responses, we are exploring temporal models, including RNN-based architectures and Transformer models, that can fuse multimodal inputs and detect subtle shifts in user state over time. This cross-modal detector will enable AI systems—particularly in safety-critical or high-trust domains—to adapt their explanations, interventions, or behaviors dynamically based on the user's inferred awareness and emotional context.
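
A minimal PyTorch sketch of such a cross-modal temporal detector is shown below. It assumes facial and audio feature sequences already aligned to a common frame rate, and the dimensions and architecture choices are illustrative rather than a finalized design.

```python
import torch
import torch.nn as nn

class CrossModalDetector(nn.Module):
    """Fuse per-frame facial and audio features and classify the user's state over a window."""

    def __init__(self, face_dim: int = 17, audio_dim: int = 13, d_model: int = 64,
                 n_heads: int = 4, n_layers: int = 2, n_states: int = 4):
        super().__init__()
        self.proj = nn.Linear(face_dim + audio_dim, d_model)  # early fusion by concatenation
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_states)              # neutral/surprise/confusion/frustration

    def forward(self, face_seq: torch.Tensor, audio_seq: torch.Tensor) -> torch.Tensor:
        # face_seq: (batch, T, face_dim), audio_seq: (batch, T, audio_dim), aligned in time
        x = self.proj(torch.cat([face_seq, audio_seq], dim=-1))
        h = self.encoder(x)                                   # (batch, T, d_model)
        return self.head(h.mean(dim=1))                       # pool over time -> state logits

# Example: score a 5-second window sampled at 30 Hz (150 frames).
model = CrossModalDetector()
logits = model(torch.randn(1, 150, 17), torch.randn(1, 150, 13))
print(logits.softmax(dim=-1))
```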

This line of research advances human-centered AI by improving not only how systems explain themselves, but also when and why they choose to explain, anchored in real-time understanding of user mental and emotional states.
