Overview
Benchmark model for detecting and categorizing robot failures from users' facial reactions. The model accounts for inter-dependent interaction sessions and models longitudinal user adaptation across repeated failures. AUC of 0.90 across 41 participants.
Dataset
Multimodal dataset of crash cart robot failures. In the crash cart task, the user asks "where is item?" and the robot should indicate the drawer and lights. Failures are intentional and fall into four types: timing delay, speech/incorrect answer, comprehension, and search. Each participant completes 5 trials (1 success + 4 failure types) in random order. Data includes MediaPipe features (AU, gaze, landmarks, pose), metadata, survey responses, and transcripts.
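As a rough illustration of the trial design, the sketch below draws a reproducible random order of the five conditions for one participant and counts how many failures precede each trial. The condition identifiers, function names, and seeding scheme are assumptions made for this example, not the dataset's actual schema or the study's randomization procedure.

```python
import random

# Condition names paraphrase the dataset description; the identifiers are assumptions.
FAILURE_TYPES = ["timing_delay", "speech_incorrect_answer", "comprehension", "search"]
CONDITIONS = ["success"] + FAILURE_TYPES


def make_trial_order(participant_id: int, seed: int = 0) -> list[str]:
    """Return a reproducible random order of the 5 conditions for one participant."""
    # Hypothetical seeding scheme: one independent generator per participant.
    rng = random.Random(seed * 1_000_003 + participant_id)
    order = CONDITIONS.copy()
    rng.shuffle(order)
    return order


def failures_before_each_trial(order: list[str]) -> list[int]:
    """Count how many failures the participant has already seen before each trial."""
    counts, seen = [], 0
    for condition in order:
        counts.append(seen)
        if condition != "success":
            seen += 1
    return counts
```

The failure count per trial is one simple way to expose trial position as a feature when analyzing how reactions change over a session.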
Key Contribution
Trial order affects reactions: the first failure triggers surprise, while the fourth triggers frustration. No prior work models this cross-trial history dependence computationally. The history-aware HRNN architecture processes the embeddings and labels of previous trials to provide session context.
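The idea of conditioning on session history can be sketched as follows. This is a minimal, untrained forward pass using a plain recurrent update over previous trials; it is not the HRNN described above, and the class name, layer sizes, and random weights are all assumptions chosen for illustration.

```python
import numpy as np

N_CLASSES = 5  # 1 success + 4 failure types, per the dataset description


class HistoryAwareClassifier:
    """Illustrative only: a recurrent summary of earlier trials conditions the current prediction."""

    def __init__(self, embed_dim, hidden_dim=16, n_classes=N_CLASSES, seed=0):
        rng = np.random.default_rng(seed)
        self.n_classes = n_classes
        self.hidden_dim = hidden_dim
        in_dim = embed_dim + n_classes  # previous embedding + one-hot previous label
        # Random, untrained weights (assumption: a real model would learn these).
        self.W_in = rng.normal(0.0, 0.1, (hidden_dim, in_dim))
        self.W_h = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        self.W_out = rng.normal(0.0, 0.1, (n_classes, hidden_dim + embed_dim))

    def session_context(self, prev_embeddings, prev_labels):
        """Fold previous (embedding, label) pairs into one context vector."""
        h = np.zeros(self.hidden_dim)
        for emb, label in zip(prev_embeddings, prev_labels):
            one_hot = np.eye(self.n_classes)[label]
            h = np.tanh(self.W_in @ np.concatenate([emb, one_hot]) + self.W_h @ h)
        return h

    def predict_proba(self, current_embedding, prev_embeddings, prev_labels):
        """Class probabilities for the current trial given the session history."""
        h = self.session_context(prev_embeddings, prev_labels)
        logits = self.W_out @ np.concatenate([h, current_embedding])
        exp = np.exp(logits - logits.max())  # numerically stable softmax
        return exp / exp.sum()
```

With an empty history the context vector is all zeros, so the first trial in a session is classified from its own embedding alone; later trials receive a context that depends on what came before.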
Statistical Analysis
Analyzed 214 embodied HRI sessions, with feature engineering over 4,000+ verbal and nonverbal features. Applied repeated-measures statistical tests (permutation tests, GEE, Friedman) to analyze robot failures and the recovery strategies that restore user trust.
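One of the repeated-measures tests named above, a paired permutation test, can be sketched with the standard library alone. The version below uses sign flipping of within-participant differences, which is one common variant and not necessarily the one used in this analysis. The function name and defaults are assumptions.

```python
import random
from statistics import mean


def paired_permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired measurements.

    a[i] and b[i] are the same participant's values under two conditions.
    Returns an estimated p-value for the null hypothesis of no difference.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have one value per participant each")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null, each participant's difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)
```

Permuting within participants rather than across them respects the paired structure of the data, which is why this kind of test suits designs where each participant sees every condition.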
