AIR-FM: Assessing and Improving Reliability of Foundation Models in the Real World Workshop@AAAI 2026

Overview

Welcome to the AAAI 2026 Workshop! Despite remarkable advances in capability, foundation models such as LLMs and VLMs face fundamental challenges in maintaining reliability under real-world conditions. Their stochastic nature and sensitivity to context make them vulnerable to distribution shifts, sensor noise, hallucinations, overconfidence, and prompt variability. These issues limit safe deployment in critical domains like healthcare, law, robotics, and autonomous driving.

This workshop will serve as a forum for researchers and practitioners to discuss definitions, metrics, and methods for reliability quantification, explore principled evaluation frameworks, and propose strategies to enhance robustness and trustworthiness across language and vision tasks. By bridging the LLM and VLM communities, we aim to foster cross-domain insights, stimulate the creation of realistic stress-test datasets, and encourage approaches that ensure dependable performance in operational settings.

Topics of Interest:

We welcome original contributions from probabilistic machine learning, statistics, engineering, NLP, HCI, and related fields. Submissions may address (but are not limited to) the following topics:

  • Failure Mode Analysis: Characterizing unreliability in LLMs and VLMs under real-world conditions, including domain shifts, adversarial inputs, and sensor degradation.

  • Reliability-Centered Datasets: Designing datasets to expose vulnerabilities, long-tail phenomena, or multi-modal inconsistencies.

  • Metrics and Evaluation Frameworks: Developing measures that capture robustness, calibration, and generalization beyond accuracy or average precision.

  • Reliability-Aware Architectures and Training: Model designs and learning paradigms that explicitly target dependable performance in realistic scenarios.

  • Uncertainty Estimation and Detection: Predicting, detecting, and mitigating unreliable outputs before deployment.

  • Security, Hallucination, and Prompt Sensitivity: Red-teaming, jailbreak detection, and methods to reduce context-driven unreliability.

  • Cross-Domain Reliability Insights: Lessons and techniques transferable between language-only, vision-only, and multi-modal systems.

For further information or any inquiries, please contact: air-fm@googlegroups.com.