AIR-FM: Assessing and Improving Reliability of Foundation Models in the Real World Workshop@AAAI 2026

Accepted Papers

  1. Task Interference in VLMs for Autonomous Driving: When Better Perception Hurts Planning.

  2. Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning.

  3. LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval.

  4. Next-Frame Prediction as a Reliability-Aware Training Paradigm for Robust Vision Encoders.

  5. Future Is Unevenly Distributed: Forecasting Ability Of LLMs Depends On What We’re Asking.

  6. Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models.

  7. Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction.

  8. AlignVQA: Debate-Driven Multi-Agent Calibration for Vision Language Models.

  9. Can In-Context Learning Defend against Backdoor Attacks to LLMs?

  10. VISOR: Visual Input based Steering for Output Redirection in Large Vision Language Models.

  11. Beyond Grey-Box Assumptions: Uncertainty-Guided Example Selection for Black-Box Language Models.

  12. Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models.

  13. Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models.

  14. SafeGen: Benchmarking Inference-Time Methods for Privacy-Preserving Text Generation.

  15. BLUFF-1000: Measuring Uncertainty Expression in RAG.

  16. Reasoning Models are Test Exploiters: Rethinking Multiple Choice.

  17. Optimizing Chain-of-Thought Confidence via Topological and Dirichlet Risk Analysis.

  18. Know Or Not: a library for systematically evaluating out-of-knowledge base robustness.

  19. Beyond Blind Spots: Analytic Hints for Mitigating LLM-Based Evaluation Pitfalls.

  20. Black-Box Uncertainty Quantification for Large Language Models via Ensemble-of-Ensembles.

  21. Prompt-Adaptive Quantization: Adaptive Per-Prompt Routing for Efficient LLM Inference.

  22. BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision Language Models.

  23. COMPASS: Context-Modulated PID Attention Steering System for Hallucination Mitigation.