Unpacking AI Explanations: Why Voice Outperforms Text for Building User Trust
Explore a new information-theoretic framework comparing voice vs. text for AI explainability. Discover how multimodal delivery enhances user comprehension and trust calibration in enterprise AI solutions.
The Untapped Potential of Multimodal AI Explanations
Explainable Artificial Intelligence (XAI) is paramount for fostering transparency and trust in machine learning systems. As AI becomes increasingly integrated into critical business operations, understanding why an AI makes a particular decision is no longer a luxury but a necessity. While current XAI methods predominantly rely on visual aids and text-based explanations, a recent academic paper sheds light on a crucial, often overlooked aspect: the modality of explanation delivery. Researchers Mona Rajhans and Vishal Khawarey, in their paper “An Information-Theoretic Framework for Comparing Voice and Text Explainability,” accepted for publication at the 10th ACM International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence (ISMSI 2026), propose a novel framework for analyzing how voice and text explanations affect user comprehension and how well users calibrate their trust in AI.
The core challenge lies in the fact that human communication is inherently multimodal. We don't just read; we listen, observe, and interact. Yet, AI explanations have largely ignored the power of auditory communication. This oversight is significant, especially in high-stakes environments like financial advising or healthcare diagnostics, where the way an explanation is presented can profoundly influence critical human decisions and outcomes. The framework developed in this research provides a quantitative lens through which to evaluate these modality effects, modeling explanation delivery as an information transmission process and defining key metrics to assess its quality.
Bridging the Modality Gap in AI Explainability
The existing landscape of XAI has primarily focused on generating explanations that can be seen or read. This includes techniques like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations), which attribute the importance of input features to an AI's prediction. However, as noted by prior research from Akman and Schuller, audio explanations remain underexplored, despite their potential to be more intuitive and expressive for end-users. The intuitive appeal of voice—its ability to convey nuance and emotion—suggests it could significantly enhance user engagement and trust, areas where textual explanations might fall short.
Consider a scenario where an AI system flags a potential safety hazard in a manufacturing plant, or a suspicious financial transaction. A quickly delivered, clear voice explanation might prompt immediate action and instill confidence faster than reading a detailed text report. Conversely, a complex analytical task might benefit from the ability to review and re-read textual explanations at one's own pace. This highlights the need for a deeper understanding of how different modalities interact with the fidelity of explanations and the cognitive load they place on the user. Understanding this dynamic is crucial for AI Video Analytics solutions, for instance, where real-time situational awareness is critical and explanations need to be both fast and trustworthy.
Quantifying Explanation Quality: An Information-Theoretic Approach
To address this gap, the researchers formalized the process of delivering XAI content as an information-transmission problem. This involves treating the AI's core explanation (represented as an "attribution vector" showing feature importance, like SHAP or LIME values) as the source, and the user's understanding as the destination. The explanation is encoded into a message, influenced by the chosen modality (voice or text) and style (brief, detailed, or analogy-based). The user then forms a mental representation of this explanation, constrained by their cognitive capacity.
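To make the transmission view concrete, here is a minimal Python sketch of that pipeline: the attribution vector is the source, the modality-style pair determines how much of it survives encoding and how much distortion delivery adds, and the user's limited working memory further filters the message. The channel parameters, function names (encode_message, perceive), and the capacity value are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical modality-style channel parameters (illustrative assumptions,
# not values from the paper): how much of the attribution survives encoding
# and how much distortion the delivery adds.
CHANNEL = {
    ("text", "detailed"): {"keep": 1.00, "noise": 0.05},
    ("text", "brief"):    {"keep": 0.60, "noise": 0.10},
    ("voice", "brief"):   {"keep": 0.50, "noise": 0.20},
    ("voice", "analogy"): {"keep": 0.70, "noise": 0.15},
}

def encode_message(attribution, modality, style):
    """Encode a SHAP/LIME-style attribution vector into the delivered message.
    Briefer or spoken delivery keeps fewer features and adds more distortion."""
    cfg = CHANNEL[(modality, style)]
    k = max(1, int(cfg["keep"] * attribution.size))
    top = np.argsort(np.abs(attribution))[-k:]          # keep only the top-k features
    message = np.zeros_like(attribution)
    message[top] = attribution[top]
    return message + rng.normal(0.0, cfg["noise"], size=attribution.shape)

def perceive(message, capacity=5):
    """User's mental representation: only `capacity` features fit in working memory."""
    kept = np.argsort(np.abs(message))[-capacity:]
    mental = np.zeros_like(message)
    mental[kept] = message[kept]
    return mental

# A synthetic 10-feature attribution vector passed through the voice/analogy channel.
source = rng.normal(0.0, 1.0, size=10)
mental = perceive(encode_message(source, "voice", "analogy"))
```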
The framework introduces four key quantitative metrics:
- Information Retention (I_M): This metric measures how much of the original, true information from the AI's explanation is preserved in the user's understanding. A higher value indicates better fidelity and less distortion in the message.
- Cognitive Load (L): This quantifies the mental effort required by the user to process the explanation. It considers factors like the duration of the explanation (e.g., audio length or word count) and the conceptual complexity of the message. A higher load can lead to decreased comprehension.
- Comprehension Efficiency (CE): Defined as the ratio of retained information to cognitive load, CE assesses how much useful information a user gains per unit of mental effort. A higher CE means the explanation is more efficient at conveying understanding.
- Trust Calibration Error (TCE): This metric measures the discrepancy between a user's trust in the AI's prediction and the objective correctness of that prediction. Ideally, user trust should perfectly align with the AI's reliability; a smaller TCE indicates better-calibrated trust, while trust that runs markedly higher or lower than the AI's actual performance produces a larger TCE.
Finally, to offer a holistic evaluation, the researchers introduced an Overall Evaluation Function (Φ), which combines Comprehension Efficiency and Trust Calibration Error into a single composite score. This allows for identifying the optimal modality-style combination that offers the best trade-off between how well users understand the AI and how appropriately they trust it. This detailed evaluation mechanism provides robust metrics for any organization seeking to enhance human-AI interaction in their solutions, such as those leveraging ARSA AI Box Series for various edge AI applications.
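The following sketch shows one plausible way to express the four metrics and the composite score Φ in Python. The specific formulas (a normalised inner product for I_M, a weighted sum for L, a weighted difference for Φ) and all weights are assumptions made for illustration; the paper may define its estimators differently.

```python
import numpy as np

def information_retention(source, mental):
    """I_M: how much of the source attribution survives in the user's mental model.
    Sketched as a normalised inner product; the paper's exact estimator may differ."""
    denom = np.linalg.norm(source) * np.linalg.norm(mental)
    return 0.0 if denom == 0 else float(abs(np.dot(source, mental)) / denom)

def cognitive_load(duration, complexity, alpha=1.0, beta=1.0):
    """L: effort modelled as a weighted sum of message length (seconds or words)
    and conceptual complexity; the weights are assumptions."""
    return alpha * duration + beta * complexity

def comprehension_efficiency(i_m, load):
    """CE = I_M / L: information retained per unit of mental effort."""
    return i_m / load

def trust_calibration_error(trust, correctness):
    """TCE: mean absolute gap between the user's reported trust and the
    objective correctness of the corresponding predictions."""
    return float(np.mean(np.abs(np.asarray(trust) - np.asarray(correctness))))

def overall_score(ce, tce, w_ce=1.0, w_tce=1.0):
    """Phi: rewards comprehension efficiency and penalises trust miscalibration."""
    return w_ce * ce - w_tce * tce

# Toy numbers, purely to show how the pieces compose.
i_m = information_retention(np.array([0.9, -0.4, 0.2]), np.array([0.8, -0.3, 0.0]))
ce = comprehension_efficiency(i_m, cognitive_load(duration=30, complexity=3))
tce = trust_calibration_error(trust=[0.8, 0.6, 0.9], correctness=[1, 1, 0])
print(overall_score(ce, tce))
```

In practice, the trust and correctness values would come from user studies or simulated user models rather than the toy numbers above.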
Key Findings: Text for Comprehension, Voice for Trust
The simulation framework, implemented in Python with synthetic SHAP-based feature attributions, evaluated these metrics across different modality-style configurations (a toy sketch of such a sweep follows the findings below). The results underscore the distinct strengths of each explanation modality:
- Text excels in Comprehension Efficiency: Text-based explanations consistently achieved higher Comprehension Efficiency. This suggests that for scenarios requiring deep analytical reasoning, detailed information processing, or the ability to review facts at length, text remains the superior choice. Users can pause, re-read, and absorb complex data more effectively when presented in written form, leading to a higher return on their cognitive investment.
- Voice enhances Trust Calibration: Conversely, voice explanations demonstrated improved trust calibration (lower Trust Calibration Error). This indicates that spoken explanations help users align their trust levels more accurately with the AI's actual correctness. The inherent expressiveness of voice, often conveying nuances beyond mere words, can foster a stronger sense of engagement and perceived trustworthiness, leading to more appropriate reliance on the AI's outputs, even if the raw "information density" is lower than that of text.
- Analogy-Based Explanations Offer the Best Trade-Off: Perhaps the most compelling finding was that analogy-based delivery, regardless of modality, achieved the best overall trade-off when evaluating the composite score (Φ). This suggests that explaining complex AI decisions by relating them to familiar concepts or scenarios significantly aids in both comprehension and appropriate trust calibration. Analogies help bridge the gap between abstract AI reasoning and human intuition, making AI more accessible and understandable.
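For readers who want to see the shape of such a simulation, here is a toy sweep in the same spirit: synthetic SHAP-like attribution vectors are pushed through each modality-style configuration and the configurations are ranked by the composite score Φ. Every parameter value below (retention fractions, durations, trust gaps) is a placeholder invented for illustration, so the ranking it prints says nothing about the paper's actual results; only the structure of the sweep follows the framework.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed per-configuration parameters: fraction of the attribution surviving
# delivery, delivery duration, conceptual complexity, and a stand-in trust gap.
# All values are placeholders, not results or settings from the paper.
CONFIGS = {
    ("text", "brief"):     {"keep": 0.6, "duration": 20, "complexity": 2, "trust_gap": 0.30},
    ("text", "detailed"):  {"keep": 0.9, "duration": 60, "complexity": 4, "trust_gap": 0.25},
    ("text", "analogy"):   {"keep": 0.8, "duration": 35, "complexity": 2, "trust_gap": 0.18},
    ("voice", "brief"):    {"keep": 0.5, "duration": 25, "complexity": 2, "trust_gap": 0.22},
    ("voice", "detailed"): {"keep": 0.8, "duration": 75, "complexity": 4, "trust_gap": 0.20},
    ("voice", "analogy"):  {"keep": 0.7, "duration": 40, "complexity": 2, "trust_gap": 0.15},
}

def evaluate(cfg, n_trials=200, n_features=10):
    """Average the composite score Phi over synthetic SHAP-like attribution vectors."""
    scores = []
    for _ in range(n_trials):
        source = rng.normal(0.0, 1.0, size=n_features)       # synthetic attribution vector
        k = max(1, int(cfg["keep"] * n_features))
        top = np.argsort(np.abs(source))[-k:]
        mental = np.zeros(n_features)
        mental[top] = source[top]                             # user retains only top-k features
        i_m = abs(np.dot(source, mental)) / (np.linalg.norm(source) * np.linalg.norm(mental))
        ce = i_m / (cfg["duration"] + cfg["complexity"])      # CE = I_M / L
        scores.append(ce - cfg["trust_gap"])                  # Phi = CE - TCE
    return float(np.mean(scores))

ranking = sorted(CONFIGS, key=lambda c: evaluate(CONFIGS[c]), reverse=True)
print(ranking)
```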
These results offer a powerful roadmap for enterprises developing and deploying AI solutions, emphasizing that effective XAI is not a one-size-fits-all endeavor. For instance, a Self-Check Health Kiosk providing a basic health risk assessment might benefit from an analogy-based voice explanation to build quick trust, while a detailed diagnostic report would still require textual clarity.
Practical Implications for Enterprise AI Deployment
The findings of this framework have profound practical implications for businesses integrating AI into their operations, particularly for global enterprises. Designing truly transparent and trustworthy AI systems requires a strategic approach to how explanations are delivered:
- Tailored XAI Strategies: Companies should move beyond generic textual explanations and adopt a multimodal XAI strategy. For high-volume, quick-decision tasks (e.g., security alerts, simple anomaly detection), concise voice explanations might be ideal. For complex compliance reports, detailed textual breakdowns are essential. The flexibility to choose the right modality can significantly impact operational efficiency and risk management.
- Prioritizing Trust in High-Stakes Scenarios: In applications where user trust is paramount—such as autonomous systems, financial fraud detection, or critical infrastructure monitoring—integrating voice-based explanations could be a game-changer. The ability of voice to foster better trust calibration means users are more likely to rely on the AI appropriately, reducing both over-reliance and under-reliance, which can have significant safety and financial implications.
- Leveraging Analogy-Based Explanations: The power of analogies points to a need for AI designers to consider pedagogical approaches in their explanation strategies. Simplifying complex AI logic into relatable human concepts can dramatically improve how users understand and accept AI outputs. This could involve using examples, metaphors, or comparative scenarios to clarify why an AI made a particular decision.
- Enhanced Training and Adoption: For organizations deploying new AI tools, multimodal explanations can also enhance user training and adoption. By reducing cognitive load and improving comprehension efficiency, employees can learn to interact with and trust AI systems faster, leading to quicker ROI on AI investments and smoother digital transformation initiatives. This is particularly relevant for diverse global workforces, where language and cultural nuances might make analogy-based explanations even more potent.
Building Trustworthy AI: The Future of Multimodal Explainability
This information-theoretic framework represents a crucial step toward building more human-centric AI systems. It moves beyond merely what an AI explains to deeply consider how those explanations are communicated, acknowledging the complex cognitive processes of human users. As AI systems continue to evolve in sophistication and autonomy, the ability to effectively communicate their reasoning will be paramount for their successful integration into society and industry. The proposed framework provides a robust, reproducible foundation for evaluating and optimizing these critical human-AI communication channels.
Future research, building on this simulation, can involve empirical studies using real-world data and user groups, further validating and refining these insights. For global enterprises, adopting such a nuanced approach to XAI can significantly reduce operational costs, enhance security, and cultivate the deep trust required for widespread AI adoption. ARSA Technology is committed to delivering advanced AI and IoT solutions that prioritize transparency, trust, and practical business outcomes.
Discover how ARSA’s AI and IoT solutions can be tailored to meet your unique needs and enhance trust in your enterprise AI applications. Request a free consultation with our experts today.
Source: Rajhans, M., & Khawarey, V. (2026). An Information-Theoretic Framework for Comparing Voice and Text Explainability. 10th ACM International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence (ISMSI 2026). https://arxiv.org/abs/2602.07179