AI evaluation - Machine State | ARSA Technology

AI evaluation

A collection of 11 posts

Unmasking AI Bias: How Stylistic Manipulation Attacks Threaten LLM Judges

Explore BITE, a black-box adversarial framework that exploits stylistic biases in LLM judges to inflate scores, revealing critical vulnerabilities in AI evaluation and highlighting the need for robust defenses.

Ensuring Fair Play: Decontaminating Benchmarks for Multiple Large Language Models with JECS

LLM benchmarking

Discover how Joint Envelope Conformal Selection (JECS) provides a provable method to create reliable, decontaminated benchmarks for comparing multiple Large Language Models, enhancing trust in AI evaluation.

Advancing AI Evaluation: Fine-Grained Benchmarks for Foundation Models

Discover FLAME, a groundbreaking framework that generates comprehensive, fine-grained benchmarks for foundation models. Move beyond aggregate scores to pinpoint AI strengths and weaknesses, enhancing model selection and development for enterprises.

Unmasking True Reliability: The Challenge of Detecting Hallucinations in Enterprise LLMs

LLM hallucination detection

Discover how "PARALLAX" reveals critical flaws in LLM hallucination benchmarks and proposes new methods like DRIFT for genuine detection, crucial for trustworthy enterprise AI deployments.

Enhancing Generative AI: Cultivating Cultural Appropriateness with Community-Informed Evaluation

Discover how integrating community-informed rubrics can elevate generative AI's cultural representation. Learn about ethical AI development, the MLLM-as-a-judge approach, and the importance of lived-experience expertise in shaping AI evaluation for global enterprises.

AI Persona Prompting: Unmasking Hidden Performance and Benchmark Validity in Large Language Models

AI persona prompting

Explore how expert personas enhance AI performance, debunking misconceptions from flawed studies. Discover critical insights into benchmark validity and the future of enterprise AI evaluation.

The Hidden Mathematical Flaws Undermining Your AI Agent's Reliability

AI Agent Reliability

Explore the mathematical challenges behind AI agent failures, including compounding errors, non-determinism, and state management, and learn how to build resilient enterprise AI.

AI's Unwavering Judgment: How Automated Answer Matching Resists Manipulation

Discover how AI-powered answer matching ensures reliable evaluations for businesses, resisting common text manipulation tactics and offering a robust alternative to human review.

Enhancing Generative AI Evaluation: The Power of Efficient LLM-as-a-Judge Calibration for Businesses

Discover advanced statistical methods such as Prediction-Powered Inference (PPI) and efficient influence function (EIF) estimators for robust LLM-as-a-judge evaluation, ensuring accurate and efficient assessment of generative AI outputs for enterprises.

Beyond Harmful: The Crucial Need for Fine-Grained AI Evaluation in Enterprise LLMs

Discover why traditional AI evaluation overestimates Large Language Model (LLM) jailbreak success. Learn how ARSA Technology leverages fine-grained analysis for safer, more effective enterprise AI.

Unlocking Business Efficiency: The New Era of Practical AI Language Models for Enterprises

AI writing tools

Discover how a new evaluation framework, WRAVAL, highlights the power of Small Language Models for practical business applications such as writing assistance, with benefits for efficiency and data privacy.