Unpacking AI's Role in Peer Review: Do LLMs Favor Their Own?

Explore a comprehensive analysis of LLM use in scientific peer review, revealing insights into interaction effects, rating biases, and the critical role of human oversight.

The AI Influence on Scientific Peer Review

      The advent of large language models (LLMs) like ChatGPT has profoundly reshaped research workflows, with many researchers now incorporating these tools into their daily work. This integration extends beyond drafting papers; LLMs are increasingly becoming part of the crucial peer review process itself. As AI-generated content becomes more prevalent in both submissions and evaluations, a critical question arises: how do LLM-assisted papers and LLM-assisted reviews interact, and does this interaction introduce any systematic biases?

      Recent studies indicate a notable surge in LLM adoption. For instance, at major machine learning conferences, evidence suggests up to 17% of reviews show signs of LLM modification, and a similar percentage of computer science paper abstracts incorporate LLM-generated text (Sharma et al., 2026, citing Liang et al., 2026). This trend, particularly sharp in review writing, has raised concerns about the integrity of the peer review process and fostered distrust among some researchers. Understanding these complex dynamics is essential for developing policies that ensure fairness and quality in academic publishing.

Unpacking the "LLM Bias": Initial Observations and Deeper Insights

      Initial observations suggested what appeared to be a systematic interaction effect: LLM-assisted reviews seemed particularly favorable towards LLM-assisted papers, showing them more leniency compared to submissions with minimal LLM involvement. This raised immediate questions about potential "AI-to-AI" preferential treatment, where LLM-generated content might be evaluated more kindly by LLM-assisted reviewers. Such a bias, if genuine, could have profound implications for academic fairness and the scientific merit of accepted papers.

      However, a deeper, causally grounded analysis that controls for the inherent quality of the papers reveals a different story. The apparent kindness of LLM-assisted reviews towards LLM-assisted papers is largely a spurious effect. Instead, LLM-assisted reviews tend to be more lenient towards lower-quality papers in general. The reason for the perceived favoritism is that LLM-assisted papers are disproportionately found among submissions that are objectively weaker (Sharma et al., 2026). This finding highlights the critical importance of controlling for confounding factors in observational studies of AI behavior, ensuring that conclusions are based on genuine causal relationships rather than superficial correlations.
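This confounding pattern can be made concrete with a toy simulation. All numbers and functional forms below are illustrative assumptions, not values or models from the study: we simply posit that LLM-assisted reviews are lenient toward weak papers and that LLM-assisted papers skew weaker, then show that a naive comparison produces an apparent "AI-to-AI" preference that vanishes once paper quality is held fixed.

```python
import random

random.seed(0)

def llm_boost(quality):
    # Assumed leniency: LLM-assisted reviews inflate scores for weak papers,
    # and the inflation depends only on quality, not on who wrote the paper.
    return max(0.0, 5.0 - quality) * 0.5

# Simulate papers: LLM-assisted papers skew lower quality (assumption).
papers = []
for _ in range(20000):
    llm_paper = random.random() < 0.3
    quality = random.gauss(4.0 if llm_paper else 5.5, 1.0)
    papers.append((llm_paper, quality))

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

# Naive comparison: the average boost LLM reviews give each paper group
# looks like favoritism toward LLM-assisted papers.
naive_llm = mean(llm_boost(q) for p, q in papers if p)
naive_hum = mean(llm_boost(q) for p, q in papers if not p)
print(f"naive boost: LLM papers {naive_llm:.2f}, human papers {naive_hum:.2f}")

# Controlling for quality: within a narrow quality band, the boost is
# essentially identical for both groups, so the "favoritism" disappears.
band = [(p, q) for p, q in papers if 4.5 <= q < 5.0]
ctrl_llm = mean(llm_boost(q) for p, q in band if p)
ctrl_hum = mean(llm_boost(q) for p, q in band if not p)
print(f"within-band boost: LLM papers {ctrl_llm:.2f}, human papers {ctrl_hum:.2f}")
```

The point of the sketch is purely structural: because the assumed leniency depends only on quality, any group-level gap must come from the two groups' quality distributions, which is exactly the confound the stratified comparison removes.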

The Nuances of LLM Involvement: From Assistance to Full Generation

      The study further distinguishes between "LLM-assisted" reviews, where humans leverage LLMs as tools, and "fully LLM-generated" reviews, which involve minimal human oversight. This distinction proves crucial in understanding the impact of AI on review quality. Fully LLM-generated reviews exhibit severe "rating compression," meaning they struggle to differentiate between papers of varying quality, often assigning middling scores irrespective of a submission's true merit. This inherent flaw compromises their utility in a discerning peer review process.
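Rating compression can be pictured as scores shrinking toward a middling value, which flattens the spread of ratings across papers of genuinely different merit. A minimal sketch of that idea (the pull factors, the 5.5 midpoint, and the merit values are assumptions for illustration, not figures from the paper):

```python
import statistics

# Hypothetical "true merit" of eight submissions on a 1-10 scale.
true_merit = [2, 3, 4, 5, 6, 7, 8, 9]

def compressed_score(merit, pull):
    # Compression modeled as shrinking each score toward a middling 5.5;
    # pull=0 reproduces merit exactly, pull=1 collapses everything to 5.5.
    return 5.5 + (merit - 5.5) * (1 - pull)

# Assumed: fully LLM-generated reviews compress heavily, LLM-assisted
# reviews (with a human in the loop) compress far less.
spread_full = statistics.pstdev(compressed_score(m, pull=0.7) for m in true_merit)
spread_assisted = statistics.pstdev(compressed_score(m, pull=0.2) for m in true_merit)

print(f"score spread, fully LLM-generated: {spread_full:.2f}")
print(f"score spread, LLM-assisted:        {spread_assisted:.2f}")
```

A compressed score distribution carries less information for acceptance decisions: when every paper lands near 5.5, the ranking that peer review is supposed to produce is lost.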

      Crucially, the research found that when human reviewers actively incorporate LLMs into their process (i.e., LLM-assisted reviews), this rating compression is substantially reduced (Sharma et al., 2026). This suggests that human judgment acts as a vital moderator, guiding the LLM to produce more nuanced and discriminating feedback. This echoes ARSA Technology's philosophy that AI tools are most effective when deployed with robust human oversight, augmenting human capabilities rather than replacing them entirely. ARSA's AI & IoT solutions are designed to provide actionable insights and enhance decision-making, always with a human expert in the loop.

AI in Decision-Making: The Role of LLMs in Metareviews

      Beyond individual reviews, the study also examined the role of LLMs in metareviews – the overarching assessments that often lead to final acceptance or rejection decisions. Interestingly, LLM-assisted metareviews were found to be more likely to render "accept" decisions than purely human metareviews, even when reviewer scores were equivalent. This might initially suggest another form of leniency or bias introduced by LLMs at a higher decision-making level.

      However, a comparison with fully LLM-generated metareviews revealed that these tend to be harsher than their human-driven counterparts (Sharma et al., 2026). This counterintuitive finding implies that human meta-reviewers are not simply offloading their decision-making to the LLM. Instead, they are likely using LLMs as sophisticated tools to synthesize information, flag potential issues, or generate initial drafts, with the ultimate judgment and contextual nuance still provided by the human expert. This underscores a future where AI facilitates, rather than dictates, critical human decisions.

      The data shows a clear and accelerating trend in LLM adoption within the academic sphere. At conferences like ICLR, LLM use in reviews has almost doubled in a single year, significantly outstripping its use in paper writing. For example, in the latest ICLR editions, over 26% of reviews showed substantial LLM modification, compared to just over 3% of papers (Sharma et al., 2026). This disparity highlights a greater comfort or perceived utility for LLMs in crafting critiques than in generating original research content.

      The characteristics of LLM-assisted content are also noteworthy. Papers with significant LLM assistance often receive lower average scores, whereas LLM-assisted reviews tend to concentrate around intermediate ratings, avoiding very harsh or very high scores. This "hedging" behavior in LLM-assisted reviews suggests a cautious approach when AI is involved, perhaps reflecting the inherent uncertainty in AI-generated text or a deliberate strategy by human users. These trends point to a future where policies, such as ICML's new requirement for LLM use declarations, become essential for navigating the complex interactions of AI within the review ecosystem (Sharma et al., 2026, citing ICML Policy, 2026).

Building Responsible AI Frameworks for Academic Integrity

      The findings from this comprehensive analysis provide crucial input for developing robust policies that govern LLM use in peer review. While LLMs offer undeniable benefits in streamlining aspects of the research workflow, their integration into high-stakes decision-making processes like peer review demands careful consideration. The study demonstrates that unmoderated LLM use can lead to issues like rating compression, undermining the very purpose of peer review—to rigorously evaluate and improve scientific contributions.

      For enterprises and academic institutions alike, the lessons are clear: AI is a powerful enhancer, but human expertise, critical judgment, and ethical frameworks remain paramount. As an organization that has delivered AI and IoT solutions since 2018, ARSA Technology recognizes the need for responsible AI deployment. ARSA's AI API products, for instance, are designed for enterprise-grade accuracy and responsible deployment, emphasizing speed, security, and seamless integration while maintaining human oversight where it matters most.

Conclusion: Navigating the Future of AI in Academia

      The integration of LLMs into scientific peer review presents both unprecedented opportunities and significant challenges. This study provides valuable clarity, demonstrating that while LLM-assisted reviews may appear to favor LLM-assisted papers, this effect is largely spurious, driven by LLMs' general leniency towards weaker submissions. The research underscores the vital role of human intervention in mitigating the pitfalls of purely AI-generated evaluations and highlights how meta-reviewers skillfully leverage LLMs as tools, not decision-makers.

      As the scientific community continues to embrace AI, understanding these nuanced human-AI interaction effects is crucial for safeguarding the integrity and effectiveness of academic peer review. The insights gained pave the way for informed policies that harness AI's potential while upholding the highest standards of research quality.

      Source: Sharma, V., Joachims, T., & Dean, S. (2026). Do LLMs Favor LLMs? Quantifying Interaction Effects in Peer Review. arXiv preprint arXiv:2601.20920.

      Discover how ARSA Technology builds intelligent, ethically-driven AI solutions that enhance human capabilities across various industries. To explore our offerings and discuss your specific needs, contact ARSA today for a free consultation.