AI's Hidden Vulnerability: The Persistent Threat of Package Hallucinations in Code-Generating LLMs

Explore how frontier AI models still hallucinate non-existent software packages, creating "slopsquatting" supply chain risks. Understand the latest findings, universal attack surfaces, and enterprise mitigation strategies.


The Shrinking Range of LLM Hallucinations, The Enduring Threat

      Code-generating Large Language Models (LLMs) have become everyday tools for developers worldwide. These assistants write code snippets, suggest functions, and recommend software packages for various tasks. However, a critical security weakness persists: LLMs regularly "hallucinate" package names, suggesting software packages that do not exist on popular registries like PyPI for Python or npm for JavaScript. An attacker can register a malicious package under one of these names, an attack termed "slopsquatting," which creates a silent yet potent supply chain attack surface that could compromise entire software projects. While recent models have narrowed the range of hallucination rates from one model to the next, the underlying threat remains.

Understanding the Slopsquatting Threat

      Modern software development relies heavily on centralized package registries, which host millions of libraries readily installable with a single command. PyPI and npm are prime examples, facilitating rapid development but also introducing potential vulnerabilities if not properly managed. The introduction of code-generating LLMs has added a new layer of complexity to this ecosystem. When a developer prompts an LLM for code, the model might include `pip install` or `npm install` directives, or `import` statements referencing specific packages. Should a developer trust and install a hallucinated package name, an attacker who has registered a malicious package under that very name can successfully inject malware into the supply chain. This occurs without direct interaction with the developer or the LLM provider, making it a particularly insidious form of attack.

      The foundational research by Spracklen et al. (USENIX Security '25) first characterized this risk in 2024, analyzing a cohort of 16 LLMs. Their study revealed concerning hallucination rates, with commercial models averaging 5.2% and open-source models averaging 21.7% non-existent package suggestions. This work identified over 200,000 unique hallucinated names, highlighting a significant attack vector that adversaries could exploit through "slopsquatting": registering malicious packages using names that LLMs inadvertently promote.

Frontier Models: Improved Consistency, Persistent Risk

      Eighteen months after Spracklen's initial findings, a new study led by Aleksandr Churilov (source: The Range Shrinks, the Threat Remains) re-evaluated this phenomenon across five frontier code-capable LLMs released between October 2025 and March 2026: Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, and DeepSeek V3.2. These newer models boast substantially expanded training data and enhanced safety protocols. Replicating the original methodology, the study generated nearly 200,000 paired Python and JavaScript prompts, validating every suggested package against comprehensive PyPI and npm master lists.
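
      The validation step described above can be sketched in a few lines. This is a minimal illustration, not the study's actual tooling: the function names, the regular expression, and the tiny `master_list` are all assumptions made for the example, and the extractor only handles simple `pip install` lines (it does not understand flags that take arguments, such as `-r requirements.txt`).

```python
import re

# Hypothetical pattern for this sketch: capture everything after "pip install".
PIP_INSTALL = re.compile(r"pip3?\s+install\s+([^\n]+)")


def normalize(name: str) -> str:
    """PEP 503 normalization: lowercase, runs of '-', '_', '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def extract_pip_packages(llm_output: str) -> set[str]:
    """Collect normalized package names from `pip install ...` lines."""
    names = set()
    for match in PIP_INSTALL.finditer(llm_output):
        for token in match.group(1).split():
            token = token.strip("`'\"")
            if not token or token.startswith("-"):
                continue  # skip flags such as --upgrade
            # Drop version specifiers and extras: "requests>=2.0" -> "requests"
            name = re.split(r"[=<>!~\[;]", token, maxsplit=1)[0]
            if name:
                names.add(normalize(name))
    return names


def find_hallucinated(llm_output: str, master_list: set[str]) -> set[str]:
    """Return suggested names that are absent from the registry master list."""
    known = {normalize(n) for n in master_list}
    return extract_pip_packages(llm_output) - known


# Toy data: the master list and the model output are made up for illustration.
master_list = {"requests", "numpy"}
llm_output = "Run:\npip install requests>=2.0 Fast_JSON-utils --upgrade\n"
flagged = find_hallucinated(llm_output, master_list)
```

      In this toy run, `flagged` contains only the name that is missing from the master list. A real pipeline would also need to parse `import` statements and `npm install` lines, and would load the master list from a registry snapshot.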

      The key finding is a sharp compression of the spread in hallucination rates between models. Spracklen observed a 16.5 percentage point gap (5.2% to 21.7%), whereas the new study measured a much tighter range of 4.62% (Claude Haiku 4.5) to 6.10% (GPT-5.4-mini): a roughly eleven-fold narrowing to just 1.48 percentage points. Frontier models now hallucinate at similar rates, but none of them is close to zero. A rate of around 5% remains substantial at the scale of AI code generation, so the threat of slopsquatting persists.
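
      The arithmetic behind the "eleven-fold" figure is easy to check. The snippet below uses only the rates quoted above; the variable names are chosen for this example.

```python
# Lowest and highest hallucination rates, in percent, as quoted in the article.
spracklen_low, spracklen_high = 5.2, 21.7   # 2024 baseline
frontier_low, frontier_high = 4.62, 6.10    # 2026 frontier cohort

# Round to two decimals to avoid floating-point noise in the differences.
spread_2024 = round(spracklen_high - spracklen_low, 2)  # 16.5 points
spread_2026 = round(frontier_high - frontier_low, 2)    # 1.48 points

# Ratio of the two spreads: how many times narrower the new range is.
narrowing = spread_2024 / spread_2026                   # about 11.15
```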

Unmasking a Universal Attack Surface

      Beyond simply replicating the previous findings, the 2026 study unveiled a critical new dimension to the slopsquatting threat: a "universal-hallucination set." Researchers identified 127 package names (109 on PyPI and 18 on npm) that were identically hallucinated by all five of the evaluated frontier LLMs. This set represents a model-agnostic supply-chain attack surface, meaning that regardless of which of these advanced LLMs a developer uses, there's a shared risk of being prompted to install these specific non-existent packages.
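
      Conceptually, a universal-hallucination set is the intersection of each model's set of hallucinated names. The sketch below shows that operation on made-up data; the model labels and package names are placeholders, not values from the study, and the function name is an assumption for this example.

```python
def universal_hallucinations(per_model: dict[str, set[str]]) -> set[str]:
    """Names hallucinated by every model: the intersection of all sets."""
    if not per_model:
        return set()
    return set.intersection(*per_model.values())


# Placeholder data for illustration only.
per_model = {
    "model_a": {"pkg-x", "pkg-y", "pkg-z"},
    "model_b": {"pkg-y", "pkg-z"},
    "model_c": {"pkg-w", "pkg-y", "pkg-z"},
}
universal = universal_hallucinations(per_model)
```

      Here `universal` holds the two names that appear in all three sets. The same computation cannot be done from a single model's output, which is why the set only becomes visible in a cross-model study.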

      This discovery is particularly alarming because it signifies a collective blind spot in leading AI models, potentially stemming from shared patterns in their vast training datasets or similar emergent behaviors. Such a universal attack surface cannot be detected by studies focusing on single models, underscoring the importance of comprehensive cross-model analysis. The implications for enterprise security are profound; it suggests a standardized set of phantom packages that could be exploited, demanding a coordinated disclosure protocol with package registry security teams to preemptively mitigate the risk.

Unexpected Asymmetries and Model Behavior

      The study also highlighted several interesting shifts in LLM behavior compared to the 2024 baseline:

  • Python-over-JavaScript Asymmetry: Across all five frontier models, a Python-over-JavaScript hallucination asymmetry was observed. This inverts Spracklen's 2024 finding, where JavaScript packages were found to be more susceptible to hallucination. This shift could indicate changes in training data emphasis or architectural improvements.
  • Anthropic Family Inversion: Within Anthropic's own model family, Claude Haiku 4.5 (at 4.62%) was found to hallucinate measurably less than Claude Sonnet 4.6 (at 5.41%). This contradicts the typical pattern where smaller models within a family tend to exhibit higher hallucination rates. This unexpected inversion suggests specific optimization or fine-tuning strategies unique to Haiku's development.
  • Shared Training Data Indicators: The study noted a significant Jaccard-similarity peak between DeepSeek V3.2 and GPT-5.4-mini (J = 0.343) in their hallucinated package names. This similarity is highly suggestive of potential shared training-data origins between these otherwise distinct models, or perhaps a convergence in their learning processes when generating code for these registries.
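
      For readers unfamiliar with the metric in the last bullet, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below computes it for every pair of models on placeholder data; the helper names and the sets are assumptions for illustration, and returning 0.0 for two empty sets is a convention chosen here.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """|a & b| / |a | b|, with 0.0 returned when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pairwise_jaccard(per_model: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Jaccard similarity for every unordered pair of models."""
    return {
        (m1, m2): jaccard(per_model[m1], per_model[m2])
        for m1, m2 in combinations(sorted(per_model), 2)
    }


# Placeholder hallucination sets, not data from the study.
per_model = {
    "model_a": {"a", "b", "c"},
    "model_b": {"b", "c", "d"},
    "model_c": {"x"},
}
scores = pairwise_jaccard(per_model)
```

      In the toy data, `model_a` and `model_b` share two of four distinct names, giving a similarity of 0.5, while either of them compared with `model_c` scores 0.0.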


Mitigating the Risk in Enterprise AI Deployments

      The persistence of package hallucinations, even with improved model consistency, underscores the critical need for robust security measures in any enterprise adopting AI for code generation. While LLMs offer immense productivity benefits, relying solely on their suggestions without validation introduces unacceptable risks. Organizations must implement strict vetting processes for all third-party dependencies, whether human-suggested or AI-generated.

      For businesses leveraging AI and IoT solutions, especially in sensitive or regulated environments, the choice of deployment architecture becomes paramount. On-premise AI solutions, like those provided by ARSA Technology, offer enhanced control over data flow and processing, which can be crucial in mitigating supply chain risks. Products such as ARSA's AI Video Analytics software or the ARSA AI Box Series are designed for on-site processing, ensuring that sensitive data and critical operational intelligence remain within the enterprise’s secure network, reducing external dependencies and potential attack surfaces. Furthermore, ARSA's commitment to delivering production-ready systems, a philosophy we have cultivated since 2018, emphasizes accuracy, scalability, and operational reliability – crucial elements in countering novel threats like slopsquatting.

      Automated tools for dependency analysis, static code analysis, and dynamic application security testing (DAST) can help identify and flag suspicious package names or unverified dependencies. Developers should also be trained to cross-reference any AI-generated package suggestions with official registry documentation before installation. The future of secure AI-assisted coding will depend on a multi-layered approach combining advanced AI safety research, robust developer practices, and secure deployment models.
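
      One simple form of automated vetting is a gate that compares a project's declared dependencies against an internally approved allowlist before anything is installed. The sketch below is an assumption-laden illustration rather than a reference to any specific tool: the function names and the allowlist are hypothetical, and the parser handles only plain `name` and `name<specifier>` lines in a requirements file.

```python
import re


def normalize(name: str) -> str:
    """PEP 503 normalization: lowercase, runs of '-', '_', '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_requirements(text: str) -> list[str]:
    """Extract normalized package names from simple requirements lines."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line or line.startswith("-"):
            continue                           # skip blanks and option lines
        # Cut at the first version specifier, extra, marker, or space.
        name = re.split(r"[=<>!~\[;\s@]", line, maxsplit=1)[0]
        if name:
            names.append(normalize(name))
    return names


def unvetted(text: str, allowlist: set[str]) -> list[str]:
    """Return declared dependencies that are not on the approved allowlist."""
    allowed = {normalize(n) for n in allowlist}
    return [n for n in parse_requirements(text) if n not in allowed]


# Hypothetical allowlist and requirements content for illustration.
allowlist = {"requests", "NumPy"}
requirements = """\
requests==2.31.0
# pinned by the platform team
numpy>=1.26
fast-json-utils
-r other.txt
"""
blocked = unvetted(requirements, allowlist)
```

      In this example `blocked` lists the one dependency that nobody has approved. A check like this could run in continuous integration and fail the build when the list is non-empty, leaving a human to confirm that the package exists on the official registry and is the one intended.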

      To learn more about secure, reliable AI solutions and how they can safeguard your operations against emerging threats, feel free to contact ARSA for a free consultation.

      Source:

      Churilov, Aleksandr. (2026). The Range Shrinks, the Threat Remains: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort. arXiv preprint arXiv:2605.17062. Available at: https://arxiv.org/abs/2605.17062