AI-Driven Precision: Revolutionizing Cybersecurity Risk Assessments with Semantic Labeling

Explore how AI and semantic labeling transform Third-Party Risk Assessment (TPRA) questionnaires, improving accuracy, reducing costs, and enhancing cybersecurity posture.

The TPRA Challenge in the Digital Age

      In today's interconnected business landscape, organizations increasingly rely on a vast ecosystem of third-party providers, from cloud platforms to Software-as-a-Service (SaaS) vendors. This reliance, while enabling agility and innovation, introduces complex cybersecurity risks. To mitigate these threats and ensure regulatory adherence, Third-Party Risk Assessment (TPRA) has become a cornerstone of effective cybersecurity risk management. TPRA involves evaluating a supplier's security posture against established standards such as ISO/IEC 27001 and those published by NIST. The process typically entails sending out structured questionnaires, evaluating responses, and making informed decisions on vendor engagement.

      While much effort is dedicated to analyzing supplier responses, a significant bottleneck often emerges much earlier: the manual and time-consuming task of selecting the most relevant questions for each assessment. Large enterprises accumulate extensive repositories of compliance questions over time, often unstructured and requiring laborious manual curation. This results in repetitive, inefficient, and difficult-to-scale question selection, hindering the overall speed and accuracy of risk assessments.

Beyond Keywords: The Need for Semantic Understanding

      Current automated approaches to question selection typically frame it as a retrieval problem, relying on keyword matching or surface-level textual similarity. While these methods can identify topically related questions, they frequently fall short in capturing the nuanced "semantic meaning" of an assessment. This deeper meaning encompasses both the underlying control domain (e.g., whether a question relates to "access control," "incident response," or "data encryption") and the specific assessment scope (e.g., whether it aims to verify the "existence" of a control or its "enforcement on critical systems").
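
      The difference between keyword relevance and semantic meaning can be made concrete with a small sketch. The `SemanticLabel` class, the two sample questions, and the word-overlap scorer below are illustrative assumptions, not part of the cited work; they only show that two questions can look identical to a keyword matcher while differing in assessment scope.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticLabel:
    """Hypothetical label: a control domain plus an assessment scope."""
    control_domain: str
    assessment_scope: str


def keyword_overlap(query: str, text: str) -> int:
    """Count shared lowercase words, a crude stand-in for keyword matching."""
    words = lambda s: set(re.findall(r"[a-z]+", s.lower()))
    return len(words(query) & words(text))


# Two made-up questions in the same control domain but with different scopes.
questions = {
    "Do you have an access control policy?":
        SemanticLabel("access control", "existence"),
    "Is the access control policy enforced on critical systems?":
        SemanticLabel("access control", "enforcement on critical systems"),
}

query_text = "access control policy"
query_label = SemanticLabel("access control", "enforcement on critical systems")

# Keyword matching scores both questions the same (3 shared words each).
keyword_scores = {q: keyword_overlap(query_text, q) for q in questions}

# Matching on the explicit label keeps only the question with the right scope.
label_matches = [q for q, label in questions.items() if label == query_label]
```

      In this toy case the keyword scorer cannot separate the two questions, while the label comparison returns only the enforcement question.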

      Without this explicit semantic understanding, retrieval systems often yield overly generic questions that, despite being topically relevant, do not align with the precise objectives of a particular assessment. This misalignment can lead to incomplete risk evaluations, wasted effort, and ultimately, an inaccurate picture of third-party cybersecurity risk. The goal is to move beyond simple keyword relevance to a more intelligent system that understands the intent behind each question and the overall assessment.

Traditional Approaches and Their Shortcomings

      Existing Third-Party Risk Management (TPRM) platforms offer curated question libraries and mappings to compliance frameworks, which are valuable for governance and auditability. However, these systems often depend on manually configured content and static libraries, leading to redundancy and inconsistent phrasing over time. Adapting these questionnaires to specific organizational needs or evolving threat landscapes still demands considerable manual effort, highlighting the need for more automated yet interpretable solutions.

      On the technical front, older lexical retrieval methods struggle with variations in phrasing. Newer neural retrieval methods based on dense vector embeddings can find semantically similar questions, yet they still lack an explicit model of assessment intent or scope. In text labeling, unsupervised methods such as topic modeling or basic clustering struggle with short compliance texts or fail to represent overlapping concepts. Approaches that use Large Language Models (LLMs) to generate human-readable labels are powerful, but they are costly, sensitive to prompt design, and variable across runs, which makes them difficult to scale to the massive and continuously evolving question repositories that large organizations maintain (Nour Eldin et al., 2026).

Introducing Semi-Supervised Semantic Labeling (SSSL)

      To address these limitations, researchers have explored a hybrid approach called Semi-Supervised Semantic Labeling (SSSL). This framework aims to enrich compliance repositories with meaningful semantic labels automatically, without incurring the high costs and complexities of solely relying on LLMs for every question. SSSL focuses on efficiently generating consistent and discriminative labels that accurately reflect a question's control domain and assessment scope. By combining the power of LLMs with more traditional machine learning techniques, SSSL offers a pathway to scalable and cost-effective semantic labeling.

      The core idea is to leverage LLMs for their superior understanding and generation capabilities on a smaller, carefully selected subset of data, and then to efficiently propagate these high-quality labels to the vast majority of unlabeled questions. This hybrid strategy significantly reduces the computational expense and manual oversight typically associated with large-scale LLM deployments, making advanced semantic understanding an accessible reality for organizations managing extensive cybersecurity assessment needs.

The SSSL Framework: How AI Automates Labeling

      The proposed SSSL pipeline is a multi-stage framework designed to transform unlabeled TPRA compliance questions into a semantically rich, labeled repository, enabling more precise question retrieval. The process can be broken down into two main phases: Repository Construction and Query-Time Inference.

      The Repository Construction phase begins by processing unlabeled TPRA questions. These questions are converted into "semantic embeddings" – mathematical representations where questions with similar meanings are positioned closer together in a multi-dimensional space. Using a technique called possibilistic clustering, these questions are then grouped into overlapping clusters based on their semantic similarity.
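
      As a rough illustration of overlapping clusters, the sketch below computes possibilistic typicalities with the standard possibilistic c-means formula, u = 1 / (1 + (d² / η)^(1/(m-1))). The article does not say which possibilistic algorithm the authors use, so the fixed cluster centers, the `eta` scale, the 0.5 threshold, and the two-dimensional toy "embeddings" are all assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def typicality(point, center, eta=1.0, m=2.0):
    """Possibilistic c-means typicality of `point` for one cluster."""
    return 1.0 / (1.0 + (sq_dist(point, center) / eta) ** (1.0 / (m - 1.0)))


def overlapping_clusters(embeddings, centers, eta=1.0, threshold=0.5):
    """Assign each question to every cluster whose typicality passes the threshold.

    Typicalities are computed per cluster and need not sum to 1, so a
    question may land in several clusters, or in none.
    """
    return {
        qid: [cid for cid, c in centers.items()
              if typicality(vec, c, eta) >= threshold]
        for qid, vec in embeddings.items()
    }


# Toy 2-D vectors standing in for real sentence embeddings (assumed data).
embeddings = {"q1": (0.0, 0.0), "q2": (1.0, 0.0), "q3": (2.0, 0.0)}
centers = {"A": (0.0, 0.0), "B": (2.0, 0.0)}  # assumed fixed, not learned

membership = overlapping_clusters(embeddings, centers)
# q2 sits halfway between the two centers and belongs to both clusters.
```

      A full implementation would also learn the centers and the per-cluster scale iteratively; the sketch fixes them to keep the overlap behavior visible.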

      Next, a Large Language Model (LLM) is used to assign initial "cluster-level semantic labels" to a small, representative subset of these clusters. These LLM-generated labels are designed to be broadly applicable and easily interpretable, forming a "seed label set." This strategic use of the LLM on a limited, high-impact subset minimizes its usage and associated costs. The labels are then aggregated to create multi-label annotations for each question within those clusters.
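
      The aggregation step can be pictured as a union of cluster labels per question. In the sketch below, `cluster_labels` is a hand-written stand-in for what an LLM might return for each seed cluster, and the cluster IDs and question IDs are hypothetical; the article does not describe the exact aggregation rule, so a plain set union is assumed.

```python
def aggregate_labels(membership, cluster_labels):
    """Build multi-label annotations by uniting the labels of every
    cluster a question belongs to (assumed aggregation rule)."""
    annotations = {}
    for qid, cluster_ids in membership.items():
        labels = set()
        for cid in cluster_ids:
            # Clusters outside the seed subset have no LLM label yet.
            labels |= cluster_labels.get(cid, set())
        annotations[qid] = labels
    return annotations


# Hand-written placeholders for LLM-generated cluster-level labels.
cluster_labels = {
    "A": {"access control"},
    "B": {"enforcement on critical systems"},
}

# Overlapping cluster membership for three hypothetical questions.
membership = {"q1": ["A"], "q2": ["A", "B"], "q3": ["C"]}

seed_annotations = aggregate_labels(membership, cluster_labels)
# q2 inherits both labels; q3 stays unlabeled because cluster "C"
# was not part of the LLM-labeled subset.
```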

      For the remaining, larger portion of unlabeled questions, a k-Nearest Neighbors (kNN) classifier is employed in the Prediction Phase. This algorithm propagates the semantic labels from the small, LLM-labeled seed set to the remaining questions based on their proximity in the embedding space. This step is crucial for scalability: it generalizes labels from a small labeled subset to large repositories without repeated LLM inference for every question, substantially reducing LLM usage and cost. Enterprises seeking to leverage advanced AI for complex data analysis can explore similar approaches with solutions like ARSA AI API, which offers modular AI capabilities for various applications.
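
      A minimal sketch of kNN label propagation follows. The cosine similarity measure, the value of `k`, the `min_share` voting threshold, and the toy vectors are assumptions chosen for illustration; the article states only that labels are propagated by proximity in the embedding space.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_propagate(vec, seeds, k=3, min_share=0.5):
    """Predict a label set for `vec` from its k most similar seed questions.

    `seeds` is a list of (embedding, label_set) pairs. A label is kept
    when at least `min_share` of the k neighbors carry it.
    """
    neighbors = sorted(seeds, key=lambda s: cosine(vec, s[0]), reverse=True)[:k]
    votes = Counter(label for _, labels in neighbors for label in labels)
    return {label for label, n in votes.items() if n / len(neighbors) >= min_share}


# Toy seed set standing in for the LLM-labeled questions (assumed data).
seeds = [
    ((1.0, 0.0), {"access control"}),
    ((0.9, 0.1), {"access control", "enforcement"}),
    ((0.8, 0.2), {"access control", "enforcement"}),
    ((0.0, 1.0), {"incident response"}),
]

# An unlabeled question whose embedding lies near the first three seeds.
predicted = knn_propagate((0.85, 0.15), seeds)
```

      Here the three nearest seeds all carry "access control" and two of them carry "enforcement", so both labels are propagated, and no LLM call is needed for the new question.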

      Finally, the Label-Based Retrieval Phase utilizes this newly labeled repository. When a user submits a query for a TPRA assessment, the system can now perform retrieval in the semantic label space, aligning selected questions more accurately with the specific control domains and assessment scopes outlined in the user's request. This ensures that the retrieved questions are not just topically related but semantically aligned with the intended assessment objective. This transformation of raw, unstructured data into actionable intelligence is a core capability that ARSA provides through its AI Video Analytics solutions, helping businesses derive strategic insights from their data.
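
      Retrieval in the label space can be sketched as ranking questions by how closely their label sets match the labels of the query. The Jaccard score, the `top_n` cutoff, and the sample repository below are assumptions for illustration, and the sketch presumes the user's request has already been mapped to a set of labels; the article does not specify how either step is done.

```python
def jaccard(a, b):
    """Jaccard similarity of two label sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query_labels, repository, top_n=2):
    """Return the IDs of the top_n questions whose labels overlap the query."""
    scored = [(jaccard(query_labels, labels), qid)
              for qid, labels in repository.items()]
    # Highest score first; drop questions with no label in common.
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [qid for _, qid in scored[:top_n]]


# Hypothetical labeled repository.
repository = {
    "q1": {"access control", "existence"},
    "q2": {"access control", "enforcement on critical systems"},
    "q3": {"incident response", "existence"},
}

results = retrieve({"access control", "enforcement on critical systems"}, repository)
# q2 matches both labels and ranks first; q1 shares only the control domain.
```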

Real-World Impact and Future Implications

      The findings from this research indicate that semantic labeling significantly improves the alignment of retrieved questions with intended control domains and assessment scopes, provided the labels are precise and consistent. Furthermore, the Semi-Supervised Semantic Labeling (SSSL) framework proves effective in generalizing labels across large repositories, with up to 40% lower labeling costs and a 33% reduction in runtime while maintaining comparable generalization accuracy.

      For global enterprises, these advancements translate into tangible business benefits:

  • Streamlined Operations: Automating the complex process of question selection drastically reduces manual effort and speeds up TPRA workflows.
  • Enhanced Cybersecurity Posture: More precise and relevant questionnaires lead to thorough assessments, enabling organizations to better identify and address actual risks posed by third-party vendors.
  • Cost Efficiency: The hybrid SSSL approach minimizes reliance on expensive LLM inference, offering a cost-effective solution for managing vast and dynamic compliance question sets.
  • Scalability: Organizations can efficiently manage ever-growing repositories of cybersecurity questions, adapting quickly to new standards and emerging threats.
  • Improved Compliance: By ensuring assessment questions are accurately aligned with regulatory frameworks, companies can bolster their compliance efforts and reduce audit complexities.


      Implementing such advanced AI and IoT solutions to transform raw data into strategic business intelligence is a focus for ARSA Technology. Drawing on experience dating back to 2018, we partner with enterprises across various industries to deploy practical, precise, and adaptive AI solutions that optimize operations, reduce risks, and enhance security, mirroring the transformational impact of semantic labeling in cybersecurity.

Conclusion

      The landscape of third-party cybersecurity risk assessment is rapidly evolving, demanding more intelligent and efficient approaches. Semantic labeling, particularly through innovative frameworks like Semi-Supervised Semantic Labeling (SSSL), offers a powerful solution to the long-standing challenge of selecting relevant questions from massive repositories. By combining the strengths of advanced AI with smart automation, organizations can move beyond surface-level analysis to achieve a deeper, more accurate understanding of their cybersecurity vulnerabilities. This precision not only enhances security and compliance but also delivers significant operational and cost benefits, making it an indispensable tool for future-proofing digital enterprises.

      To discover how ARSA Technology can help your organization leverage cutting-edge AI and IoT solutions for enhanced operational efficiency and security, we invite you to explore our comprehensive solutions and contact ARSA for a free consultation.

      Source: Nour Eldin, A., Sellami, M., & Gaaloul, W. (2026). Exploring Semantic Labeling Strategies for Third-Party Cybersecurity Risk Assessment Questionnaires. arXiv preprint arXiv:2602.10149. Available at: https://arxiv.org/abs/2602.10149