Safeguarding Exam Integrity: The AI Challenge in High-Stakes Assessments
Explore how AI-generated exam questions pose new security risks for certification boards. ARSA Technology analyzes semantic similarity between AI outputs from public and proprietary prompts, offering insights for robust assessment strategies.
The Rise of AI in Content Generation: Efficiency vs. Security
Large Language Models (LLMs) have emerged as revolutionary tools for various content creation tasks, including the generation of domain-specific multiple-choice questions (MCQs). For fields like medical certification, which rely on extensive and frequently updated question banks, these AI capabilities promise substantial efficiency gains in test item development. By automating parts of the question-writing process, organizations can potentially reduce the time and resources traditionally required to create high-quality assessment materials. This innovation can help keep pace with rapidly evolving knowledge domains and reduce the burden on human subject matter experts.
However, the very accessibility that fuels this innovation also introduces significant new vulnerabilities, especially for high-stakes examinations where integrity is paramount. The ease with which anyone can now leverage LLMs, even publicly available ones, to generate practice questions raises concerns about the potential for content that closely mimics actual exam items. This isn't just about traditional security threats like question harvesting; it's about algorithmic approximation at scale. LLMs, when prompted with openly available information, can reproduce the semantic and stylistic characteristics of proprietary exam content, potentially undermining the validity of assessments.
Understanding the Dual Nature of AI Prompting: Naïve vs. Guided
The core challenge lies in determining whether LLMs, when trained or prompted solely with public information (such as official exam blueprints or competency frameworks), can produce questions that are semantically indistinguishable from those created with access to proprietary, internal resources. If such a convergence occurs, the vital boundary between confidential test content and publicly generated study material could erode, leading to widespread item overexposure and jeopardizing exam validity.
To investigate this, a study employed two distinct item-generation strategies. The "naïve strategy" involved prompting LLMs with only publicly available documents, like Entrustable Professional Activity (EPA) outlines that describe clinical activities for specific professions. In contrast, the "guided strategy" provided the LLM (specifically GPT-4o) with additional proprietary resources, including detailed internal blueprints, specific item-writing guidelines, and exemplary operational items. This dual approach allowed for a direct comparison of the distinctiveness imparted by proprietary information versus publicly accessible data. Businesses developing AI solutions or looking to integrate AI into their operations must carefully consider the source and sensitivity of the data used for training and prompting, especially when dealing with critical assets. ARSA offers AI API products that can be tailored to incorporate specific data guidelines while maintaining security protocols.
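To make the two strategies concrete, here is a minimal sketch of how they might be wired up against GPT-4o via the OpenAI Python SDK. Everything in it (the prompt wording, the placeholder documents, and the generate_mcq helper) is a hypothetical illustration, not the study's actual pipeline or prompts.

```python
# Illustrative sketch only: prompts, variables, and parameters below are
# hypothetical stand-ins, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_mcq(context: str) -> str:
    """Ask GPT-4o for one board-style multiple-choice question."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You write single-best-answer clinical MCQs."},
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content

# Naive strategy: publicly available information only (e.g., an EPA outline).
public_epa_outline = "EPA 4: Manage a patient with community-acquired pneumonia..."
naive_item = generate_mcq(public_epa_outline)

# Guided strategy: the same public outline plus proprietary resources.
internal_resources = "Internal blueprint, item-writing guidelines, exemplar items..."
guided_item = generate_mcq(public_epa_outline + "\n\n" + internal_resources)
```

The only difference between the two calls is the context supplied; the degree to which that difference survives in the generated items is exactly what the semantic analysis below measures.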
Measuring Semantic Similarity: The Science Behind AI Comparison
To quantify the degree of overlap between questions generated by the "naïve" and "guided" strategies, the study used advanced natural language processing (NLP) techniques. Question stems and response options from the generated MCQs were first transformed into "embeddings." Embeddings are numerical representations of text, where words and phrases with similar meanings are located closer together in a multi-dimensional space. Think of it like a sophisticated indexing system that captures the semantic essence of the text.
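As a rough illustration, the sketch below converts a question stem into one such embedding by mean-pooling the token vectors of a public biomedical BERT checkpoint (one of the encoders discussed next). The model ID and pooling scheme are assumptions; the study's exact pipeline may differ.

```python
# Minimal embedding sketch, assuming the Hugging Face `transformers` library.
import torch
from transformers import AutoModel, AutoTokenizer

# Model ID is an assumption: the public PubMedBERT checkpoint on Hugging Face.
MODEL_ID = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def embed(text: str) -> torch.Tensor:
    """Map a question stem (or option) to a fixed-length vector by mean pooling."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, num_tokens, 768)
    return hidden.mean(dim=1).squeeze(0)             # (768,)

vec = embed("Which antihypertensive agent is first-line in pregnancy?")
print(vec.shape)  # torch.Size([768])
```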
The study employed PubMedBERT and BioBERT, two specialized biomedical encoders. These are AI models optimized for medical vocabulary and clinical context, ensuring that the semantic analysis was highly relevant to the questions being generated. Once the text was converted into these numerical embeddings, "cosine similarity coefficients" were calculated. Cosine similarity measures the cosine of the angle between two vectors (here, the embeddings): a value close to 1 indicates high similarity (the vectors point in almost the same direction), while a value close to 0 indicates low similarity (the vectors are nearly orthogonal). This allowed the researchers to objectively quantify how semantically alike the questions were, both within and across the prompting strategies. Such rigorous data analysis is equally crucial when evaluating the performance and security of AI video analytics and other AI systems for business applications.
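Continuing the sketch, the snippet below computes the cosine similarity, a·b / (|a||b|), for every naïve/guided pair at once. The example stems are hypothetical placeholders, and the embed() helper comes from the previous snippet.

```python
# Cross-strategy similarity matrix; reuses embed() from the snippet above.
import torch
import torch.nn.functional as F

naive_items = [
    "A 68-year-old presents with fever, dry cough, and bilateral infiltrates...",
]
guided_items = [
    "An elderly patient develops hypoxia with diffuse ground-glass opacities...",
]

naive_vecs = torch.stack([embed(q) for q in naive_items])    # (N, 768)
guided_vecs = torch.stack([embed(q) for q in guided_items])  # (M, 768)

# cos(a, b) = a.b / (|a||b|), broadcast over every naive/guided pair.
sim_matrix = F.cosine_similarity(
    naive_vecs.unsqueeze(1), guided_vecs.unsqueeze(0), dim=-1
)  # shape (N, M); values near 1.0 indicate semantic convergence
print(sim_matrix)
```

Averaging this matrix over all pairs gives a cross-strategy similarity figure; the same computation restricted to items from a single strategy gives the within-strategy figures quoted below.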
Key Findings: Where AI-Generated Questions Converge
The results revealed several critical insights into how LLMs behave when generating exam questions. Overall, there was high internal consistency within each prompting strategy, meaning questions generated by the same method tended to be quite similar to each other. For example, BioBERT measured within-strategy similarities of 0.77 for naïve items and 0.71 for guided items, while PubMedBERT measured 0.87 and 0.83, respectively. These high scores suggest that LLMs consistently adhere to the style and content implied by their initial prompts.
However, the cross-strategy similarity (comparing naïve to guided items) was lower overall (BioBERT: 0.56; PubMedBERT: 0.70). This suggests that proprietary resources do impart some distinctiveness. Crucially, the study found that for several domain-model pairs, especially in narrowly defined clinical areas such as viral pneumonia and hypertension, the cross-strategy similarity exceeded the 0.65 "high-similarity" threshold. This significant finding indicates a concerning convergence: even without access to internal, proprietary guidance, LLMs prompted only with publicly available information can generate items that closely resemble those produced using privileged internal resources. This phenomenon heightens the risk of item exposure, as easily accessible AI could inadvertently replicate or predict sensitive exam content.
Implications for High-Stakes Assessments and Business Integrity
The study's findings carry profound implications for organizations responsible for high-stakes assessments, particularly in fields requiring strict certification like medicine. The ability of publicly accessible LLMs to generate questions semantically similar to those produced with proprietary data creates a scalable test security threat. This can undermine the validity and fairness of examinations, erode public trust in certification processes, and potentially lead to significant financial and reputational damage for the assessing bodies.
For businesses and educational institutions, this highlights the necessity of a "human-first, AI-assisted" approach to content development. While AI offers powerful tools for efficiency, human oversight and expert editing remain indispensable for maintaining quality and security. Organizations must consider stricter separation of formative (practice) and summative (official) item pools. Furthermore, implementing "systematic similarity surveillance" – using AI to detect overlap between newly generated items and existing or public content – becomes a critical safeguard. This proactive measure ensures that innovation through AI does not compromise the fundamental integrity of assessments. ARSA's expertise in AI and IoT, built up since 2018, can help businesses develop secure, robust systems for sensitive data handling and analytics across a wide range of industries.
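A minimal version of such surveillance could reuse the embedding pipeline sketched earlier: embed each newly drafted item, compare it against the operational pool, and route anything above a policy threshold to human review. The 0.65 cut-off below mirrors the study's high-similarity threshold; the pool contents and flagging logic are hypothetical.

```python
# Hedged sketch of systematic similarity surveillance; reuses embed(),
# torch, and F from the earlier snippets. Pool contents are hypothetical.
THRESHOLD = 0.65  # mirrors the study's high-similarity threshold

def screen_new_items(new_items: list[str], operational_pool: list[str]) -> None:
    """Flag drafts whose nearest operational item exceeds the threshold."""
    pool_vecs = torch.stack([embed(q) for q in operational_pool])
    for draft in new_items:
        sims = F.cosine_similarity(embed(draft).unsqueeze(0), pool_vecs)
        score, idx = sims.max(dim=0)
        if score.item() >= THRESHOLD:
            # In practice this would route the pair to human review.
            print(f"FLAG {score.item():.2f}: draft overlaps pool item {idx.item()}")

screen_new_items(
    ["Draft stem on first-line management of stage 2 hypertension..."],
    ["Operational stem on antihypertensive therapy...",
     "Operational stem on ventilator management in ARDS..."],
)
```

The same screen can be pointed at public question banks and study materials, matching the suggestion above to monitor overlap with public content as well as with the operational pool.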
Safeguarding Your Intellectual Property in the AI Era
The convergence of AI-generated content poses not only an exam security risk but also a broader threat to intellectual property (IP) and proprietary knowledge across various business sectors. Companies investing heavily in developing unique content, training materials, or strategic documents must recognize that publicly available information about their operations or products can potentially be leveraged by LLMs to create highly similar outputs. This underscores the need for a re-evaluation of what constitutes truly proprietary information in the age of generative AI.
Organizations must carefully consider the granularity of information they release publicly. While transparency is valuable, disclosing overly detailed blueprints or frameworks could inadvertently provide LLMs with sufficient context to mimic internal content. Implementing robust AI governance policies, including guidelines for LLM usage in content creation and the systematic monitoring of generated materials for similarity, is no longer optional. Adopting advanced AI solutions like the ARSA AI Box series can help companies deploy edge computing for local data processing, offering a "privacy-first" approach where sensitive data never leaves the premises and IP protection is stronger than with cloud-dependent AI solutions.
Are you ready to enhance your organizational security and operational efficiency with cutting-edge AI solutions? For a personalized consultation on how AI and IoT can transform your business while safeguarding critical assets, contact ARSA today.
Ready to Implement AI Solutions for Your Business?
ARSA Technology's expert team is ready to support your company's digital transformation with the latest AI and IoT solutions. Get a free consultation and a demo of the right solution for your industry's needs.