AI-Powered Insights: Measuring Open Science in Transportation Research with Large Language Models

Explore how Large Language Models (LLMs) are revolutionizing the measurement of open science practices in transportation research, offering scalable, accurate insights into data and code availability.

ARSA Technology Team

22 Jan 2026

The Imperative of Open Science in Modern Research

Open science, a movement advocating for transparency and accessibility in scientific research, has become a cornerstone for strengthening scientific integrity and accelerating progress across numerous disciplines. Its core tenets — making research data, methods, and publications freely available — foster collaboration, enable reproducibility, and build trust in scientific findings. Organizations like UNESCO champion Open Science Monitoring (OSM), emphasizing that transparent measurement is crucial for guiding and accelerating the transformation of research practices.

The principles of open science directly support reproducibility and replicability in science. Reproducibility means that the same results can be obtained using the same data and methods, while replicability means achieving similar results with different data or methods. The first step toward both goals is simply making data and code available, so that other researchers can verify and build on the original work, whatever its initial quality.

Challenges in Measuring Open Science Adoption

Despite the clear benefits and growing recognition, assessing the actual adoption of open science practices, such as data and code availability, remains a significant challenge. Traditional methods typically involve labor-intensive manual review of individual papers. This approach is not only time-consuming and costly but also struggles with scalability when applied to vast volumes of research articles. Furthermore, manual annotation can lead to inconsistencies and inaccuracies, particularly when multiple reviewers are involved without stringent, unified guidelines.

Simpler automated approaches, relying on keyword searches or regular expressions, have attempted to improve scalability. However, these methods often fall short in capturing the subtle context surrounding information. For instance, they might mistakenly flag a GitHub link in a citation as evidence of shared code, rather than understanding its true purpose as a reference. This creates a dilemma: how to achieve both accuracy and scalability without sacrificing one for the other?
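The failure mode described above is easy to reproduce. The following is a minimal sketch, not the method used in the study: the regular expression, the function name `naive_shares_code`, and both text snippets are hypothetical and written only for illustration.

```python
import re

# Naive rule: any GitHub repository URL counts as "the paper shares code".
GITHUB_URL = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")


def naive_shares_code(text: str) -> bool:
    """Return True if the text contains any GitHub repository URL."""
    return GITHUB_URL.search(text) is not None


# Hypothetical snippets, invented for this example.
availability_statement = (
    "The code used in this study is available at "
    "https://github.com/example-lab/traffic-model."
)
reference_entry = (
    "[12] Doe, J. (2020). A routing toolkit. "
    "https://github.com/example-org/routing-toolkit"
)

# The rule cannot tell a release statement from a citation: both print True.
for label, snippet in [
    ("availability statement", availability_statement),
    ("reference entry", reference_entry),
]:
    print(label, "->", naive_shares_code(snippet))
```

Both snippets are flagged, although only the first describes code released by the authors. Telling the two apart requires looking at the text around the link, which a pattern match alone does not do.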

Leveraging Large Language Models for Contextual Understanding

The inherent complexity of fields like transportation research further compounds the measurement challenge. The line between simply using publicly accessible data and proactively making research artifacts available can be blurred. A study might rely on a well-known public dataset but fail to release the specific processed or derived datasets crucial for replicating its analysis. This ambiguity transforms data and code availability from a simple binary state into a spectrum, demanding nuanced understanding.

To address these complexities, innovative approaches are needed. The latest advancements in Artificial Intelligence, particularly Large Language Models (LLMs), offer a powerful solution. LLMs can analyze full-text articles, capturing the intricate contextual signals that evade traditional keyword-based methods. This enables automated, yet highly accurate, extraction of information regarding data and code availability, bridging the gap between manual precision and broad scalability.

A Novel AI-Powered Pipeline for Research Monitoring

A recent study introduced an automated feature-extraction pipeline leveraging LLMs to measure data and code availability in transportation research. This pipeline processes thousands of research articles, meticulously identifying and validating artifact links and classifying their availability. It represents a critical step towards automated measurement and monitoring of open science practices, establishing a repeatable and cost-effective system.

This methodology formalizes definitions and decision rules upfront, ensuring transparency and reproducibility in measurement. By analyzing Elsevier XML files at scale, the LLM-powered system can precisely detect whether a paper truly makes its data and code repositories accessible, distinguishing this from mere mentions of public datasets or incidental links. This capability is paramount for generating reliable insights into open science adoption.
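The overall shape of such a pipeline can be sketched in a few steps: parse the article XML, collect each link together with its section and surrounding text, then ask a model to label the link under fixed decision rules. The code below is an illustration under stated assumptions, not the study's implementation. The XML layout, the label set, the prompt wording, and the `ask_llm` callable are all hypothetical; real publisher XML is considerably richer, and the link-validation step is omitted.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass

URL_PATTERN = re.compile(r'https?://[^\s<>")]+')

# Hypothetical label set standing in for formal decision rules.
ALLOWED_LABELS = {"SHARED_CODE", "SHARED_DATA", "REFERENCE_ONLY"}

# Hypothetical, simplified article XML (not a real publisher schema).
SAMPLE_XML = """
<article>
  <section title="Data availability">
    <para>Code is available at https://github.com/example-lab/traffic-model for reuse.</para>
  </section>
  <section title="References">
    <para>Doe, J. (2020). A routing toolkit. https://github.com/example-org/routing-toolkit</para>
  </section>
</article>
"""


@dataclass
class LinkCandidate:
    url: str      # the link itself
    section: str  # title of the section the link appeared in
    context: str  # full text of the paragraph containing the link


def extract_candidates(xml_text: str) -> list[LinkCandidate]:
    """Collect every URL together with its section title and paragraph."""
    root = ET.fromstring(xml_text.strip())
    candidates = []
    for section in root.iter("section"):
        title = section.get("title", "")
        for para in section.iter("para"):
            text = "".join(para.itertext()).strip()
            for url in URL_PATTERN.findall(text):
                # Drop trailing sentence punctuation captured by the pattern.
                candidates.append(LinkCandidate(url.rstrip(".,;"), title, text))
    return candidates


def build_prompt(candidate: LinkCandidate) -> str:
    """Turn one candidate into a fixed-format classification prompt."""
    return (
        "Decide whether the authors release their own artifact at this link.\n"
        "Answer with exactly one label: SHARED_CODE, SHARED_DATA, or REFERENCE_ONLY.\n"
        f"Section: {candidate.section}\n"
        f"Link: {candidate.url}\n"
        f"Context: {candidate.context}\n"
    )


def classify(candidate: LinkCandidate, ask_llm) -> str:
    """Label a candidate; ask_llm is any callable mapping a prompt to text."""
    answer = ask_llm(build_prompt(candidate)).strip().upper()
    # Anything outside the agreed label set is routed to manual review.
    return answer if answer in ALLOWED_LABELS else "UNCLEAR"


def stub_llm(prompt: str) -> str:
    """Placeholder for a real model call, used only to make the sketch run."""
    return "REFERENCE_ONLY" if "Section: References" in prompt else "SHARED_CODE"


for cand in extract_candidates(SAMPLE_XML):
    print(cand.section, "|", cand.url, "|", classify(cand, stub_llm))
```

Passing the model in as a plain callable keeps the extraction and prompt-building steps testable without network access. Restricting answers to a closed label set, with a fallback for anything else, is one simple way to keep the decision rules explicit and the results auditable.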

Key Findings on Open Science in Transportation Research

Applying this advanced AI pipeline to over 10,000 research articles in prominent transportation journals published between 2019 and 2024 revealed some striking trends. The analysis found that only a small fraction of quantitative papers actively shared their research artifacts: approximately 5% provided a code repository, 4% shared a data repository, and merely 3% shared both. These rates varied across different journals, topics, and geographical regions, highlighting inconsistencies in adoption.
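These three figures also imply how many papers shared anything at all. The short calculation below is a back-of-the-envelope sketch: it assumes the 5% and 4% figures each include the papers that shared both, and it uses the rounded percentages quoted above, so the result is approximate.

```python
# Approximate shares of quantitative papers, as quoted in the text.
code_share = 0.05  # provided a code repository
data_share = 0.04  # shared a data repository
both_share = 0.03  # shared both (assumed to be counted in each figure above)

# Inclusion-exclusion: P(code or data) = P(code) + P(data) - P(code and data)
either_share = code_share + data_share - both_share
neither_share = 1 - either_share

print(f"shared at least one artifact: {either_share:.0%}")  # 6%
print(f"shared neither: {neither_share:.0%}")               # 94%
```

Under that assumption, roughly 94% of the quantitative papers in the sample shared neither data nor code.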

Perhaps more importantly, the study found no noticeable difference in citation counts or review duration between papers that shared data and code and those that did not. This suggests a misalignment between current open science efforts and traditional academic incentives. Without direct encouragement, such as recognition in career advancement or funding decisions, open science practices may struggle to gain widespread traction. This points to a clear need for structural interventions from journals and funding bodies to foster these practices.

Beyond Academia: Practical AI for Enterprise Data Extraction

The principles underlying this academic research extend far beyond monitoring scientific publishing. The ability of AI, particularly LLMs, to perform scalable and accurate contextual data extraction holds immense value for various enterprise applications. Businesses constantly grapple with vast amounts of unstructured text data — contracts, reports, customer feedback, and compliance documents — where critical insights are often buried.

For instance, in highly regulated industries, a similar AI pipeline could automatically audit documents for compliance with specific standards or verify policy adherence. In legal or financial sectors, it could extract key terms and conditions from complex agreements, reducing human error and expediting review. This kind of automation streamlines operations, reduces the cost of manual data processing, and provides actionable intelligence. ARSA's expertise in leveraging AI for nuanced data processing is evident in solutions like AI Video Analytics, which transforms raw visual data into actionable security and operational insights, and the ARSA AI API, which allows businesses to integrate advanced AI capabilities into their existing applications.

ARSA's Approach to Driving AI-Powered Transformation

At ARSA Technology, we understand that unlocking value from complex data requires a blend of deep technical expertise and a practical, ROI-driven approach. Our solutions harness the power of AI and IoT to convert challenging data extraction and monitoring tasks into efficient, automated processes. Whether it's enhancing operational efficiency, ensuring compliance, or generating new insights, ARSA designs systems that deliver tangible business outcomes.

Our AI Box Series, for example, provides edge computing power for real-time analytics, processing data locally to ensure maximum privacy and instant insights. This on-premise processing capability is ideal for scenarios where data residency and low latency are critical. With experience dating back to 2018, ARSA is a trusted partner across various industries, ready to tackle unique business challenges with proven, scalable AI solutions.

The development of AI pipelines that can accurately and scalably measure complex phenomena, whether it's open science adoption in research or compliance in an industrial setting, represents a significant leap forward. By providing precise, data-driven insights, these technologies empower organizations to make informed decisions and drive meaningful change.

Ready to harness the power of AI for your organization's data extraction and monitoring needs? Explore ARSA's comprehensive AI and IoT solutions and contact ARSA today for a free consultation.