Advancing Real-Time AI Assistance: How Multimodal LLMs Transform Technical Operations
Explore how Multimodal Large Language Models (MLMs) are revolutionizing real-time technical assistance, leveraging visual and textual data for procedural tasks. Discover the M2AD dataset and its role in evaluating AI's ability to guide complex operations, improve efficiency, and ensure data privacy.
The landscape of Artificial Intelligence (AI) is rapidly evolving, with Large Language Models (LLMs) moving beyond text-only capabilities into the realm of multimodal understanding. This evolution introduces Multimodal Large Language Models (MLMs), which can process and reason over diverse inputs like images, video, and audio alongside text. This leap is crucial for AI to truly support complex, real-world tasks, especially in environments demanding real-time technical assistance. Imagine an AI not just understanding your verbal query but also "seeing" your environment through a camera or an augmented reality (AR) headset, providing context-aware guidance as you perform intricate procedures. This article explores cutting-edge research into evaluating these MLMs for such demanding applications, highlighting a new dataset designed to push the boundaries of AI assistance.
The Dawn of Multimodal AI Assistance
For years, LLMs have proven invaluable in processing vast amounts of textual information, generating insights, and offering domain-specific problem-solving suggestions. The natural progression is to extend these capabilities to include visual inputs, enabling AI to understand the physical world in real-time. This is where MLMs shine, integrating information from multiple modalities to gain a broader, more nuanced understanding of complex scenarios. In technical assistance, this means an MLM could guide a technician through an assembly process, detect potential errors, or predict necessary actions by analyzing video feeds from a shared point of view, perhaps via AR or Virtual Reality (VR) interfaces. Such real-time, context-aware assistance holds immense potential for industries ranging from manufacturing to healthcare.
However, evaluating the true capabilities of these sophisticated MLMs remains a significant challenge. Traditional benchmarks often fall short, focusing on isolated skills rather than the sequential understanding, real-time interaction, and procedural reasoning required for practical applications. This gap necessitates specialized evaluation tools that reflect the complexities of multi-step, real-world tasks. For instance, in an industrial setting, understanding a procedure often involves not just recognizing objects but interpreting their spatial relationships, anticipating actions, and correlating visual cues with detailed instructions. ARSA, for example, offers robust AI Video Analytics solutions that leverage computer vision to provide real-time operational intelligence, demonstrating the practical application of visual AI in complex environments.
Bridging the Gap: The Manual-to-Action Dataset (M2AD)
To address the limitations in current evaluation methods, researchers have developed the Manual-to-Action Dataset (M2AD). This innovative dataset is specifically designed for assessing the reliability of MLMs in multi-step, real-world assistance scenarios. M2AD features video clips of furniture assembly, captured from various perspectives, alongside their corresponding instruction manuals. Crucially, these videos and manuals are annotated with precise step-by-step labels, allowing for direct mapping between visual actions and written instructions. This level of granular alignment is vital for training and evaluating AI systems that need to understand procedural tasks in detail.
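To make this alignment concrete, the sketch below shows one plausible way a step-level annotation could be represented in Python. The field names and structure are our own illustration of the idea, not the dataset's published schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for an M2AD-style annotation; the fields are
# illustrative assumptions, not the dataset's actual format.
@dataclass
class StepAnnotation:
    video_id: str      # identifier of the assembly video clip
    camera_view: str   # e.g. "egocentric" or "third_person"
    start_sec: float   # where the step begins in the video
    end_sec: float     # where the step ends
    step_index: int    # position in the assembly procedure
    manual_page: int   # manual page describing this step

# One step of a furniture build, aligned to its manual page.
example = StepAnnotation(
    video_id="desk_042",
    camera_view="egocentric",
    start_sec=12.5,
    end_sec=47.0,
    step_index=3,
    manual_page=5,
)
```

The key property such a record captures is the direct, timestamped link between what happens on screen and where it is documented in the manual, which is exactly what an assistance system needs in order to ground its guidance.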
The motivation behind M2AD stems from the need for a benchmark that moves beyond simpler tasks to evaluate MLMs on comprehensive, intricate procedures like furniture assembly. Existing datasets often lack the depth required for fine-grained understanding of both visual and textual cues in a procedural context. M2AD aims to provide appropriate data for assessing how well an MLM handles complex, multi-step tasks, and it is designed to work with newer end-to-end MLMs without requiring overly complex pre-annotations.
Evaluating Multimodal AI in Action
The M2AD dataset was used to run baseline experiments on openly available MLMs, focusing on models that can run on consumer-level hardware. This choice emphasizes reproducibility and data confidentiality, a critical concern in many industrial settings. The evaluation aimed to answer three fundamental questions about MLM capabilities (a simplified evaluation sketch follows the list):
1. Reducing Labeling Needs: To what extent can the reasoning abilities of MLMs reduce the need for detailed, often costly, manual labeling practices? By leveraging AI to infer more from raw data, organizations can achieve more efficient and cost-effective annotation processes.
2. Tracking Assembly Progress: Are MLMs capable of accurately tracking the progression of assembly steps within a video, understanding which stage of a procedure is currently being performed? This is vital for real-time guidance and error detection.
3. Referencing Instruction Manuals: Can MLMs correctly refer to the specific pages or sections of an instruction manual relevant to the current visual context? This capability is crucial for providing precise, context-aware instructions to users.
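As a rough illustration of how the second question might be tested, the sketch below scores a model's step-tracking answers against ground-truth labels. The query_mlm helper is a hypothetical stand-in for whatever local inference API is used, and the prompt and scoring are a simplification, not the paper's evaluation protocol.

```python
# Minimal sketch of a step-tracking evaluation loop, assuming a local MLM
# is exposed through a query_mlm(frames, prompt) helper (hypothetical).

def query_mlm(frames, prompt):
    """Placeholder for a call to a locally hosted multimodal model."""
    raise NotImplementedError

def predict_current_step(frames, num_steps):
    """Ask the model which assembly step the frames show."""
    prompt = (
        f"The assembly procedure has {num_steps} steps. "
        "Based on these video frames, which step is being performed? "
        "Answer with the step number only."
    )
    answer = query_mlm(frames, prompt)
    return int(answer.strip())

def step_tracking_accuracy(clips):
    """clips: iterable of (frames, true_step, num_steps) tuples."""
    correct, total = 0, 0
    for frames, true_step, num_steps in clips:
        try:
            pred = predict_current_step(frames, num_steps)
        except ValueError:
            pred = -1  # model returned something unparseable
        correct += int(pred == true_step)
        total += 1
    return correct / max(total, 1)
```

Even this simple harness surfaces the practical difficulties: the model must be prompted to answer in a machine-readable form, and unparseable answers have to be counted as errors rather than silently dropped.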
The results from these baseline experiments revealed that while some MLMs demonstrated an understanding of procedural sequences, their performance was constrained by current architectures and the limits of consumer-grade hardware. Specifically, the findings highlighted a critical need for advancements in multi-image and interleaved text-image reasoning: future MLMs must be better equipped to process a continuous stream of visual data alongside textual instructions, constantly correlating information across modalities to maintain a coherent understanding of the ongoing task. These findings underscore the importance of edge computing solutions like the ARSA AI Box Series, which enables real-time AI processing locally without cloud dependency, ensuring low latency and enhanced data privacy for critical operations.
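To see why interleaved text-image reasoning matters in practice, consider how a prompt for this kind of task is assembled. The sketch below alternates a manual page with camera frames inside a single message; the content-part format is one common convention, not a specific model server's API, and would need adapting to the inference stack in use.

```python
import base64

# Sketch of an interleaved text-image prompt: a list of content parts that
# alternate between text and base64-encoded images. The part format here is
# an assumption modeled on common chat-style multimodal APIs.

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def build_interleaved_prompt(manual_page_path, frame_paths):
    parts = [
        {"type": "text", "text": "Instruction manual page for the current step:"},
        {"type": "image", "data": encode_image(manual_page_path)},
        {"type": "text", "text": "Recent frames from the technician's camera:"},
    ]
    for path in frame_paths:
        parts.append({"type": "image", "data": encode_image(path)})
    parts.append({
        "type": "text",
        "text": "Does the video match this manual step? If not, what differs?",
    })
    return parts
```

The difficulty the baseline experiments point to is precisely here: the model must hold the manual page and a growing sequence of frames in context at once, and keep relating them as the task progresses.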
The Business Impact: Efficiency, Safety, and Compliance
The implications of this research for businesses are profound. Imagine a manufacturing plant where new workers can be onboarded faster and with fewer errors, guided by an AI assistant showing them exactly what to do and where, referencing detailed schematics in real-time. In maintenance, technicians can receive immediate diagnostics and repair instructions, reducing downtime and improving operational efficiency. For example, a system could detect if a worker is not wearing the correct Personal Protective Equipment (PPE) in a hazardous zone, triggering real-time alerts. ARSA’s AI BOX - Basic Safety Guard is a prime example of an edge AI solution that performs such safety and compliance monitoring for industrial environments.
Furthermore, the emphasis on data confidentiality and on-premise deployment aligns perfectly with stringent industry regulations and enterprise security needs. Companies can leverage these advanced AI capabilities without compromising sensitive operational data, a key differentiator for critical infrastructure operators and government entities. The ability of MLMs to minimize the need for extensive manual labeling also translates into significant cost savings and faster deployment cycles for new AI-powered assistance systems. This research paves the way for a future where AI isn't just a tool, but a truly intelligent partner in complex operational workflows, making assistance more reliable, efficient, and tailored to the dynamic realities of the workplace.
The ongoing advancements in Multimodal Large Language Models promise to transform how we approach technical assistance and procedural tasks. By developing robust evaluation frameworks like the M2AD dataset, researchers are paving the way for AI systems that can provide context-aware, real-time guidance with unprecedented accuracy. As these technologies mature, they will unlock new levels of efficiency, safety, and operational excellence across various industries.
To explore how ARSA Technology can provide custom AI and IoT solutions for your enterprise, we invite you to contact ARSA for a free consultation.
Source: Toschi, F., Brunello, N., Sassella, A., Scotti, V., & Carman, M. J. (2026). From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs. https://arxiv.org/abs/2603.22321