Advancing Federated Learning: PrivFusion for Privacy-Preserving Data Harmonization in Distributed AI

Explore PrivFusion, a multi-agent framework that automates privacy-preserving data harmonization for federated learning, tackling heterogeneity in sensitive datasets for healthcare and enterprise AI.

The Unseen Hurdle in Collaborative AI: Data Heterogeneity

      The digital age has ushered in an unprecedented volume and complexity of data, particularly across sensitive sectors like clinical medicine, public health, and finance. This surge has fueled the adoption of advanced Machine Learning (ML) techniques for critical tasks, from predicting health risks to detecting fraud. However, a significant barrier often prevents the full potential of these ML applications: the necessity for centralized data aggregation. For sensitive information like electronic health records or proprietary financial data, centralizing data is often impossible due to stringent privacy regulations, security concerns, and governance policies. Institutions are frequently unwilling to share raw data, which severely limits the scope for large-scale, multi-site analytics and robust model training.

      Federated Learning (FL) has emerged as a powerful paradigm to circumvent these challenges. It allows multiple institutions to collaboratively train a shared ML model while keeping their sensitive raw data securely on-premises. While FL promises a future of collaborative intelligence without compromising privacy, its real-world deployment faces a central, often overlooked obstacle: data heterogeneity. Clinical datasets, for example, commonly exhibit vast differences in how features are defined, the coding systems used, granularity levels, measurement practices, and overall data quality. Before any meaningful distributed analysis or model training can commence, these diverse datasets must be meticulously harmonized – a process that involves aligning variables, resolving semantic inconsistencies, and standardizing representations.

Federated Learning: A Promise Held Back by Data Differences

      The core principle of Federated Learning is to enable collaborative intelligence where data remains local. Instead of sending raw data to a central server, only model updates are shared. This protects sensitive information, making it ideal for industries under strict regulatory frameworks such as healthcare. Yet, the foundational assumption in most FL methodologies is that data harmonization has already been completed. This assumption is often unrealistic, transforming data heterogeneity into a major bottleneck that delays or limits the scalability of federated studies. The manual effort required to align disparate datasets can be immense, demanding significant resources and specialized expertise.
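
      To make "only model updates are shared" concrete, here is a minimal sketch of federated averaging (FedAvg), a common aggregation rule in FL. The source does not say which aggregation rule is used alongside PrivFusion, so the function name, the weighting by local sample count, and the toy numbers are all illustrative assumptions.

```python
from typing import List, Sequence


def federated_average(updates: Sequence[Sequence[float]],
                      num_samples: Sequence[int]) -> List[float]:
    """Weighted average of per-site parameter vectors (FedAvg-style)."""
    if not updates or len(updates) != len(num_samples):
        raise ValueError("need one sample count per update")
    total = sum(num_samples)
    dim = len(updates[0])
    averaged = [0.0] * dim
    for params, n in zip(updates, num_samples):
        if len(params) != dim:
            raise ValueError("all updates must have the same length")
        for i, value in enumerate(params):
            # Sites with more local rows contribute proportionally more.
            averaged[i] += value * n / total
    return averaged


# Two hypothetical sites send parameters, never raw records.
site_a = [1.0, 2.0]   # trained on 100 local rows
site_b = [3.0, 6.0]   # trained on 300 local rows
global_params = federated_average([site_a, site_b], [100, 300])
```

      The averaging only makes sense if position i in every vector refers to the same model input, which is exactly what breaks when sites name and encode their features differently.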

      Existing approaches to data harmonization, such as ontology-driven or semantic techniques, attempt to alleviate some of these differences. However, they frequently require a degree of centralization or rely on predefined common formalisms and mappings. Establishing these common standards across multiple institutions, each with its unique workflows and data models, is a monumental task. This highlights a critical need for automated, scalable, and genuinely privacy-preserving harmonization methods that can operate effectively in distributed environments without requiring sensitive data centralization. Such capabilities are essential to reduce the manual burden on participating institutions and broaden access to collaborative analytics, ultimately leading to more generalized and impactful AI models.

Introducing PrivFusion: A Multi-Agent Framework for Intelligent Harmonization

      To bridge this gap, IBM Research Dublin developed PrivFusion, a privacy-preserving multi-agent framework designed to automate the harmonization of structured datasets. PrivFusion runs as a step before federated model training, so that datasets are consistent from the outset. In its multi-agent architecture, autonomous software agents each perform a specific task, which keeps the harmonization process modular and adaptable. The agents analyze local data, cluster semantically similar features across sites, and generate transformation recommendations iteratively until the datasets are aligned.

      PrivFusion tackles the complexity of diverse datasets head-on. By automating the identification and resolution of inconsistencies, it significantly reduces the manual effort typically associated with preparing data for multi-site analytics. This innovative approach ensures that even highly heterogeneous datasets, such as those found across various clinical institutions, can be prepared for collaborative AI without compromising the privacy or security of the underlying data. Technologies like ARSA's AI Video Analytics often leverage harmonized data to deliver accurate insights, underscoring the importance of such foundational frameworks.

How PrivFusion Works: An Iterative Dance of Data and Intelligence

      The PrivFusion framework operates through a series of privacy-preserving, iterative steps, facilitating collaboration between individual researchers (or institutions) and a central server. The goal is to harmonize data features without exposing raw sensitive information.

      1. Local Data Analysis by Agents: Each participating researcher first conducts a local analysis of their dataset using a suite of specialized AI agents.

  • Feature Data Type Extraction: Agents identify basic data types (e.g., string, numeric, floating point).
  • Semantic Type Association: Features are associated with semantic concepts, often by mapping them against a selected general-purpose ontology like DBPedia. An ontology, in simple terms, is a structured knowledge base that defines concepts and their relationships, helping the system understand the meaning behind data labels.
  • Dataset and Feature Descriptions: Agents generate high-level descriptions of the entire dataset and detailed semantic descriptions for individual features.
  • Topic and Relationship Inference: Agents identify relevant topics within the dataset and infer relationships between different features (e.g., "patient age" is related to "diagnosis date").
  • Synthetic Sample Generation: Each researcher also generates a small number of synthetic samples. These are deliberately low-utility samples: they preserve the format and domain of the original feature values but contain no actual individual records. State-of-the-art Differential Privacy (DP) approaches can be employed here, which provide a mathematical bound on how much the synthetic samples can reveal about any individual record.
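
      The local analysis can be pictured with a small sketch that infers a basic data type for a column and draws placeholder values in the same format and domain. This is an assumption-laden illustration, not the framework's agents: the function names are hypothetical, the type rules are simplistic, columns are assumed to be non-empty lists of strings, and the sampling shown here is not differentially private.

```python
import random
from typing import List


def infer_basic_type(values: List[str]) -> str:
    """Classify a column of raw strings as integer, float, or string."""
    def all_parse(cast) -> bool:
        for v in values:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


def synthetic_samples(values: List[str], k: int, seed: int = 0) -> List[str]:
    """Draw k placeholder values that keep the column's format and domain.

    NOTE: illustrative only. This is NOT differentially private; the
    observed min, max, and label set are used directly.
    """
    rng = random.Random(seed)
    kind = infer_basic_type(values)
    if kind == "integer":
        nums = [int(v) for v in values]
        return [str(rng.randint(min(nums), max(nums))) for _ in range(k)]
    if kind == "float":
        reals = [float(v) for v in values]
        return [f"{rng.uniform(min(reals), max(reals)):.1f}" for _ in range(k)]
    # Categorical strings: sample from the set of distinct labels.
    domain = sorted(set(values))
    return [rng.choice(domain) for _ in range(k)]


ages = ["34", "51", "67"]
sexes = ["F", "M", "F"]
profile = {
    "age": {"type": infer_basic_type(ages), "samples": synthetic_samples(ages, 3)},
    "sex": {"type": infer_basic_type(sexes), "samples": synthetic_samples(sexes, 3)},
}
```

      The semantic steps (ontology mapping, descriptions, topic and relationship inference) are not sketched here, since the source attributes them to AI agents whose prompts and models are not described in this article.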

      2. Metadata Sharing and Centralized Processing: Instead of raw data, researchers send this comprehensive metadata (dataset description, feature names and types, semantic descriptions, topics, inferred relationships, and synthetic samples) to the central server.
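
      As a rough picture of what a site might send, the sketch below bundles the metadata items listed above into one JSON message. The class names, field names, and example values are hypothetical; the source lists the kinds of metadata that are shared but does not specify a wire format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple


@dataclass
class FeatureMetadata:
    name: str
    data_type: str
    semantic_description: str
    synthetic_samples: List[str] = field(default_factory=list)


@dataclass
class SiteMetadata:
    site_id: str
    dataset_description: str
    topics: List[str]
    features: List[FeatureMetadata]
    # Pairs of feature names that the local agents consider related.
    relationships: List[Tuple[str, str]] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the metadata; no raw rows are part of this object."""
        return json.dumps(asdict(self), sort_keys=True)


payload = SiteMetadata(
    site_id="site-a",
    dataset_description="Hypothetical COVID-19 admissions table",
    topics=["demographics", "diagnosis"],
    features=[
        FeatureMetadata("age_in_years", "integer",
                        "Patient age in years", ["41", "63"]),
    ],
    relationships=[("age_in_years", "diagnosis_date")],
)
message = payload.to_json()
```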

      3. Clustering and Recommendation Generation: Upon receiving metadata from all participating sites, the server uses it to cluster semantically related features across the diverse datasets. Based on these clusters, a dedicated agent generates harmonization recommendations for each site. These recommendations precisely specify which features require transformation and the target representations needed for alignment (e.g., converting "age_in_years" to a standardized "patient_age_numerical" format).
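
      The clustering step can be sketched with a deliberately simple stand-in: token overlap between feature names, with matching features from different sites linked into groups. The source says the server clusters on the full shared metadata (descriptions, types, topics, samples), so a name-only Jaccard similarity, the 0.5 threshold, and all function names here are assumptions chosen to keep the example self-contained.

```python
import re
from typing import Dict, List, Set, Tuple

Feature = Tuple[str, str]  # (site_id, feature_name)


def tokens(name: str) -> Set[str]:
    """Split 'age_in_years' or 'patientAge' into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return {t for t in re.split(r"[^A-Za-z0-9]+", spaced.lower()) if t}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_features(features: List[Feature],
                     threshold: float = 0.5) -> List[List[Feature]]:
    """Single-linkage grouping of cross-site features via union-find."""
    parent = list(range(len(features)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    token_sets = [tokens(name) for _, name in features]
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            # Only link features that come from different sites.
            if features[i][0] == features[j][0]:
                continue
            if jaccard(token_sets[i], token_sets[j]) >= threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Feature]] = {}
    for i, feat in enumerate(features):
        groups.setdefault(find(i), []).append(feat)
    return sorted(sorted(g) for g in groups.values())


clusters = cluster_features([
    ("a", "patient_age"), ("a", "test_result"),
    ("b", "age"), ("b", "covid_test_result"),
])
```

      Each resulting cluster is a candidate for one harmonized target feature; a recommendation agent would then pick the target name and representation for every member of the cluster.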

      4. Iterative Refinement: Each researcher applies the recommended transformations locally to their dataset. Once applied, updated metadata is generated and resent to the server. This iterative process continues until no further transformations are required, indicating that the datasets are sufficiently harmonized for federated training. The authors' evaluation reports that PrivFusion typically achieves harmonization within 2–3 iterations across dataset pairs, even when varying the underlying Large Language Model (LLM) used for analysis. The consistent increase in feature-name similarity over successive iterations indicates a progressive convergence of the data schemas. For organizations deploying custom AI solutions, this iterative refinement helps ensure foundational data quality, a factor that ARSA Technology emphasizes in its own offerings.
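
      The control flow of this loop can be sketched as follows. Everything specific is an assumption made for illustration: the fixed alias table stands in for the server-side recommendation agent, and the similarity score (mean best-match string ratio from difflib) is one plausible way to measure feature-name similarity, not necessarily the metric used in the paper.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

# Hypothetical stand-in for the server-side recommendation agent.
ALIASES = {
    "age_in_years": "patient_age", "pt_age": "patient_age",
    "sex": "patient_sex", "gender": "patient_sex",
}


def recommend(schema: List[str]) -> Dict[str, str]:
    """Return {old_name: new_name} for features that still need renaming."""
    return {name: ALIASES[name] for name in schema if name in ALIASES}


def name_similarity(a: List[str], b: List[str]) -> float:
    """Mean best-match string similarity from schema a to schema b."""
    return sum(max(SequenceMatcher(None, x, y).ratio() for y in b)
               for x in a) / len(a)


def harmonize(schemas: Dict[str, List[str]],
              max_rounds: int = 10) -> Tuple[Dict[str, List[str]], int]:
    """Repeat recommend-and-apply until no site needs further changes."""
    rounds_applied = 0
    for _ in range(max_rounds):
        recs = {site: recommend(names) for site, names in schemas.items()}
        if not any(recs.values()):
            break  # nothing left to transform: converged
        # Each site applies its own recommendations locally.
        schemas = {site: [recs[site].get(n, n) for n in names]
                   for site, names in schemas.items()}
        rounds_applied += 1
    return schemas, rounds_applied


sites = {"site_a": ["age_in_years", "sex"], "site_b": ["pt_age", "gender"]}
before = name_similarity(sites["site_a"], sites["site_b"])
final, rounds = harmonize(sites)
after = name_similarity(final["site_a"], final["site_b"])
```

      With a fixed alias table the toy example converges after a single round of transformations; an LLM-driven agent may need several rounds because each set of renames changes the metadata it sees next.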

Ensuring Privacy and Trust in Distributed Environments

      PrivFusion is built with a strong emphasis on privacy and trust, operating within an "honest-but-curious" threat model. In this model, the central server is assumed to follow the protocol correctly but may attempt to infer sensitive information from the shared metadata. PrivFusion mitigates these risks by ensuring that no raw, sensitive data ever leaves the local environment. The use of synthetic samples, generated with Differential Privacy, is a cornerstone of this privacy-preserving design. These samples give the server enough information about the format and value domain of each feature for harmonization purposes, while limiting what can be learned about any individual.
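
      One textbook way to obtain differentially private synthetic values for a categorical feature is to add Laplace noise to a histogram over a fixed, publicly known set of categories and then sample from the noisy histogram. The sketch below assumes each individual contributes one record (so each count has sensitivity 1) and that the category list is not itself sensitive. The source does not state which DP mechanism PrivFusion uses, so this is an illustrative stand-in, and it is not hardened against floating-point attacks on the Laplace mechanism.

```python
import random
from collections import Counter
from typing import List, Sequence


def dp_synthetic_categories(values: Sequence[str], domain: Sequence[str],
                            epsilon: float, k: int,
                            seed: int = 0) -> List[str]:
    """Sample k synthetic labels from a Laplace-noised histogram."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # A fixed seed keeps the example reproducible; a real deployment
    # would need a secure, unseeded source of randomness.
    rng = random.Random(seed)
    counts = Counter(values)
    noisy = []
    for category in domain:
        # The difference of two Exp(rate=epsilon) draws is Laplace
        # noise with scale 1/epsilon.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        # Clipping at zero is post-processing and keeps the guarantee.
        noisy.append(max(0.0, counts.get(category, 0) + noise))
    if sum(noisy) == 0:
        noisy = [1.0] * len(domain)  # fall back to uniform sampling
    return rng.choices(list(domain), weights=noisy, k=k)


local_values = ["positive"] * 12 + ["negative"] * 30
samples = dp_synthetic_categories(
    local_values, domain=["negative", "positive", "unknown"],
    epsilon=1.0, k=5,
)
```

      Smaller values of epsilon add more noise, so the samples say less about the true distribution; that trade-off fits the low-utility role these samples play in harmonization.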

      This privacy-by-design approach is critical for the adoption of collaborative AI, especially in highly regulated sectors. By providing robust safeguards against known attacks like membership inference (determining if an individual is in the dataset) and attribute inference (guessing sensitive attributes about individuals), PrivFusion fosters an environment where institutions can confidently participate in shared AI initiatives. This level of data control and privacy aligns with the needs of enterprises that might utilize platforms such as the ARSA AI Box Series for on-premise, edge AI processing where data sovereignty is paramount.

Real-World Impact and Future Implications

      The evaluation of PrivFusion on four real-world COVID-19 datasets highlights its practical efficacy. The ability to harmonize disparate clinical data with few iterations and reduced manual effort represents a significant step forward for federated learning in healthcare. This framework streamlines a critical prerequisite for multi-site analytics, enabling researchers to leverage larger, more diverse datasets to build more robust and generalizable predictive models. The implications extend beyond healthcare to any industry dealing with sensitive, distributed data, such as financial services, smart city initiatives, or government and defense.

      By automating and securing the data harmonization process, PrivFusion empowers organizations to unlock the full potential of federated AI. It accelerates digital transformation by turning complex, fragmented data landscapes into harmonized, actionable intelligence. This innovation not only reduces operational costs and enhances security but also creates new opportunities for insights and revenue generation that were previously hindered by data silos and privacy concerns.

      As AI continues to evolve, frameworks like PrivFusion will be instrumental in fostering secure, collaborative intelligence across global enterprises.

      For organizations looking to implement advanced AI and IoT solutions, ensuring robust data preparation and privacy is crucial. Explore ARSA's enterprise-grade solutions and contact ARSA for a free consultation on how we can help transform your data into a strategic asset.

      Source: Anisa Halimi et al. (2026). PrivFusion: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets. https://arxiv.org/abs/2605.24249