MatSwarm: trusted swarm transfer learning driven materials computation for secure big data sharing

Nature

ABSTRACT

The rapid advancement of Industry 4.0 necessitates close collaboration among material research institutions to accelerate the development of novel materials. However,

multi-institutional cooperation faces significant challenges in protecting sensitive data, leading to data silos. Additionally, the heterogeneous and non-independent and identically

distributed (non-i.i.d.) nature of material data hinders model accuracy and generalization in collaborative computing. In this paper, we introduce the MatSwarm framework, built on swarm

learning, which integrates federated learning with blockchain technology. MatSwarm features two key innovations: a swarm transfer learning method with a regularization term to enhance the

alignment of local model parameters, and the use of Trusted Execution Environments (TEE) with Intel SGX for heightened security. These advancements significantly enhance accuracy and generalization and ensure data confidentiality throughout the model training and aggregation processes. Implemented within the National Material Data Management and Services (NMDMS)

platform, MatSwarm has successfully aggregated over 14 million material data entries from more than thirty research institutions across China. The framework has demonstrated superior

accuracy and generalization compared to models trained independently by individual institutions.

INTRODUCTION

The integration of Industrial Internet of Things (IIoT) and machine learning is revolutionizing research and development in

materials science1,2. In the Industry 4.0 setting, advanced sensors and data acquisition technologies enable real-time

monitoring of material parameters such as temperature, hardness, melting point, and boiling point, providing unprecedented data support3. Concurrently, machine learning algorithms analyze

this vast data, allowing researchers to predict material properties, optimize designs, and discover new materials based on performance, structural properties, and preparatory conditions4,5.

However, creating accurate predictive models requires large, diverse training datasets. Today, various materials and big data platforms6,7,8 have been developed, providing researchers with

aggregated data. Nonetheless, for sensitive datasets that cannot be publicly shared, material data mining and analysis remain limited due to small sample sizes9,10. This poses a challenge

for training effective models. While data augmentation11,12 offers a potential solution, relying on simulated data can compromise model accuracy and generalization13,14. Additionally, even

with sufficient samples, standardized testing environments and methodologies can limit data diversity, further hindering model generalization for new materials. Transfer learning15 is often

used as a solution, but it involves sharing complete models with third parties, which raises concerns about data security and potential leakage. To accelerate the development of new

materials, a secure and collaborative computing methodology is essential. This approach must ensure data protection while allowing collaborative modeling across different organizations to

improve model accuracy and generalization. Federated learning (FL)16 offers a viable solution by enabling organizations to collaborate without revealing their original data, sharing only

insights from local models. This protects sensitive data while allowing effective aggregation17. However, the traditional FL framework, which relies on a central server to aggregate local

model parameters18, raises concerns about the integrity and authenticity of the global model19. This centralization also makes the server susceptible to internal and external attacks20,21.

Moreover, most existing FL solutions have primarily been validated theoretically, using publicly available datasets and focusing on classification problems22,23. This theoretical focus fails

to address the practical challenges faced by non-i.i.d. datasets owned by different organizations, where issues of model accuracy and generalization are more pronounced. The lack of

empirical validation in real-world applications further questions the practicality and feasibility of these solutions. To truly harness the potential of FL in materials science, it is

crucial to develop methodologies that not only perform well in controlled, theoretical settings but also demonstrate robustness and effectiveness in diverse, real-world environments. This

will ensure the models are reliable, secure, and capable of advancing material discovery and development. To accelerate materials science research and development, building on the Materials

Genome Engineering (MGE) project24, we developed the NMDMS platform25,26 to facilitate the collection, storage, retrieval, and computation of material data. As the cornerstone of MGE’s data

applications, the NMDMS platform provides data consumers with access to an extensive repository of material data contributed by over thirty research institutions across China. This platform also

serves as a data exchange and sharing hub for materials researchers. Although the NMDMS platform provides basic collaborative computing services, it lacks solutions for handling the

inherent limitations of FL in the context of material science. For example, while it achieves relatively high prediction accuracy for i.i.d. (independent and identically distributed)

training sets, it falls short in generalization capability for non-i.i.d. training sets and cannot ensure the confidentiality and integrity of parameters during the training process. Here,

we introduce MatSwarm as part of the NMDMS platform to address the limitations in materials science collaboration, particularly in the context of Industry 4.0, where efficient cooperation

among research institutions is crucial for accelerating novel material development. MatSwarm tackles the challenges posed by non-i.i.d. data and ensures the confidentiality and integrity of

sensitive material information through a decentralized collaborative computing framework. To the best of our knowledge, this application of the MatSwarm framework is unprecedented in the

materials field. Validated with real datasets from NMDMS, MatSwarm significantly enhances model training accuracy and generalization under heterogeneous data conditions. Additionally, by

integrating trusted execution environments (TEE) based on Intel SGX, the framework ensures secure and accurate model aggregations. Ultimately, MatSwarm not only addresses the collaborative

computing challenges but also unlocks the full potential of material data, driving innovation and meeting the demands of high-throughput computing and experimentation, thus accelerating

material discovery. A general introduction to MatSwarm is available in Supplementary Movie 1.

RESULTS

To date, the MatSwarm platform for material genome engineering (MGE) has collected over

14 million pieces of valid material data27. The platform predominantly encompasses data on special alloys, material thermodynamics/kinetics, composite functional materials, catalytic

materials, first-principles calculations, and biomedical materials. Data consumers from various fields can submit sharing tasks via the framework according to their specific needs, enabling

collaborative prediction of material properties and the development of new materials with other stakeholders. In our experiments, we utilize the prediction of perovskite formation energies

as an illustrative example to evaluate the performance of the MatSwarm framework. The following research questions (RQs) are addressed:

* RQ 1: How does MatSwarm address security issues during the collaborative computing process in the material science domain?
* RQ 2: What are the advantages of MatSwarm compared to other methodologies run within the MatSwarm framework?
* RQ 3: How do different factors affect the performance of MatSwarm, such as data distribution (non-i.i.d. vs i.i.d.), different local models and aggregation methods, and TEE?
* RQ 4: How scalable is MatSwarm in terms of its performance, including the size of the dataset, the number of features, and the number of participants?

EXPERIMENTAL SETUP

In this experiment, all services and

participants’ applications were deployed on cloud servers. A 16-core Intel Xeon (Ice Lake) Platinum 8369B processor with 32 GB RAM (16 GB as trusted RAM) was used to enable Intel® Software

Guard Extensions, allowing organizations to employ enclaves for protecting the confidentiality and integrity of their code and data. The MatSwarm framework was implemented on a consortium

blockchain based on Hyperledger Fabric, with each node initiated as a Docker container and connected to the blockchain network using Docker Swarm28. Local models and aggregation methods are

available for participants to choose from on the MatSwarm platform. The batch size was set to 128, the number of iterations was 200, and the learning rate was 0.002. In the training

objective, _γ_ and _λ_ were set to 0.5 and 1, respectively. Dataset and Model Selection. In our experiments, we illustrate our approach using the prediction of perovskite formation energies

as a case study. We utilized perovskite data from our NMDMS platform to evaluate the performance of the MatSwarm framework, selecting 4016 perovskite samples. The training set consists of

3694 samples, evenly distributed among organizations. The test set comprises 322 samples. Detailed feature engineering on the dataset is described in Supplementary Note 4. Unless specified

otherwise, the number of participants in the experiment is set to three. This experiment aimed to test the performance of MatSwarm for non-i.i.d. material data. To this end, we divided the

training dataset into non-independent and identically distributed (non-i.i.d.) and independent and identically distributed (i.i.d.) datasets for comparative testing. For the non-i.i.d.

dataset, since the label values are normally distributed, we divided the training set into three datasets with different means and variances. The distribution of label values in these

datasets is illustrated in Supplementary Fig. 10. Regarding model selection, unless otherwise specified, the local training models utilize a Multilayer Perceptron (MLP) neural network for

training, with a hidden size of 32 and three network layers. On the MatSwarm framework, the task issuer can select different local training models and aggregation methods based on the

sharing task. For joint training, all organizations’ data was combined, and model training was also conducted using the MLP neural network. Evaluated Attacks. In this scenario, four nodes

participate in the collaborative task, with one acting as a Byzantine node launching the attack. Since all attack methods target the gradients, we modify the model updates in this experiment

to gradients instead of the model parameters. The aggregation methods include the five provided by the MatSwarm framework. We evaluate the impact of different attack methods on the accuracy

of these aggregation methods both inside and outside the TEE. Given the susceptibility of existing swarm learning frameworks to data poisoning attacks29, our experiment aims to demonstrate

the robustness of MatSwarm against such attacks. We consider the following popular poisoning attacks:

* _Noise Attack_. The Byzantine nodes send noise-perturbed gradients generated by adding Gaussian noise to the honest gradients30. We set the Gaussian distribution parameters as $\mathcal{N}(0.1, 0.1)$.
* _Label-Flipping_. The Byzantine nodes flip the local sample labels during the training process to generate faulty gradients31. Specifically, a label _l_ is flipped to −_l_, where _l_ is the formation energy of perovskite in our experiment.
* _Sign-Flipping_. During each round of learning, participants calculate the gradients $\nabla f_{\Theta}$ of the local model, which are then uploaded to a central server for aggregation32. After calculating the local gradients, the Byzantine nodes flip the signs of these gradients and send $-\nabla f_{\Theta}$.
* _A Little is Enough_. The Byzantine nodes send a malicious gradient vector with crafted elements33. For each node $i \in [d]$, where $d$ is the number of Byzantine nodes, the Byzantine nodes calculate the mean ($\mu_i$) and standard deviation ($\sigma_i$) over benign updates, and set corrupted updates $\Delta_i$ to values in the range $(\mu_i - z_{\max}\sigma_i,\ \mu_i + z_{\max}\sigma_i)$, where $z_{\max}$ ranges from 0 to 1. We set $z_{\max} = 0.3$ in our experiment.
* _Inner Product Manipulation_ (IPM). The primary goal of IPM is to disrupt model performance by manipulating the inner product of gradients to affect the direction and speed of model training34. For example, an attacker could enhance or diminish the effects of gradients in a particular dimension. We set the scaling factor $\alpha = 2$, denote the gradient mean by $\overline{\nabla f_{\boldsymbol{\theta}}}$, and send the attack gradient $-\alpha\,\overline{\nabla f_{\boldsymbol{\theta}}}$.

SECURITY ANALYSIS (RQ1)

Confidentiality protection for local datasets: this framework enables collaborative computing among multiple organizations

while maintaining the confidentiality of local datasets. Traditional centralized machine learning methods require storing all datasets on a central server, posing risks of sensitive data

leakage. Through MatSwarm, each organization trains models on its local dataset without sharing the original data. Instead, organizations only share encrypted model parameters, not raw data.

This approach prevents the disclosure of sensitive information without disrupting the task processes of the participating organizations. Secure model training and aggregation based on TEE:

ensuring the security of model training processes during swarm learning is a significant challenge. To address this issue, MatSwarm employs a TEE established by Intel SGX. In this framework,

the original dataset is encrypted before being sent to the blockchain, using a shared key established through the Diffie-Hellman key exchange protocol. This ensures that data cannot be

stolen or tampered with during transmission. During model training and aggregation, the SGX Enclave performs these operations in a trusted execution environment, preventing attackers from

accessing or altering model parameters. Blockchain-based secure storage: this framework uses blockchain technology to replace untrusted third parties, significantly reducing the risk of data

leakage. Smart contracts are employed to standardize and automate model training and aggregation processes. Transactions are stored on the blockchain in hash form, and due to the uniqueness

of hash values, any tampering with transaction data will result in a change in the hash value. During the consensus process, nodes reject transactions with inconsistent hash values,

ensuring the integrity of global model storage and preventing network attacks. Additionally, digital signatures and hashes protect model updates, further enhancing the security of model

training and preventing tampering or contamination. Impact of attacks on different aggregation methods inside/outside TEE: Fig. 1a–e present test results in a non-TEE

environment. The results indicate that different aggregation methods, by design, can resist various data poisoning attacks. However, no single aggregation method can resist all types of data

poisoning attacks. The convergence speed and final model accuracy are affected to varying degrees depending on the specific attack and aggregation method used. To verify the TEE’s

resistance to data poisoning attacks, we re-tested within the TEE the aggregation methods that were ineffective in the non-TEE environment. Fig. 1f–j show that the TEE effectively resists all types of

data poisoning attacks. The convergence speed and model accuracy remain virtually unaffected, closely matching the performance observed in the absence of attacks. This demonstrates that

MatSwarm can effectively mitigate the risk of data poisoning attacks.

METHODOLOGIES COMPARISON (RQ2)

Within the MatSwarm framework, we conducted comparative experiments on prediction

accuracy and response time between MatSwarm, local independent training (referred to as “Solo”), joint data training (referred to as “Joint”), and other existing solutions, including

FedAvg35, FedProx36, Homomorphic Encryption Federated Transfer Learning (referred to as “HE-FTM”)37, and a similar framework proposed by Kalapaaking et al. (referred to as “Trust-FL”)38, to

illustrate the performance advantages of this framework. The performance comparison between MatSwarm and other methodologies is presented in Fig. 2, with model accuracy evaluated using mean squared error (MSE). MatSwarm significantly improves prediction accuracy compared to Solo while maintaining the privacy of local datasets across various

organizations. Among the methodologies compared, MatSwarm achieves prediction accuracy closest to that of Joint, which can be considered the upper bound of accuracy for collaborative

computation. In contrast, HE-FTM involves polynomial approximation for evaluating nonlinear functions, resulting in some accuracy loss during training. Trust-FL, employing a horizontal FL

model, is more suited for i.i.d. training data and is less effective on non-i.i.d. material data. In terms of prediction accuracy, our model is therefore the better fit for the material science domain. Regarding response time, as shown in Fig. 2, Solo exhibits the shortest response time for each organization. Due to the

communication required between organizations, MatSwarm takes longer to execute compared to Solo and Joint. The average response time of MatSwarm increases by approximately 4 seconds compared

to Joint. Despite this increase, the security and privacy protection offered by MatSwarm are highly valuable. Moreover, in practical applications, organizations typically do not require

real-time model training, and the response time difference remains within an acceptable range. Notably, compared to HE-FTM, our model demonstrates lower computational complexity and

significantly improved response time. Compared to Trust-FL, our framework shows a slight increase in response time, primarily due to the enhanced security measures. Model training in our

framework occurs in a trusted execution environment, adding some communication overhead. Additionally, the blockchain consensus algorithm, inspired by PBFT, effectively addresses security

concerns arising from Byzantine nodes. Although the consensus algorithm slightly impacts response time by increasing communication frequency, the trade-off is justified by the improved

security performance.

ABLATION EXPERIMENT (RQ3)

To understand how different factors affect the performance of MatSwarm, we conducted ablation experiments varying data label distribution, local model architectures, and the presence or absence of a TEE. Unless specified otherwise, all local models use the same MLP architecture, and the aggregation algorithm is Mean.

1) Non-i.i.d. vs i.i.d.: To demonstrate the performance of MatSwarm on non-i.i.d. material data, we tested both non-i.i.d. and i.i.d. datasets.

* _i.i.d. Training Sets_: Fig. 3a depicts the prediction results for perovskite formation energy using i.i.d. training sets selected independently for each organization. Our algorithm exhibits extremely high prediction accuracy for the i.i.d. dataset, nearing the accuracy of Joint.
* _Non-i.i.d. Training Sets_: As shown in Fig. 3b, the prediction accuracy for the non-i.i.d. dataset is slightly lower than for the i.i.d. dataset but still close to the accuracy of Joint. Under Solo, the accuracy for Org1 suffers because the label distribution of its data differs from that of the test set. A similar trend is observed for Org3. However, as displayed in Table 1, with MatSwarm the prediction MSE for Org1 decreased from 1.0291 to 0.2096, and for Org3 it decreased from 1.6159 to 0.5849, with the global model achieving an MSE as low as 0.0903. Despite Org2 having a similar label distribution to the test set and thus showing good prediction accuracy,

its local model prediction accuracy also improved slightly after training with MatSwarm. This demonstrates that MatSwarm has strong generalization capabilities for non-i.i.d. material data.
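The contrast between the i.i.d. and non-i.i.d. training sets can be illustrated with a minimal sketch. It assumes one simple way of building a non-i.i.d. split, namely sorting samples by label and cutting them into contiguous chunks, so that each organization sees a different label mean and variance. The function names, the synthetic Gaussian labels, and the sorting strategy are assumptions made for illustration; the exact partitioning procedure used in the experiments is not reproduced here.

```python
import random
import statistics


def split_iid(labels, n_parts, seed=0):
    """Shuffle sample indices and deal them round-robin across organizations."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    return [idx[k::n_parts] for k in range(n_parts)]


def split_non_iid(labels, n_parts):
    """Sort sample indices by label and cut them into contiguous chunks."""
    idx = sorted(range(len(labels)), key=lambda i: labels[i])
    size = len(idx) // n_parts
    parts = [idx[k * size:(k + 1) * size] for k in range(n_parts - 1)]
    parts.append(idx[(n_parts - 1) * size:])  # last chunk takes the remainder
    return parts


def label_stats(labels, parts):
    """Return (mean, standard deviation) of the labels held by each organization."""
    stats = []
    for part in parts:
        values = [labels[i] for i in part]
        stats.append((statistics.mean(values), statistics.pstdev(values)))
    return stats


# Synthetic stand-in for normally distributed label values (hypothetical data,
# sized like the 3694-sample training set and split across three organizations).
rng = random.Random(42)
labels = [rng.gauss(0.0, 1.0) for _ in range(3694)]

iid_parts = split_iid(labels, 3)
non_iid_parts = split_non_iid(labels, 3)
iid_stats = label_stats(labels, iid_parts)
non_iid_stats = label_stats(labels, non_iid_parts)

for name, stats in (("i.i.d.", iid_stats), ("non-i.i.d.", non_iid_stats)):
    print(name, [(round(m, 3), round(s, 3)) for m, s in stats])
```

In the i.i.d. split every organization holds labels with nearly the same mean and spread, whereas in the sorted split the three organizations cover the low, middle, and high ends of the label range. That mismatch with the test distribution is what degrades a model trained by one organization alone.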

2) Different local models and aggregation methods: Since MatSwarm will perform various training tasks beyond predicting perovskite formation energy, the choice of local models and

aggregation methods significantly impacts the accuracy of model training for different tasks. In this experiment, we compared the performance of MatSwarm using different local models and

aggregation methods to identify the most suitable collaborative computing scheme for predicting perovskite formation energy. The local models capable of solving regression problems include

MLP, recurrent neural network (RNN), Lasso, and long short-term memory (LSTM). The aggregation methods considered are Mean, Median, MultiKrum, CenteredClipping, and GeoMed. Ultimately, we

obtained the prediction results shown in Fig. 3c, d, and Table 1. The results indicate that using MLP within MatSwarm is the most suitable for predicting perovskite formation energy.
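To make the role of the aggregation method concrete, the following minimal sketch compares coordinate-wise Mean and Median aggregation when one of four nodes is Byzantine, mirroring the attack scenario described in the experimental setup. The attack functions follow the sign-flipping and inner product manipulation descriptions given there (with α = 2). The toy gradient values and all function names are hypothetical, and these generic implementations are not the framework's own code.

```python
import math
import statistics


def aggregate_mean(updates):
    """Coordinate-wise mean of a list of equally sized gradient vectors."""
    return [sum(col) / len(col) for col in zip(*updates)]


def aggregate_median(updates):
    """Coordinate-wise median of a list of equally sized gradient vectors."""
    return [statistics.median(col) for col in zip(*updates)]


def sign_flipping(grad):
    """Sign-flipping attack: send the negated local gradient."""
    return [-g for g in grad]


def inner_product_manipulation(honest_updates, alpha=2.0):
    """IPM attack: send the negated mean of the honest gradients, scaled by alpha."""
    return [-alpha * m for m in aggregate_mean(honest_updates)]


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Three honest nodes with similar toy gradients (hypothetical values).
honest = [
    [1.0, -2.0, 0.5],
    [1.2, -1.8, 0.4],
    [0.8, -2.2, 0.6],
]
honest_mean = aggregate_mean(honest)

# One Byzantine node joins the round with an IPM gradient.
byzantine = inner_product_manipulation(honest, alpha=2.0)
updates = honest + [byzantine]

mean_agg = aggregate_mean(updates)
median_agg = aggregate_median(updates)

print("error of Mean  :", round(l2_distance(mean_agg, honest_mean), 4))
print("error of Median:", round(l2_distance(median_agg, honest_mean), 4))
```

In this toy round the plain mean is pulled much further from the honest average than the median is. This illustrates why robustness, accuracy, and cost differ between aggregation rules, and why no single rule is expected to withstand every attack outside a TEE.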

Building on the MLP local model architecture, we tested the impact of different aggregation methods on model accuracy and response time. In terms of accuracy, Mean and CenteredClipping

achieved the highest accuracy, while Mean was the most efficient in terms of response time. Therefore, to choose a suitable aggregation method, one should balance the trade-offs among the

needs of efficiency, accuracy, and security to achieve an optimal solution. This modular development approach facilitates participants in selecting the most suitable solutions for training

tasks and simplifies platform iterations and updates to meet diverse training demands in the material science domain. 3) non-TEE vs TEE: To evaluate the impact of TEE on the accuracy and

efficiency of the MatSwarm framework, we compared the MSE and response time of MatSwarm before and after using TEE. The comparison, shown in Fig. 4, indicates that using TEE does not

significantly affect the prediction accuracy of the model, whether training is conducted individually, with MatSwarm, or on joint data. However, the use of TEE introduces some communication

overhead, leading to an increase in response time. In the materials science domain, unlike in transaction systems, there is typically no strong demand for real-time response, and large model

training often takes hours. Therefore, the increase in response time due to TEE is negligible compared to the enhancement in security it provides. The TEE-based MatSwarm fully meets the

performance requirements for model prediction in the materials science field.

SCALABILITY TESTING (RQ4)

In this experiment, we evaluated the scalability of MatSwarm by examining the impact

of different dataset sizes, the number of features, and the number of participants. It is important to note that our NMDMS platform operates within a limited number of material

organizations. Currently, the platform accommodates up to 30 registered material organizations, with typically no more than 10 participants in a sharing task. Therefore, in our experiments, we

tested the framework with a maximum of 15 participants (material organizations). 1) Dataset size: Fig. 5a illustrates the MSE and response time of MatSwarm across varying dataset sizes. The

results indicate that dataset size has a negligible impact on the response time of MatSwarm, while the model accuracy continues to improve with increasing amounts of data. Notably, even when

each organization’s dataset comprises only 30% of the original dataset, our method demonstrates high accuracy. This indicates that our approach can achieve highly accurate training models

even with small sample sizes within each organization, effectively addressing the small sample problem in the materials science domain. 2) Number of features: as shown in Fig. 5b, increasing

the number of features does not significantly affect the response time of MatSwarm, demonstrating good scalability in terms of computational efficiency. In terms of prediction accuracy,

even with sample features constituting only 30% of the total features, our method achieves an MSE value as low as 0.155, indicating high accuracy. Therefore, using MatSwarm, even if each

organization can only obtain a limited number of feature values, it is still possible to achieve highly accurate training models. This makes our approach particularly effective for scenarios

where organizations have limited data or feature availability, ensuring robust and reliable model performance. Furthermore, after reaching ~90% of all features, the addition of less

important features does not substantially impact accuracy. In practical applications, selecting an appropriate set of features is crucial for balancing accuracy and efficiency, often

involving feature extraction optimization methods39,40. 3) Number of participants: as shown in Fig. 5c, the response time of MatSwarm increases linearly with the number of participants. This

increase is primarily due to the additional time required for communication and data coordination. The observed increase in response time aligns with theoretical expectations. In terms of

accuracy, the prediction accuracy of MatSwarm shows a notable upward trend as the number of participants increases. However, after a certain threshold, the accuracy may slightly decline due

to issues such as communication delays, data inconsistencies, and model overfitting introduced by a higher number of participants. Therefore, in the participant selection process, more is

not necessarily better. This demonstrates that MatSwarm can effectively learn the data characteristics of each organization, achieving highly accurate training models without the need for a

large number of participants for collaborative training. Consequently, this approach can also enhance the efficiency of model training.

DISCUSSION

ADVANTAGES OF THE MATSWARM FRAMEWORK

Security: MatSwarm incorporates advanced security measures to ensure data confidentiality and integrity. A key component of our security strategy is the use of TEEs, specifically Intel SGX,

which protect code and data from external attacks during computation. This approach effectively mitigates poisoning attacks associated with traditional FL setups. Furthermore, our

experimental setup included various attack scenarios to test the resilience of the MatSwarm framework. These tests demonstrated that MatSwarm effectively maintains data integrity and model

accuracy, even in the face of malicious attempts to corrupt the training process. Feasibility: MatSwarm is crucial for enabling collaborative computation over non-i.i.d. material data, a

common challenge due to the diverse nature of data sources and formats in this field. Compared to independent training by organizations and other FL methodologies, our method significantly

improves prediction accuracy and generalization ability. This highlights MatSwarm’s potential to unlock the full value of material data, facilitating more informed and accurate materials

discovery and development processes. Extensive testing with real-world data from the material science domain validated the usability of the MatSwarm framework. By engaging with actual

datasets from participating institutions, we demonstrated the feasibility and accuracy of the models generated through our platform. This use of real data underscores the framework’s ability

to address the ‘data silos’ problem prevalent in materials science. Scalability: MatSwarm has been rigorously tested across multiple dimensions, including varying dataset sizes, feature

quantities, and participant counts. The results show that the model maintains high and stable predictive accuracy, demonstrating excellent scalability and practical applicability. This

consistent performance, even with smaller sample sizes and fewer features, underscores MatSwarm’s capability to adapt to a broad range of scenarios. Such robustness enhances its potential

for widespread adoption in collaborative settings that require handling complex, heterogeneous data landscapes. Additionally, the MatSwarm platform utilizes a modular architecture, allowing

participants to select appropriate local models and aggregation methods based on their training tasks. As task demands increase, we will continuously iterate and update the platform’s local

models and aggregation methods. This approach aims to address various challenges in the material science domain, including performance prediction, material classification, and structural

optimization, ultimately creating a versatile collaborative computing platform. Adaptability: MatSwarm is a secure collaborative computing framework designed for non-public data across

organizations on the NMDMS, specifically addressing key regression challenges in the materials science domain. In this paper, we demonstrate the capabilities of the MatSwarm framework by

using it to predict perovskite formation energies, selecting a perovskite dataset as our example case. Notably, our framework is suitable for general regression tasks within the

materials science domain, such as predicting the elastic properties of silicon materials and optimizing the microstructure of high-performance alloys. For each shared task, participants can

choose relevant datasets from their organization based on the task’s requirements. This ensures that the framework is not restricted to specific datasets during implementation; instead, it

dynamically adapts to select appropriate local datasets according to the specific needs of each task. Moreover, although MatSwarm is specifically designed for collaborative computing in the

materials science domain, its design principles can be leveraged by other domains with similar needs to construct their own swarm-based collaborative computing frameworks. For other domains

with similar application requirements, the framework can be adapted by modifying the objective function and selecting suitable local models and aggregation methods to fit specific needs.

Additionally, in Section 6 of the Supplementary Materials, we provide a detailed guide on how to extend and apply the MatSwarm framework to other domains.

LIMITATIONS OF THE MATSWARM FRAMEWORK

Implementation complexity: while incorporating TEEs enhances security and privacy, it also increases the complexity of system setup and operations, necessitating robust

infrastructure and specialized expertise. To mitigate this, we provide detailed platform deployment and configuration documentation in the supplementary materials, which stakeholders can use

to deploy new training tasks on this platform. Potential latency issues: the decentralized nature of blockchain and remote attestation based on TEEs can introduce delays in model training

and aggregation. However, in the field of materials science, real-time requirements for training are not stringent. The minor increase in latency is negligible compared to the benefits of

resolving the issue of data silos in material data. Hardware dependency: dependence on TEEs, such as Intel SGX, to protect data during computation may limit the applicability of our

framework in environments without such hardware support. Nevertheless, our demonstration system offers the option to choose whether to use TEEs to secure the confidentiality of the model

aggregation process. Even without TEE protection, data security during transmission is ensured through data encryption and secure communication channels. In the future, we plan to offer

additional privacy protection technologies, such as homomorphic encryption and differential privacy, to support a broader range of application scenarios. METHODS OVERALL ARCHITECTURE OF

MATSWARM In this section, we present our proposed framework, MatSwarm, designed for the secure sharing of material big data using swarm transfer learning combined with TEEs. Table 2

summarizes the critical symbols used in our framework. The organizations illustrated in Fig. 6 represent examples of entities involved in materials science. It is noteworthy that our

MatSwarm framework primarily addresses collaborative computing challenges in materials science, as illustrated in this paper by a regression problem: predicting

perovskite formation energies. Nevertheless, the framework can be extended and applied in other

fields facing analogous collaborative computing challenges. Further elaboration on this aspect can be found in Supplementary Note 5. The MatSwarm framework enables collaborative computing

tasks between material organizations. As depicted in Fig. 6, MatSwarm involves multiple organizations (denoted as _N_) collaborating to execute shared tasks. Each organization is responsible

for training its own local models. The blockchain nodes provide a distributed computing environment for the participating organizations and store aggregated models. Additionally, the

trusted execution environment ensures the secure aggregation of local model parameters and collaborates with the blockchain to generate the swarm global model. * 1. Organizations: within the

MatSwarm framework, the organizations _O_i (1 ≤ _i_ ≤ _N_) collaboratively train models to meet shared material performance prediction requirements. Initially, each organization samples

material features locally and stores the collected samples as local datasets on its own cloud server. Subsequently, organizations choose an appropriate machine

learning method to train a local model. To ensure security during model training, each organization deploys at least one blockchain node on an Intel SGX-enabled cloud server, and the local

model training is performed in SGX’s application enclave. This setup establishes encrypted and authenticated channels, allowing sensitive data to be securely transferred between the cloud

server and the Intel SGX Enclave. * 2. Blockchain Network: MatSwarm leverages the decentralized nature of blockchain to create a collaborative computing environment. Each organization joins

the blockchain network at local blockchain nodes. Within the MatSwarm framework, three transaction types are defined: retrieval, sharing, and uploading. The retrieval transaction verifies

the existence of relevant shared global models on the blockchain before a new sharing task is initiated. A sharing transaction lets an organization initiate a new task, such as material

performance prediction, with the option for other organizations to participate. The uploading transactions store the final global model on the blockchain, ensuring its integrity and

preventing tampering, thus facilitating model retrieval and usage by other organizations. * 3. Trusted Execution Environment: the TEE, implemented via Intel SGX, ensures the confidentiality

and integrity of local and global models. Each organization applies for two Application Enclaves (denoted as AE) in SGX. AE1 is used to load encrypted local datasets and execute local

models, ensuring confidentiality and integrity during execution. AE2 is used to aggregate global models. This approach ensures the integrity of model aggregation, with all organizations

automatically executing the same model aggregation code through smart contracts41 in AE2. Smart contracts automate the enforcement and management of agreed-upon processes and conditions,

ensuring consistent execution, eliminating discrepancies, enhancing security, and reducing reliance on third-party intermediaries. Additionally, the Quoting Enclave (denoted as QE) generates

the attestation REPORT R to assist in remote attestation between AEs in various organizations. PROBLEM FORMULATION We consider a MatSwarm framework constructed by _N_ (_N_ > 2)

organizations, where _K_ (_K_ ≤ _N_) organizations are in a sharing task, each possessing a local dataset DLi (1 ≤ _i_ ≤ _K_). Each organization maintains a local model

$f_{\boldsymbol{\theta}_{i}}: \mathbf{X}_{i} \to y_{\mathrm{pre}_{i}}$ with parameters $\boldsymbol{\theta}_{i}$, where $\mathbf{X}_{i}$ and

$y_{\mathrm{pre}_{i}}$ denote the input and output spaces, respectively. In our study, we assume that all organizations have the same input/output specifications and

homogeneous local model architectures. However, they may choose different local models and aggregation methods based on the sharing task. The objective is to collaboratively train the local

models to ensure that each generalizes well on the joint data distribution, thereby improving prediction accuracy for non-i.i.d. material data. To achieve this objective, we propose a swarm

transfer learning method within the MatSwarm framework. The core of our method is to identify invariances between resource-rich source domains and resource-scarce target domains,

facilitating the learning of common representation spaces and enabling knowledge transfer across domains. The objective function reveals that during the swarm transfer learning process

between organization _O_i and organization _O_i+1, local model training is interdependent, necessitating the exchange of intermediate training results. The training process adheres to a

linear cycle method, with sequential training conducted between organizations in the order [_O_1 → _O_2 → … → _O_K → _O_1]. The completion of training between organizations _O_K and _O_1

signifies the end of a local training round. After each round of local model parameter updates, the parameters are aggregated, and the updated global model parameters are sent back to each

organization for the next round of local model updates. This iterative process continues until the model converges to a specified threshold. The training objective is typically formulated as

the following optimization problem:

$$\min_{\boldsymbol{\theta}_{i,\,i+1\,\mathrm{mod}\,K}} f\left(\mathbf{X}_{i,\,i+1\,\mathrm{mod}\,K},\, y_{i\,\mathrm{mod}\,K}\right) = \sum_{i=1}^{K} \mathscr{L}_{1}\left(\mathbf{X}_{i\,\mathrm{mod}\,K},\, y_{i\,\mathrm{mod}\,K}\right) + \gamma\, \mathscr{L}_{2}\left(\mathbf{X}_{i,\,i+1\,\mathrm{mod}\,K}\right) + \frac{\lambda}{2}\left(\left\| \boldsymbol{\theta}_{i\,\mathrm{mod}\,K} \right\|^{2} + \left\| \boldsymbol{\theta}_{i+1\,\mathrm{mod}\,K} \right\|^{2}\right)$$ (1)

$$\mathscr{L}_{1}\left(\mathbf{X}_{i\,\mathrm{mod}\,K},\, y_{i\,\mathrm{mod}\,K}\right) = \left(y_{i\,\mathrm{mod}\,K} - \varphi\left(\mathbf{X}_{i\,\mathrm{mod}\,K}\right)\right)^{2}$$ (2)

$$\mathscr{L}_{2}\left(\mathbf{X}_{i,\,i+1\,\mathrm{mod}\,K}\right) = \left\| u_{i\,\mathrm{mod}\,K}\left(\mathbf{X}_{i\,\mathrm{mod}\,K}\right) - u_{i+1\,\mathrm{mod}\,K}\left(\mathbf{X}_{i+1\,\mathrm{mod}\,K}\right) \right\|_{F}^{2}$$ (3)

The loss function in Equation (1) is formulated to optimize the parameters $\boldsymbol{\theta}$. It aims to minimize the overall loss by integrating multiple components, including

$\mathscr{L}_{1}$, $\mathscr{L}_{2}$, and regularization terms.

$\mathscr{L}_{1}$: This term is the task-specific component of the loss function, capturing the discrepancy between the predicted outputs and the true labels. Specifically, $y_{i\,\mathrm{mod}\,K}$ denotes the label of organization _O_i. The form of the function _φ_, which maps inputs to predictions, depends on the nature of the sharing task, such as classification or regression, and on the chosen local model.

$\mathscr{L}_{2}$: This term acts as an alignment regularizer. As Equation (3) shows, it penalizes the distance between the representations produced by adjacent organizations _O_i and _O_i+1, which encourages a common representation space, helps prevent overfitting, and promotes model generalization. Here _u_ denotes the representation converted from the original data, and $\|\cdot\|_{F}^{2}$ refers to the square of the Frobenius norm.

_γ_: This parameter is the weight assigned to the $\mathscr{L}_{2}$ component in the overall loss function. Adjusting _γ_ controls the trade-off between fitting the training data and applying regularization.

_λ_: This parameter is the weight assigned to the regularization terms that penalize the magnitude of the parameters $\boldsymbol{\theta}_{i\,\mathrm{mod}\,K}$ and $\boldsymbol{\theta}_{i+1\,\mathrm{mod}\,K}$. It controls the strength of the regularization, helping manage the model’s complexity.

Based on the above, Equation (1) represents a combined loss function designed to optimize the parameters $\boldsymbol{\theta}$. This function integrates the task-specific loss $\mathscr{L}_{1}$, the alignment regularization term $\mathscr{L}_{2}$, and a regularization of parameter magnitudes. The components are weighted by _γ_ and _λ_

to achieve a balance between data fitting and model complexity control. To ensure the confidentiality and integrity of the local model and global model generation, MatSwarm incorporates TEE

utilizing Intel SGX. Each organization’s cloud server is enabled with Intel SGX and includes _M_(_M_ ≥ _N_) blockchain nodes within the blockchain network. Each organization can deploy

multiple blockchain nodes. To facilitate understanding, we assume that each organization has deployed a single blockchain node, denoted as BNi (_i_ ≤ _M_). The blockchain node BNi must be

deployed on an Intel SGX-enabled cloud server. To train local and global models, each organization requests the creation of two AEs. AE1 is used to load encrypted local datasets and smart

contract SC1 for training local models. AE2 is used to load encrypted local model parameters and smart contract SC2 for aggregating local model parameters. WORKING MECHANISMS The overall

working mechanism of MatSwarm includes three main stages: task submission, task execution, and task archive. Videos on the procedures and operations of MatSwarm are available as

Supplementary Movies 2 and 3. 1) Task Submission: assume that all material organizations willing to conduct joint model training have registered and stored their metadata on the blockchain.

Organization _O_1, as the task issuer among participants, initiates a retrieval transaction request to the local blockchain node with a task information digest. The local blockchain node

retrieves the blockchain history to determine whether an archived task related to the task information digest has been generated. If such a task exists, the corresponding retrieval result is

returned. If organization _O_1 does not obtain retrieval results for the archived task, it will retrieve the metadata of organizations from the blockchain. Once the task issuer identifies

the organizations to be invited, it will design the sharing task scheme, including the task description, metadata description, and the selection of local models and aggregation methods. The

task issuer subsequently initiates a sharing transaction request to organizations with relevant datasets to join the sharing task as participants. The blockchain nodes of participating

organizations become active nodes, while those of non-participating organizations remain passive. The active nodes participate in the global model consensus mechanism for the task. 2) Task

Execution: to facilitate model aggregation, it is essential to standardize the structure and format of the input datasets among participants. The task issuer should create a virtual dataset

and broadcast it to the blockchain network, enabling each participant to align their local datasets with the standardized format. Subsequently, participants can use their standardized

datasets to train their local models. The task issuer trains a local model and deploys the code into Smart Contract 1 (SC1) running in its AE1. Other participants can invoke SC1 via the

blockchain to train their local models in a similar manner, ensuring uniformity in the local model training code. After each round of local model training, remote attestation is required

between the AE1 of each participant to verify the credibility of the remote nodes and the integrity and confidentiality of the local model. Following remote attestation, encrypted local

models are shared among organizations to generate the global model. To ensure the integrity and confidentiality of the aggregation process, each organization’s blockchain node performs model

aggregation in its AE2. The steps involved are as follows: the task issuer deploys the aggregation algorithm onto the smart contract SC2 running in the AE2. Other participants invoke SC2

from the blockchain, subsequently loading the smart contract and encrypted local model sets submitted by others into their respective AE2. Each participant’s AE2 independently aggregates the

models to generate a global model. To ensure the credibility of each organization’s AE2 and the integrity of the global model, the blockchain network must receive all attestation reports

generated by each organization’s AE2. Consequently, the blockchain nodes complete remote attestation through a consensus mechanism. 3) Task Archive: during each round of training,

organizations obtain the current global model and use it to update their local model until the loss function converges to a specific threshold. However, before a credible global model is

ultimately generated, a consensus must be reached among participants. Once a consensus is achieved, the global model is stored on the blockchain to prevent tampering. Therefore, participants

must ensure that the final global model is recognized by all participants through a consensus mechanism. The model is then securely stored for future retrieval and use. LOCAL MODEL

GENERATION The initial step in local model training involves loading the encrypted local dataset. To ensure security, Intel SGX’s AE only accepts encrypted data. Therefore, before sending

the local dataset DLi to the local AE1, the blockchain node BNi must encrypt it using a symmetric encryption algorithm such as the Advanced Encryption Standard (AES)42 or the Triple Data

Encryption Standard43, represented as Er(·, Kr). Symmetric encryption and decryption between BNi and its AE1 are performed using the key Kri. This process is denoted as Er(·, Kri). The key

Kri is transmitted through a secure channel established by the Diffie-Hellman key exchange protocol44, which allows two parties to establish a shared secret over an unsecured

communication channel, providing a foundation for encrypting further communications. BNi generates an encrypted local dataset Er(DLi, Kri) and sends it to AE1. Upon receipt, AE1 uses the

key Kri to decrypt Er(DLi, Kri) (1 ≤ _i_ ≤ _M_), obtaining the plaintext DLi of the local dataset. This process can be represented as BNi∣Er(DLi, Kri) → AE1∣Dr(Er(DLi, Kri), Kri). The second

step involves local model training. Organizations deploy the local model using the smart contract SC1 to train their local datasets in AE1. For local model training, participants on our

platform can select the appropriate machine learning model based on their task requirements, including MLP45, Lasso46, RNN47, and LSTM48. As the platform evolves, it will offer a broader

range of local training models to meet diverse task requirements. In this paper, we demonstrate the local model training process using an example of predicting perovskite formation energy,

employing the MLP neural network for local model training and the stochastic gradient descent algorithm for parameter updating. As shown in Fig. 7, consider the training between organization

_O_1 and organization _O_2 as an example. Organization _O_2 calculates intermediate results and encrypts them with the public key PKIAS from the Intel Attestation Service (IAS). The

encrypted intermediate results $[\mathbf{u}_{2}^{t}]_{\mathbf{PK}_{\mathrm{IAS}}}$ and $[\boldsymbol{\theta}_{2}^{t}]_{\mathbf{PK}_{\mathrm{IAS}}}$ are sent to organization _O_1. Organization _O_1 decrypts the intermediate results using the

private key PRIAS and calculates the local model gradient $\frac{\partial \mathscr{L}_{1}^{t}}{\partial \boldsymbol{\theta}_{1}^{t}}$ and loss function $\mathscr{L}_{1}^{t}$. Similarly, organization _O_1 calculates a set of intermediate results, encrypts them with the public key PKIAS, and sends the encrypted

intermediate results $[\boldsymbol{\theta}_{1}^{t}]_{\mathbf{PK}_{\mathrm{IAS}}}$ and $[\mathbf{u}_{1}^{t}]_{\mathbf{PK}_{\mathrm{IAS}}}$ to organization _O_2, which then calculates the local model gradient $\frac{\partial \mathscr{L}_{2}^{t}}{\partial \boldsymbol{\theta}_{2}^{t}}$ and loss function $\mathscr{L}_{2}^{t}$. Both organizations update their local

model parameters $\boldsymbol{\theta}_{1}^{t+1}$ and $\boldsymbol{\theta}_{2}^{t+1}$ using the calculated local model gradients. After each

organization completes this round of local model training, the blockchain nodes of each organization perform remote attestation of all AE1 through a consensus mechanism. Subsequently,

organizations encrypt and share the updated local model parameters $\boldsymbol{\theta}_{i}^{t+1}$ with other participants to aggregate local model parameters for the

current round. GLOBAL MODEL GENERATION This section will elaborate on generating global models, covering crucial aspects such as smart contract deployment, model aggregation, remote

attestation, and consensus mechanisms. Figure 8 illustrates the process of global model generation. 1) Smart contract deployment: the task issuer _O_1 deploys the model aggregation algorithm

to the blockchain via the local blockchain node BN1 as a smart contract. Each participant can retrieve and invoke the smart contract from the blockchain. The blockchain node BNi loads the

smart contract and encrypted local model parameter set $[[\mathbf{M}_{\mathrm{L}}^{t}]]_{\mathbf{PK}_{\mathrm{IAS}}}$ into AE2 of TEEi. The parameters are then decrypted

using PRIAS to construct the plaintext of the local model parameter set $\mathbf{M}_{\mathrm{L}}^{t} = \left(\boldsymbol{\theta}_{1}^{t}, \boldsymbol{\theta}_{2}^{t}, \cdots, \boldsymbol{\theta}_{K}^{t}\right)$. The calculation of global

model parameters occurs in AE2 of TEEi to ensure the confidentiality of sensitive parameters. Smart contracts facilitate the transfer of global model parameters. 2) Model aggregation: the

global model parameters $\mathbf{M}_{\mathrm{G}}^{t+1}$ calculated in each participant’s AE2 are given by:

$$\mathbf{M}_{\mathrm{G}}^{t+1} = \sum_{i=1}^{K} \frac{|\mathrm{DL}_{i}|}{n}\, \mathbf{M}_{\mathrm{L}_{i}}^{t}, \quad n = \sum_{i=1}^{K} |\mathrm{DL}_{i}|$$ (4)

where $\mathbf{M}_{\mathrm{G}}^{t+1}$ represents the global model updated in round _t_ + 1, _K_ denotes the number of

participants, ∣DLi∣ represents the number of samples used by the _i_-th participant to train the local model, and _n_ is the total number of samples used to train all local models.
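As a minimal sketch of Equation (4), assuming each local model has been flattened into a list of floats, the sample-weighted mean could be computed as shown below. The function name `aggregate_mean`, the argument names, and the toy values are illustrative assumptions and are not taken from the MatSwarm code base.

```python
from typing import List, Sequence


def aggregate_mean(local_params: Sequence[Sequence[float]],
                   sample_counts: Sequence[int]) -> List[float]:
    """Sample-weighted mean of K local parameter vectors (Equation 4).

    local_params[i] plays the role of M_{L_i}^t and sample_counts[i]
    the role of |DL_i|; both names are hypothetical.
    """
    if len(local_params) != len(sample_counts) or not local_params:
        raise ValueError("need one sample count per local model")
    n = sum(sample_counts)  # n = sum_i |DL_i|
    if n <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(local_params[0])
    if any(len(p) != dim for p in local_params):
        raise ValueError("all local models must have the same shape")

    global_params = [0.0] * dim
    for params, count in zip(local_params, sample_counts):
        weight = count / n  # |DL_i| / n
        for j, value in enumerate(params):
            global_params[j] += weight * value
    return global_params


# Toy example: two organizations with 300 and 100 training samples.
theta_1 = [1.0, 2.0]
theta_2 = [3.0, 6.0]
m_global = aggregate_mean([theta_1, theta_2], [300, 100])
print(m_global)  # [1.5, 3.0]
```

Here the organization with 300 samples receives weight 0.75 and the one with 100 samples receives weight 0.25, so larger local datasets pull the global parameters more strongly toward their local model.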

In Equation (4), $\mathbf{M}_{\mathrm{L}_{i}}^{t}$ is the local model parameter set updated by the _i_-th participant in round _t_. Notably, parameter aggregation is

illustrated using the Mean method35 in MatSwarm, which is the most widely used approach. However, various aggregation methods are available, such as MultiKrum49, CenteredClipping50,

GeoMed51, and Median52, among others. On our platform, participants can choose different model aggregation methods based on task requirements and robustness needs. 3) Remote attestation:

during the global model generation process, remote attestation is used to verify the integrity of the global model generated by AE2. In this method, the blockchain node BNi facilitates

interaction between the AE2 of TEEi and the blockchain network, serving as both an aggregator and verifier in the attestation process. TEEi generates a REPORT structure Ri

containing the current enclave identity information Mi and other metadata through the EREPORT instruction, and computes a Message Authentication Code (MAC) tag MACi over Ri. AE2 sends Ri

and MACi to the Quoting Enclave in TEEi for mutual attestation. The Quoting Enclave calls the EGETKEY instruction to obtain the key needed to check MACi and verifies the result against Ri. After

successful mutual attestation within TEEi, the Quoting Enclave uses the private key (AKRi) of the attestation key generated by the Intel Provisioning Service’s Provisioning Seal Key to

sign Ri and create a Quote QGi = Sign(Ri, AKRi). Only the Quoting Enclave can access the key used for attestation in the Intel Provisioning Service to verify the credibility of TEEi. The QGi is

then sent through the blockchain network to the blockchain nodes of other participants for verification. Once BNi receives _K_ − 1 Quotes, it will verify each Quote using the public key AKPi

generated by the Intel Provisioning Service. The verification is completed utilizing the function _verify_(QGi, AKPi), where QGi = Sign(Ri, AKRi). Once the Quote is validated, BNi will extract the

global model $\mathbf{M}_{\mathrm{G}}^{t+1} = \left(\mathbf{M}_{\mathrm{G}_{1}}^{t+1}, \mathbf{M}_{\mathrm{G}_{2}}^{t+1}, \ldots, \mathbf{M}_{\mathrm{G}_{K}}^{t+1}\right)$

from $\mathbf{QG} = \left(\mathbf{QG}_{1}, \mathbf{QG}_{2}, \ldots, \mathbf{QG}_{K}\right)$ for subsequent consensus. 4) Global Model Consensus: at this stage, the consensus

mechanism is used to determine the global model accepted by the participants. We use the PBFT53 consensus, which can tolerate up to _f_ Byzantine faulty nodes in a network of 3_f_ + 1 nodes. As an illustration, we assume that three organizations

are participating in the shared task. The blockchain node BN1 belongs to the task issuer _O_1 and acts as the primary node; BN2 and BN3 are blockchain nodes of the other two

participants _O_2 and _O_3, which participate in the consensus mechanism as active nodes; BNj (_j_ ≤ _M_) denotes a blockchain node of an organization that is not participating in the sharing

task, referred to as a passive node. The consensus mechanism consists of five stages: request, pre-prepare, prepare, commit, and reply. Request stage: the task issuer _O_1 initiates a global

model consensus request to the deployed blockchain node BN1. Pre-prepare stage: BN1 calculates

Hash$(\mathbf{M}_{\mathrm{G}_{1}}^{t+1}, \mathbf{M}_{\mathrm{G}_{2}}^{t+1}, \ldots, \mathbf{M}_{\mathrm{G}_{K}}^{t+1})$

and broadcasts Hash$(\mathbf{M}_{\mathrm{G}_{1}}^{t+1})$ to BN2 and BN3 if the hashes of all global models are equal. Prepare stage: after receiving the

Hash$(\mathbf{M}_{\mathrm{G}_{1}}^{t+1})$ sent by BN1, BN2 and BN3 calculate the hash value of the global model

$\mathbf{M}_{\mathrm{G}_{i}}^{t+1}$ $(1 \le i \le K)$ sent by each organization. If all hash values are equal to

Hash$(\mathbf{M}_{\mathrm{G}_{1}}^{t+1})$, BN2 and BN3 broadcast Hash$(\mathbf{M}_{\mathrm{G}_{2}}^{t+1})$ and

Hash$(\mathbf{M}_{\mathrm{G}_{3}}^{t+1})$ to the other two participants, respectively. Commit stage: after receiving the calculation results from the other

nodes, all participants verify whether a consistent global model has been agreed upon by all. If consensus is achieved, they broadcast confirmation messages to the other participants. Reply

stage: the consensus request is considered complete when each participant receives confirmation messages from at least two-thirds of the nodes. A Reply message is then constructed and sent

to _O_1. Once _O_1 receives confirmation messages from more than two-thirds of the nodes, it finalizes the global model and broadcasts its hash to all active and passive nodes for storage.
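The hash comparison at the core of these stages can be sketched as follows. This is a simplified, single-process illustration and not a PBFT implementation: it omits message passing, signatures, and view changes. SHA-256, the JSON serialization, and the names `model_hash` and `reach_consensus` are assumptions made for the example, not details confirmed by the paper.

```python
import hashlib
import json
from typing import Dict, List, Optional


def model_hash(params: List[float]) -> str:
    """Hash a parameter vector via a deterministic JSON encoding (assumed format)."""
    encoded = json.dumps(params).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def reach_consensus(global_models: Dict[str, List[float]],
                    primary: str) -> Optional[str]:
    """Return the agreed hash, or None if agreement is insufficient.

    global_models maps a node name (e.g. "BN1") to the global model its
    AE2 produced. A node confirms when its own hash equals the hash
    proposed by the primary; the proposal is accepted only when more
    than two-thirds of the nodes confirm.
    """
    proposed = model_hash(global_models[primary])
    confirmations = sum(
        1 for params in global_models.values()
        if model_hash(params) == proposed
    )
    if 3 * confirmations > 2 * len(global_models):
        return proposed
    return None


# Toy example with three active nodes that all computed the same model.
honest = [1.5, 3.0]
agreed = reach_consensus({"BN1": honest, "BN2": honest, "BN3": honest}, "BN1")
print(agreed is not None)  # True

# One node reports a different model: 2 of 3 is not more than two-thirds.
split = reach_consensus({"BN1": honest, "BN2": [9.9, 9.9], "BN3": honest}, "BN1")
print(split)  # None
```

Comparing hashes only works if every AE2 produces bitwise-identical parameters, which is one reason for all participants to run the same aggregation code on the same inputs.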

The processes of local and global model generation are repeated round after round until the model converges to a threshold. In the end, the final global model is stored on the blockchain,

ensuring tamper resistance while facilitating efficient retrieval by others. REPORTING SUMMARY Further information on research design is available in the Nature Portfolio Reporting Summary

linked to this article. DATA AVAILABILITY All datasets used are publicly available at https://github.com/SICC-Group/MatSwarm.git and Zenodo54. All data supporting the findings described in

this manuscript are available in the article and in the Supplementary Information and from the corresponding author upon request. Source data are provided with this paper. CODE AVAILABILITY

The codes are available in open source at https://github.com/SICC-Group/MatSwarm.git and Zenodo54.

REFERENCES

* Liu, C. et al. A transfer learning cnn-lstm network-based production progress prediction approach in iiot-enabled manufacturing. _Int. J. Prod. Res._ 61, 4045–4068 (2023).
* Chaudry, U. M., Hamad, K. & Abuhmed, T. Machine learning-aided design of aluminum alloys with high performance. _Mater. Today Commun._ 26, 897 (2021).
* Malik, P. K. et al. Industrial internet of things and its applications in industry 4.0. _Comput. Commun._ 166, 125–139 (2021).
* Damewood, J. et al. Representations of materials for machine learning. _Annu. Rev. Mater. Res._ 53, 399–426 (2023).
* Stergiou, K. et al. Enhancing property prediction and process optimization in building materials through machine learning: a review. _Comput. Mater. Sci._ 220, 031 (2023).
* Aflow - Automatic FLOW for materials discovery. https://aflowlib.org/ (2024).
* Crystallography open database. http://www.crystallography.net/cod/ (2024).
* Materials data repository home. https://materialsdata.nist.gov/ (2024).
* Morgan, D. & Jacobs, R. Opportunities and challenges for machine learning in materials science. _Annu. Rev. Mater. Res._ 50, 71–103 (2020).
* Xu, P., Ji, X., Li, M. & Lu, W. Small data machine learning in materials science. _npj Comput. Mater._ 9, 42 (2023).
* Kim, Y. et al. Deep learning framework for material design space exploration using active transfer learning and data augmentation. _npj Comput. Mater._ 7, 140 (2021).
* Jain, S., Seth, G., Paruthi, A., Soni, U. & Kumar, G. Synthetic data augmentation for surface defect detection and classification using deep learning. _J. Intell. Manuf._ 33, 1007–1020 (2022).
* Hnewa, M. & Radha, H. Object detection under rainy conditions for autonomous vehicles: a review of state-of-the-art and emerging techniques. _IEEE Signal Process. Mag._ 38, 53–67 (2020).
* Wen, Y., Tran, D., Izmailov, P. & Wilson, A. G. Combining ensembles and data augmentation can harm your calibration. _In_: International Conference on Learning Representations https://arxiv.org/abs/2010.09875 (2021).
* Lejeune, E. & Zhao, B. Exploring the potential of transfer learning for metamodels of heterogeneous material deformation. _J. Mech. Behav. Biomed. Mater._ 117, 104276 (2021).
* Zhang, C. et al. A survey on federated learning. _Knowl. Based Syst._ 216, 106775 (2021).
* Mothukuri, V. et al. A survey on security and privacy of federated learning. _Future Gener. Comput. Syst._ 115, 619–640 (2021).
* Kairouz, P. et al. Advances and open problems in federated learning. _Found. Trends Mach. Learn._ 14, 1–210 (2021).
* Zhang, J. et al. Security and privacy threats to federated learning: Issues, methods, and challenges. _Secur. Commun. Netw._ 2022 (2022).
* Tolpegin, V., Truex, S., Gursoy, M. E. & Liu, L. Data poisoning attacks against federated learning systems. _In_: Computer Security–ESORICS 2020: 25th European Symposium on Research in Computer Security, pp. 480–501 (2020).
* Xiao, X., Tang, Z., Li, C., Xiao, B. & Li, K. Sca: sybil-based collusion attacks of iiot data poisoning in federated learning. _IEEE Trans. Ind. Inform._ 19, 2608–2618 (2022).
* Bakopoulou, E., Tillman, B. & Markopoulou, A. Fedpacket: a federated learning approach to mobile packet classification. _IEEE Trans. Mob. Comput._ 21, 3609–3628 (2021).
* Wang, B., Li, A., Pang, M., Li, H. & Chen, Y. Graphfl: a federated learning framework for semi-supervised node classification on graphs. _In_: 2022 IEEE International Conference on Data Mining (ICDM), pp. 498–507 (2022).
* Xie, J., Su, Y., Zhang, D. & Feng, Q. A vision of materials genome engineering in china. _Engineering_ 10, 10–12 (2022).
* Wang, R. et al. A secured big-data sharing platform for materials genome engineering: state-of-the-art, challenges and architecture. _Future Gener. Comput. Syst._ 142, 59–74 (2023).
* Wang, R., Xu, C., Ye, F., Tang, S. & Zhang, X. S-mbda: a blockchain-based architecture for secure storage and sharing of material big-data. _IEEE Internet Things J._ 11, 15 (2024).
* Liu, S. et al. An infrastructure with user-centered presentation data model for integrated management of materials data and services. _npj Comput. Mater._ 7, 88 (2021).
* Ileana, M., Oproiu, M. I. & Marian, C. V. Using docker swarm to improve performance in distributed web systems. _In_: International Conference on Development and Application Systems (DAS), pp. 1–6 (2024).
* Jere, M. S., Farnan, T. & Koushanfar, F. A taxonomy of attacks on federated learning. _IEEE Secur. Priv._ 19, 20–28 (2020).
* Romano, Y., Aberdam, A., Sulam, J. & Elad, M. Adversarial noise attacks of deep learning architectures: stability analysis via sparse-modeled signals. _J. Math. Imaging Vis._ 62, 313–327 (2020).
* Fang, M., Cao, X., Jia, J. & Gong, N. Local model poisoning attacks to byzantine-robust federated learning. _In_: 29th USENIX Security Symposium (USENIX Security 20), pp. 1605–1622 (2020).
* Li, L., Xu, W., Chen, T., Giannakis, G. B. & Ling, Q. Rsa: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets. _Proc. AAAI Conf. Artif. Intell._ 33, 1544–1551 (2019).
* Baruch, G., Baruch, M. & Goldberg, Y. A little is enough: circumventing defenses for distributed learning. _Adv. Neural Inf. Process. Syst._ 32 (2019).
* Xie, C., Koyejo, O. & Gupta, I. Fall of empires: breaking byzantine-tolerant SGD by inner product manipulation. https://arxiv.org/abs/1903.03936 (2020).
* Li, X., Huang, K., Yang, W., Wang, S. & Zhang, Z. On the convergence of fedavg on non-iid data. _In_: International Conference on Learning Representations, https://openreview.net/forum?id=HJxNAnVtDS (2020).
* Li, T. et al. Federated optimization in heterogeneous networks. _Proc. Mach. Learn. Syst._ 2, 429–450 (2020).
* Liu, Y., Kang, Y., Xing, C., Chen, T. & Yang, Q. Secure federated transfer learning. _IEEE Intell. Syst._ 35, 70–82 (2020).
* Kalapaaking, A. P. et al. Blockchain-based federated learning with secure aggregation in trusted execution environment for internet-of-things. _IEEE Trans. Ind. Inform._ 19, 1703–1714 (2022).
* Chowdhury, S., Mayilvahanan, P. & Govindaraj, R. Optimal feature extraction and classification-oriented medical insurance prediction model: machine learning integrated with the internet of things. _Int. J. Comput. Appl._ 44, 278–290 (2022).
* Fatani, A., Dahou, A., Al-Qaness, M. A., Lu, S. & Abd Elaziz, M. Advanced feature extraction and selection approach using deep learning and aquila optimizer for iot intrusion

detection system. _Sensors_ 22, 140 (2022). Article ADS Google Scholar * Hewa, T., Ylianttila, M. & Liyanage, M. Survey on blockchain based smart contracts: applications,

opportunities and challenges. _J. Netw. Comput. Appl._ 177, 102,857 (2021). Article Google Scholar * Daemen, J. & Rijmen, V. Reijndael: the advanced encryption standard. _Dobb’s. J._

26, 137–139 (2001). Google Scholar * Barker, E., Mouha, N. Recommendation for the triple data encryption algorithm (tdea) block cipher. Technical report, National Institute of Standards and

Technology (2017). * Naresh, V., Sivaranjani, R. & Murthy, N. Provable secure lightweight multiple shared key agreement based on hyper elliptic curve diffie-hellman for wireless sensor

networks. _Int. J. Crit. Infrastruct. Prot._ 28, 100,371 (2020). Google Scholar * Trzepieciński, T. & Lemu, H. G. Improving prediction of springback in sheet metal forming using

multilayer perceptron-based genetic algorithm. _Materials_ 13, 3129 (2020). Article ADS PubMed PubMed Central Google Scholar * Maulud, D. & Abdulazeez, A. M. A review on linear

regression comprehensive in machine learning. _J. Appl. Sci. Technol. Trends_ 1, 140–147 (2020). Article Google Scholar * Wu, L. et al. A recurrent neural network-accelerated multi-scale

model for elasto-plastic heterogeneous materials subjected to random cyclic and non-proportional loading paths. _Comput. Methods Appl. Mech. Eng._ 369, 113,234 (2020). Article MathSciNet

Google Scholar * Meng, H., Geng, M. & Han, T. Long short-term memory network with bayesian optimization for health prognostics of lithium-ion batteries based on partial incremental

capacity analysis. _Reliab. Eng. Syst. Saf._ 236, 109,288 (2023). Article Google Scholar * Blanchard, P., El Mhamdi, E.M., Guerraoui, R., Stainer, J., Machine learning with adversaries:

byzantine tolerant gradient descent. _In_: International Conference on Neural Information Processing Systems p. 118–128 (2017). * Karimireddy, S.P., He, L., Jaggi, M., Learning from history

for byzantine robust optimization. _In_: International Conference on Machine Learning, pp. 5311–5319 (2021). * Chen, Y., Su, L. & Xu, J. Distributed statistical machine learning in

adversarial settings: Byzantine gradient descent. _Proc. ACM Meas. Anal. Comput. Syst._ 1, 1–25 (2017). CAS Google Scholar * Yin, D., Chen, Y., Kannan, R., Bartlett, P., Byzantine-robust

distributed learning: towards optimal statistical rates. _In_: International Conference on Machine Learning, pp. 5650–5659 (2018). * Zhang, G. et al. Reaching consensus in the byzantine

empire: a comprehensive review of BFT consensus algorithms. _ACM Comput. Surv._ 56, 1–41 (2024). Article Google Scholar * Wang, R. et al. Matswarm: trusted swarm transfer learning driven

materials computation for secure big data sharing, https://zenodo.org/records/13622509 (2024). Download references ACKNOWLEDGEMENTS This work is supported in part by the National Key

Research and Development Program of China under Grant 2021YFB3702403, and in part by the National Natural Science Foundation of China under Grant 62101029. R.W. has been supported by the

China Scholarship Council Award under Grant 202306460078. C.X. has been supported in part by the China Scholarship Council Award under Grant 202006465043.

AUTHOR INFORMATION

AUTHORS AND AFFILIATIONS

* School of Computer and Communication Engineering, University of Science and Technology Beijing, 100083, Beijing, China: Ran Wang, Cheng Xu, Fangwen Ye, Yusen Tang, Sisui Tang, Hangning Zhang, Wendi Du & Xiaotong Zhang

* Beijing Advanced Innovation Center for Materials Genome Engineering, University of Science and Technology Beijing, 100083, Beijing, China: Ran Wang, Cheng Xu & Xiaotong Zhang

* College of Computing and Data Science, Nanyang Technological University, 639798, Singapore, Singapore: Ran Wang & Shuhao Zhang

* Shunde Innovation School, University of Science and Technology Beijing, 528399, Guangdong, China: Cheng Xu & Xiaotong Zhang

CONTRIBUTIONS R.W. and C.X. conceived this project. C.X. and X.Z. funded and supervised the research. R.W. and F.Y. implemented the

algorithm, performed the experiments, and prepared the plots. Y.T., S.T., H.Z., and W.D. implemented the open-source prototype. R.W. and C.X. analyzed the results and drafted the main text.

C.X., S.Z. and X.Z. revised the manuscript. All authors commented on the manuscript.

CORRESPONDING AUTHORS Correspondence to Cheng Xu or Xiaotong Zhang.

ETHICS DECLARATIONS

COMPETING INTERESTS The authors declare no competing interests.

PEER REVIEW

PEER REVIEW INFORMATION _Nature Communications_ thanks Ernestina Mensalvas and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

ADDITIONAL INFORMATION

PUBLISHER’S NOTE Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

SUPPLEMENTARY INFORMATION

* Supplementary Information
* Peer Review File
* Description of Additional Supplementary Files
* Supplementary Movie 1
* Supplementary Movie 2
* Supplementary Movie 3
* Reporting Summary

SOURCE DATA

* Source Data

RIGHTS AND PERMISSIONS

OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0

International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the

source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative

Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by

statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

http://creativecommons.org/licenses/by/4.0/.

ABOUT THIS ARTICLE

CITE THIS ARTICLE Wang, R., Xu, C., Zhang, S. _et al._ MatSwarm: trusted swarm transfer learning driven materials computation for secure big data sharing. _Nat Commun_ 15, 9290 (2024). https://doi.org/10.1038/s41467-024-53431-x

* Received: 11 November 2023
* Accepted: 07 October 2024
* Published: 28 October 2024
* DOI: https://doi.org/10.1038/s41467-024-53431-x
