In this section, we experimentally validate the fusion performance of S3IMFusion on two datasets, covering three tasks: CT and MRI image fusion, SPECT and MRI image fusion, and infrared and visible image fusion.
Datasets and training details
In this paper, two datasets are utilized. The first is a publicly available multi-modal medical image dataset sourced from the Harvard database, which contains 350 pairs of CT/SPECT and MRI images, each with a resolution of 256 \(\times \) 256. This dataset is widely used in medical image fusion research and provides an effective benchmark for evaluating the performance of fusion models. The second dataset, RoadScene40, is employed for the task of infrared and visible image fusion. It consists primarily of pairs of infrared and visible images depicting various scenes, including streets, pedestrians, vehicles, and buildings.
The experiments are implemented with the PyTorch framework on an NVIDIA GeForce RTX 3090 GPU. During training, model parameters are updated with the Adam optimizer using a learning rate of 0.001, a batch size of 16, and 100 epochs. The hyperparameters in the loss function are set as \(\gamma _1\) = 10, \(\gamma _2\) = 5, and \(\gamma _3\) = 1.
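For orientation, a minimal PyTorch sketch of this training configuration is given below. The network, composite loss, and dataset names (`S3IMFusionNet`, `fusion_loss`, `train_set`) are placeholders introduced here for illustration only, not the released implementation.

```python
import torch
from torch.utils.data import DataLoader

# Placeholders (not from the paper's code): S3IMFusionNet is the fusion
# network, fusion_loss the composite loss L_total, train_set the paired
# training dataset.
model = S3IMFusionNet().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # learning rate 0.001
loader = DataLoader(train_set, batch_size=16, shuffle=True)  # batch size 16
gamma1, gamma2, gamma3 = 10.0, 5.0, 1.0                      # loss weights from the paper

for epoch in range(100):                                     # 100 epochs
    for src_a, src_b in loader:
        src_a, src_b = src_a.cuda(), src_b.cuda()
        fused = model(src_a, src_b)
        loss = fusion_loss(fused, src_a, src_b, gamma1, gamma2, gamma3)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```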
Comparison methods and evaluation metrics
In this section, we evaluate the fusion performance of the proposed S3IMFusion by comparing it with eight state-of-the-art methods: EMFusion41, IFCNN25, MATR33, MUFusion42, U2Fusion24, DDcGAN43, EMMA44, and INet45. IFCNN and MUFusion represent medical image fusion methods that rely exclusively on convolutional neural networks (CNNs). U2Fusion is a versatile image fusion approach that performs well not only in medical image fusion but also in infrared-visible, multi-focus, and multi-exposure fusion. MATR combines CNN and transformer architectures for image fusion, while DDcGAN leverages generative adversarial networks (GANs) to perform image fusion. EMMA is a self-supervised fusion method that incorporates prior knowledge of the principles of optical imaging. INet is a medical image fusion method that combines the discrete wavelet transform with invertible networks.
To thoroughly evaluate the fusion performance of S3IMFusion, we employ eight widely recognized image quality assessment metrics: entropy (EN)46, average gradient (AG)47, mutual information (MI)48, structural similarity index (SSIM)49, peak signal-to-noise ratio (PSNR)50, Qabf51, sum of the correlations of differences (SCD)52, and spatial frequency (SF)53. EN quantifies the information content within an image, providing insights into its richness. AG measures the average local pixel value variations and is commonly used to assess texture and detail preservation. MI evaluates the capacity of fusion methods to retain original information, with higher values indicating better information preservation. SSIM offers a holistic assessment by evaluating brightness, contrast, and structural similarity between images. PSNR quantifies the signal-to-noise ratio between the original and fused images, offering a measure of image fidelity. Qabf measures how much edge information is transferred from the source images to the fused image, reflecting the fidelity of spatial detail. SCD correlates the difference images between the fused result and each source with the complementary source image, providing an assessment of how much complementary information is transferred. SF reflects the retention of fine image details, such as texture and edges, and assesses the ability of the fusion model to preserve these features. By employing these metrics, a comprehensive and objective evaluation of the fusion performance of S3IMFusion is achieved.
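To make the reference-free metrics concrete, the sketch below shows one common formulation of EN, AG, and SF in NumPy. Exact definitions vary slightly between papers, so this is illustrative rather than the evaluation code used here.

```python
import numpy as np

def entropy(img):
    """EN: Shannon entropy of the grey-level histogram (img assumed in [0, 255])."""
    hist, _ = np.histogram(img.ravel(), bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def average_gradient(img):
    """AG: mean magnitude of local intensity changes (texture/detail richness)."""
    img = img.astype(np.float64)
    gx = img[:, 1:] - img[:, :-1]      # horizontal differences
    gy = img[1:, :] - img[:-1, :]      # vertical differences
    gx, gy = gx[:-1, :], gy[:, :-1]    # crop to a common size
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))

def spatial_frequency(img):
    """SF: combines row and column frequency of the fused image."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean((img[:, 1:] - img[:, :-1]) ** 2))  # row frequency
    cf = np.sqrt(np.mean((img[1:, :] - img[:-1, :]) ** 2))  # column frequency
    return np.sqrt(rf ** 2 + cf ** 2)
```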
CT and MRI image fusion
The experimental results of our proposed S3IMFusion on the Harvard dataset are shown in Fig. 3.

Results of CT and MRI image fusion on the Harvard dataset.
To provide a more detailed and illustrative evaluation, we select two local regions (indicated by red-boxed areas) for zoomed-in visual comparisons of the fused images. Each of the compared methods exhibits distinct strengths and limitations. The DDcGAN method enhances the brightness of fused images; however, it introduces significant artifacts, leading to pronounced blurring and reduced structural integrity. EMFusion effectively integrates salient features from CT images into the fused outputs, though at the cost of losing texture details from MRI images. A similar compromise is observed in the IFCNN method. MATR demonstrates a notable ability to combine detailed texture information and salient features, yet suffers from visible blurring and reduced brightness in the fused images, particularly in the CT-MRI fusion context. MUFusion introduces undesirable noise artifacts, severely compromising the visual quality of the fusion results. U2Fusion excels at incorporating intricate details from MRI images into the fused outputs but neglects critical complementary information from CT images, resulting in a loss of balance between modalities. EMMA effectively preserves the salient features of the original image; however, the fusion result suffers from a lack of fine-grained detail, leading to insufficient representation of intricate information. INet achieves a more complete preservation of the mutual information from the original image in the fused output, owing to the reversibility of the network, which effectively mitigates information loss. Nevertheless, this approach is plagued by the issue of color distortion. In contrast, the proposed S3IMFusion method demonstrates superior performance by effectively preserving salient features from CT images while maintaining the intricate texture details from MRI images. Moreover, it achieves an optimal balance between brightness and detail preservation, resulting in fused images with enhanced visual clarity and overall quality. This capability underscores the robustness and effectiveness of S3IMFusion in handling multi-modal medical image fusion tasks.
Table 1 presents the evaluation results derived from the eight metrics mentioned earlier. This evaluation is conducted using 21 pairs of CT and MRI images. For each metric, the final score is calculated by averaging the assessment scores of the 21 test samples. From Table 1, it can be seen that the proposed S3IMFusion method performs well on EN, AG, and MI. INet achieves excellent results on the SSIM, Qabf, and SCD metrics, owing to the information-lossless extraction capability of the invertible network, which allows it to retain more structural information in the image. S3IMFusion also achieves second-best results on metrics such as PSNR and SSIM. Both U2Fusion and MUFusion demonstrate superior performance in terms of the PSNR and SF metrics. The comprehensive analysis underscores the stability of S3IMFusion in producing fused images and its capability to achieve higher-quality outputs by effectively integrating both global and local features from the source images.
SPECT and MRI image fusion
When fusing SPECT and MRI images, the SPECT image is initially transformed from the RGB color space to the YUV color space. In this representation, the U and V channels capture the chromaticity information of the image, while the Y channel encapsulates the luminance information. To leverage the luminance details for fusion, the Y-channel features are directly utilized in combination with the MRI image to generate the grayscale fusion result. Subsequently, the RGB fusion result is reconstructed by reintegrating the chromaticity information preserved in the U and V channels. The detailed workflow of this process is illustrated in Fig. 4.

SPECT and MRI image fusion process.
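A minimal sketch of this colour-space handling, assuming OpenCV and a grayscale fusion callable `fuse_gray` (a placeholder, not the actual network interface), is given below.

```python
import cv2

def fuse_color(spect_rgb, mri_gray, fuse_gray):
    """Fuse an RGB SPECT image with a grayscale MRI image via the Y channel.

    `fuse_gray` is a placeholder callable that fuses two single-channel
    uint8 images and returns a single-channel uint8 result.
    """
    yuv = cv2.cvtColor(spect_rgb, cv2.COLOR_RGB2YUV)   # separate luminance from chrominance
    y, u, v = cv2.split(yuv)
    fused_y = fuse_gray(y, mri_gray)                   # grayscale fusion on the Y channel
    fused_yuv = cv2.merge([fused_y, u, v])             # reattach the original U/V chrominance
    return cv2.cvtColor(fused_yuv, cv2.COLOR_YUV2RGB)  # back to RGB
```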
Similarly, we conduct experiments on the Harvard dataset; the comparison results are shown in Fig. 5, where local regions of the fusion results are zoomed in and marked with green and red rectangular boxes for comparison.

Results of SPECT and MRI image fusion.
As illustrated in Fig. 5, for images containing rich and intricate features, the existing methods fail to achieve a satisfactory fusion of SPECT and MRI images. The EMFusion method effectively preserves texture details from MRI images; however, it tends to lose critical structural information, particularly in organ structures such as the human eye. In contrast, the DDcGAN method excels at fusing contour information from both modalities but compromises the preservation of texture details from MRI images, thus negatively impacting the overall clarity of the fused image. Additionally, significant color distortion occurs when fusing the chromaticity information from the SPECT images. The fused images generated by IFCNN appear excessively smoothed, lacking adequate preservation of texture details from the source images. The MATR method, while successful in fusing detailed texture and salient features, suffers from over-fusion, retaining excessive chromaticity information, and neglecting important texture features from MRI images. MUFusion struggles to harmoniously integrate complementary information, resulting in fused images with low clarity. Similarly, while U2Fusion manages to retain complementary information, it introduces artifacts that degrade the overall image quality. EMMA effectively preserves contour and target information within an image; however, it is less effective at retaining edge intensity information, as exemplified by the region of the eyeball highlighted in the green box of the third result. This leads to blurring in the fused image. In contrast, INet excels at preserving detailed texture information and produces high-definition fusion results. Nevertheless, it tends to lose some intensity information, as indicated by the red rectangular box in the first fusion result, which results in the loss of edge features. In contrast, our proposed S3IMFusion method effectively preserves complementary information from both modalities, seamlessly integrating salient features from SPECT images with texture information from MRI images. Moreover, S3IMFusion generates fused images with superior clarity, retaining more chromaticity information and texture details compared to existing methods. To further assess the performance, we conduct objective evaluations of the fused images using the eight metrics previously mentioned.
As shown in Table 2, the quantitative evaluation results for S3IMFusion demonstrate its superior performance across five metrics: EN, AG, MI, SSIM, and SF. EMMA achieves optimal performance in terms of the PSNR and Qabf metrics, owing to a training process that is aligned with the principles of optical imaging; this imaging prior yields fusion outputs that are clearer and richer in detail. The superior performance of INet on the SCD metric can be attributed to its multichannel lossless feature extraction, which enhances the consistency of the fusion results. These results align with the findings in Table 1, further highlighting the ability of S3IMFusion to maintain exceptional fusion quality even when dealing with more complex image features, and underscoring its enhanced generalization capability in comparison to other fusion methods.
Analysis of loss function
To evaluate the efficacy of the global similarity loss and random region pixel intensity loss functions proposed in this study, we conduct an ablation experiment on the loss functions. In this experiment, the proposed loss functions are replaced with the traditional structural similarity loss and pixel intensity loss functions, while all other conditions are kept consistent. This isolates the specific impact of the proposed loss functions on overall performance. The resulting ablation loss \(L_{ab}\) is defined in Eq. (9).
$$\begin{aligned} L_{ab}=\gamma _1(1-SSIM)+\gamma _{2}L_{grad}+\gamma _{3}L_{smooth}, \end{aligned}$$
(9)
where \(SSIM\), \(L_{grad}\) and \(L_{smooth}\) denote the structural similarity index, the gradient loss and the smoothing loss, respectively, and \(\gamma _1\), \(\gamma _2\) and \(\gamma _3\) are the corresponding weighting parameters.
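As a concrete reference, the sketch below implements a loss with the structure of Eq. (9) in PyTorch. The exact forms of \(L_{grad}\) and \(L_{smooth}\) used in the paper are not reproduced here: the gradient term follows the common practice of matching the fused gradients to the element-wise maximum of the source gradients, the smoothing term is a total-variation penalty, and the SSIM term uses a simplified box-window SSIM, so this is an assumption-laden illustration rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def _ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM with a 3x3 box window (inputs assumed in [0, 1])."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

def _grads(img):
    """Horizontal and vertical forward differences."""
    return img[..., :, 1:] - img[..., :, :-1], img[..., 1:, :] - img[..., :-1, :]

def gradient_loss(fused, a, b):
    """Match fused gradients to the element-wise max of the source gradients."""
    fdx, fdy = _grads(fused)
    adx, ady = _grads(a)
    bdx, bdy = _grads(b)
    tx = torch.maximum(adx.abs(), bdx.abs())
    ty = torch.maximum(ady.abs(), bdy.abs())
    return (fdx.abs() - tx).abs().mean() + (fdy.abs() - ty).abs().mean()

def smooth_loss(fused):
    """Total-variation penalty on the fused image."""
    dx, dy = _grads(fused)
    return dx.abs().mean() + dy.abs().mean()

def l_ab(fused, a, b, gamma1=10.0, gamma2=5.0, gamma3=1.0):
    """Ablation loss with the structure of Eq. (9); the SSIM term averages over both sources."""
    ssim_term = 1.0 - 0.5 * (_ssim(fused, a) + _ssim(fused, b))
    return gamma1 * ssim_term + gamma2 * gradient_loss(fused, a, b) + gamma3 * smooth_loss(fused)
```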

Results of the loss function ablation experiments.
The experimental results are presented in Fig. 6. The fusion network trained with the conventional loss \(L_{ab}\) fails to adequately preserve global complementary information during the fusion process, as evidenced by the blurred texture details and poorly preserved salient features in the fused images. In contrast, the network guided by the proposed loss function demonstrates a significant improvement: it effectively integrates complementary features from the source images, resulting in a fused image with sharper definition and richer texture details. The results across the eight evaluation metrics, shown in Table 3, further corroborate these findings. From a comprehensive perspective, S3IMFusion trained with \(L_{total}\) exhibits substantial advantages in the visual perception indices. When \(L_{total}\) is replaced with \(L_{ab}\), a notable decline is observed in the indices related to both image features and image structure, indicating that \(L_{total}\) plays a crucial role in enhancing edge information and preserving fine texture details in the fused image.
Extension to infrared and visible image fusion
In general, RGB camera imaging offers the advantages of rich texture and high clarity. However, in extreme weather conditions or low-light environments, a single RGB camera struggles to effectively capture the external world. In contrast, infrared cameras leverage thermal radiation to image objects, offering superior stability and reliability under challenging conditions. Therefore, fusing infrared and visible images can capitalize on the complementary strengths of both camera types, resulting in fused images of higher quality. Infrared and visible image fusion has thus emerged as a crucial subfield within multi-modal image fusion. In this work, we extend the proposed S3IMFusion framework to infrared and visible image fusion, evaluating the generalizability of the algorithm through experiments conducted on the RoadScene dataset. Consistent with the fusion of SPECT and MRI images, we first convert the visible image from the RGB color space to the YUV color space, fuse its Y channel with the infrared image, and finally transform the fusion result back to the RGB color space to reconstruct the fused image. The experimental results are shown in Fig. 7.

Results of visible and infrared image fusion.
We compare the experimental results with six existing methods. Among them, CDDFuse produces fusion results that are closest to our S3IMFusion; however, its performance suffers a reduction in clarity when fusing images with richer edge information, as seen in the region highlighted by the red rectangular box in the third set of images in Fig. 7. DATFuse and U2Fusion fail to adequately preserve the detailed texture information from the input images, resulting in blurred fusion outputs. Although DDcGAN performs well in fusing prominent features such as pedestrians, it suffers from significant color distortion and blurring, leading to substantial information loss. IFCNN and SwinFuse experience feature loss when fusing images with weak texture features, such as the streetlights marked by the green rectangular boxes in the fourth set of images. In contrast, our proposed S3IMFusion effectively addresses the limitations observed in the aforementioned methods. It successfully retains the rich texture information from the visible image while preserving the salient features from the infrared image. When confronted with targets exhibiting distinct edge distributions, such as streetlights and buildings, S3IMFusion produces clear and precise fusion results, avoiding color distortions and maintaining high image clarity.
The quantitative evaluation results are presented in Table 4, which demonstrates that the performance of S3IMFusion across the evaluation metrics aligns well with the visual results in Fig. 7. Notably, S3IMFusion achieves optimal performance on the EN, AG, MI, PSNR, Qabf, and SF metrics. Both the subjective visual assessment and the objective quantitative metrics indicate that S3IMFusion performs exceptionally well and extends effectively to the infrared and visible image fusion task.