Dataset and evaluation metrics
This study evaluates the proposed method using images selected from the MegaDepth dataset40, specifically its MegaDepth_v1 subset. The original dataset comprises approximately 1 million internet-sourced images spanning 196 outdoor scenes. For this experiment, 15 scenes (scene numbers 303 to 377) were selected, encompassing over 20,000 images and their corresponding depth maps, with a total data size of approximately 30 GB. These images cover a range of scenarios, from simple to moderate to challenging, characterized by variations in illumination, scale, and viewpoint.
The experiments were conducted on the traditional SIFT method with K-nearest neighbor matching (\(K=2\)), the LoFTR model, and the improved MSpLoFTR and MSpGLoFTR models. To avoid training bias caused by a high initial learning rate, a warm-up strategy was adopted in which the learning rate was gradually increased during the initial training phase, after which training continued with the AdamW optimizer. All models were trained end-to-end from random weight initialization, with the following hyperparameters: \(N_c = 4\), \(N_r = 1\), \(\theta _c = 0.2\), and window size \(w = 5\). The experiments were run on 15 cores of an AMD EPYC 7543 32-core processor with 30 GB of memory and a single RTX A5000 GPU with 24 GB of memory. To comprehensively evaluate the performance of the models, the following three metrics were employed:
1. AUC (area under the curve) for relative pose estimation, which reflects estimation accuracy by computing the area under the cumulative pose-error curve up to a given angular error threshold.
2. Inference time, including the total time for model inference, performance evaluation, and result storage.
3. Matching precision, defined as the percentage of correct matches over the total number of matches.
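For concreteness, the pose-estimation AUC can be obtained from the per-pair angular pose errors as the normalised area under the cumulative error curve at each threshold. The sketch below follows the evaluation recipe commonly used with LoFTR-style pipelines; the function name and threshold values are illustrative, not taken from our released code.

```python
import numpy as np

def pose_auc(errors, thresholds=(5, 10, 20)):
    """AUC of the cumulative pose-error curve at each angular threshold (degrees)."""
    errors = np.sort(np.asarray(errors, dtype=float))      # per-pair pose errors
    recall = (np.arange(len(errors)) + 1) / len(errors)    # cumulative fraction of pairs solved
    errors = np.concatenate(([0.0], errors))                # start the curve at (0, 0)
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for thr in thresholds:
        idx = np.searchsorted(errors, thr)
        x = np.concatenate((errors[:idx], [thr]))            # clip the curve at the threshold
        y = np.concatenate((recall[:idx], [recall[idx - 1]]))
        aucs.append(np.trapz(y, x=x) / thr)                  # normalise by the threshold
    return aucs
```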
Results and analysis
This study compares the performance of traditional SIFT combined with K-nearest neighbor matching (K = 2), the LoFTR model, the MSpLoFTR model, and the MSpGLoFTR model across different scenarios. The test images are categorized into five typical scenarios: simple scenes (minimal variations in illumination, scale, and viewpoint), viewpoint variation scenes (significant viewpoint changes but minimal illumination and scale variations), scale variation scenes (notable scale differences but minimal illumination and viewpoint variations), illumination variation scenes (substantial changes in illumination but minimal scale and viewpoint variations), and challenging scenes (significant variations in illumination, scale, and viewpoint). The performance of these methods is evaluated by recording matching precision, computational time, and the area under the curve (AUC) for relative pose estimation in these scenarios. The comprehensive analysis of these metrics reveals the differences in model performance under various complex conditions, providing valuable insights into their adaptability and robustness.
Qualitative analysis

Fig. 4. Performance comparison of four feature matching methods in different scenarios.
From the comparative analysis in Fig. 4, it is evident that the traditional SIFT method, despite detecting a certain number of matching points, exhibits a significant number of mismatches. This is particularly pronounced under challenging conditions such as illumination, viewpoint, and scale variations, where its robustness and adaptability are noticeably insufficient. In contrast, both MSpLoFTR and MSpGLoFTR demonstrate superior performance compared to LoFTR across all five scenarios. MSpLoFTR expands the matching coverage effectively and significantly reduces mismatches by introducing a multi-scale feature fusion strategy, exhibiting stronger adaptability in scenarios with illumination, viewpoint, and scale variations. Furthermore, MSpGLoFTR integrates a gated convolutional mechanism, dynamically adjusting the weights of critical features, achieving comprehensive optimization in terms of matching precision, the number of matching points, and mismatch control. Notably, in challenging scenarios, although the number of matching points is slightly fewer than that of LoFTR, the mismatches are significantly reduced, resulting in a substantial improvement in the reliability and credibility of the matching results.
Quantitative analysis
Matching precision The model was evaluated on a test set of 1500 images under the same testing conditions as in41, including the same number of test iterations and evaluation metrics. The results, presented in Table 1, show that MSpLoFTR achieved a 0.83% improvement in matching precision over LoFTR, validating the effectiveness of the multi-scale convolution and feature fusion strategies in reducing mismatches, enhancing accuracy, and improving robustness. MSpGLoFTR exhibited a further 1.21% increase in precision, attributed to the gated convolutional mechanism, which dynamically adjusts the weights of critical features, strengthening feature correlation while suppressing redundant noise. The experimental results indicate that MSpGLoFTR significantly outperforms the other methods in complex scenarios.
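As a point of reference, a common way to label a putative match as correct when ground-truth camera geometry is available is to threshold its symmetric epipolar distance in normalised image coordinates; precision is then the fraction of matches below that threshold. The snippet below is a minimal sketch of this convention; the threshold value and helper names are illustrative assumptions rather than our exact evaluation code.

```python
import numpy as np

def symmetric_epipolar_distance(pts0, pts1, E, K0, K1):
    """Squared symmetric epipolar distance of pixel matches, in normalised coordinates."""
    # move pixel coordinates into normalised camera coordinates
    pts0 = (pts0 - K0[[0, 1], [2, 2]][None]) / K0[[0, 1], [0, 1]][None]
    pts1 = (pts1 - K1[[0, 1], [2, 2]][None]) / K1[[0, 1], [0, 1]][None]
    pts0 = np.concatenate([pts0, np.ones_like(pts0[:, :1])], axis=-1)  # (N, 3) homogeneous
    pts1 = np.concatenate([pts1, np.ones_like(pts1[:, :1])], axis=-1)
    Ep0 = pts0 @ E.T                      # epipolar lines in image 1
    Etp1 = pts1 @ E                       # epipolar lines in image 0
    algebraic = np.sum(pts1 * Ep0, axis=-1) ** 2
    return algebraic * (1.0 / (Ep0[:, 0] ** 2 + Ep0[:, 1] ** 2)
                        + 1.0 / (Etp1[:, 0] ** 2 + Etp1[:, 1] ** 2))

def matching_precision(pts0, pts1, E, K0, K1, thr=1e-4):
    """Share of putative matches whose symmetric epipolar distance is below thr."""
    err = symmetric_epipolar_distance(pts0, pts1, E, K0, K1)
    return float((err < thr).mean()) if len(err) else 0.0
```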
Inference time In model evaluation, inference time, performance evaluation time, and result storage time are critical metrics. Inference time measures the duration from input image processing to outputting matching results, directly impacting the model’s real-time applicability. Performance evaluation time is used to calculate metrics such as pose estimation accuracy, while result storage time refers to the duration required to save matching results for subsequent analysis. Table 2 presents the time consumption of the three models on a test dataset comprising 1,500 samples. The inference time of LoFTR is 79 ms, while MSpLoFTR and MSpGLoFTR record 86 ms and 83 ms, respectively, all meeting real-time performance requirements. Although the latter two models exhibit slightly higher inference times, their improvements in matching precision and robustness outweigh the computational overhead, making them suitable for practical application scenarios.
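The reported inference times are wall-clock measurements; when timing a GPU model it is important to synchronise before reading the clock, since CUDA kernels are launched asynchronously. Below is a minimal sketch of such a measurement, assuming a PyTorch model whose forward pass accepts a prepared batch (the helper name and warm-up/iteration counts are illustrative).

```python
import time
import torch

@torch.no_grad()
def average_inference_ms(model, batch, warmup=3, iters=20):
    """Mean forward-pass latency in milliseconds over `iters` timed runs."""
    model.eval()
    for _ in range(warmup):        # warm up CUDA kernels and the caching allocator
        model(batch)
    torch.cuda.synchronize()       # make sure warm-up work has finished
    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()       # wait for all launched kernels before stopping the clock
    return (time.perf_counter() - start) / iters * 1e3
```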
Relative pose estimation This paper evaluates the relative pose estimation performance of the models under error thresholds of \(5^{\circ }\), \(10^{\circ }\), and \(20^{\circ }\) (see Table 3). The results show that both MSpLoFTR and MSpGLoFTR outperform LoFTR. Specifically, MSpLoFTR achieves improvements of 1.79%, 1.43%, and 0.97%, which can be attributed to its multi-scale feature fusion strategy that enhances adaptability to complex scenes. Building on this, MSpGLoFTR further optimizes performance by incorporating a gated convolution mechanism, achieving additional improvements of 2.35%, 1.73%, and 1.96%, significantly improving key feature extraction capability and robustness. In summary, MSpGLoFTR demonstrates superior accuracy in pose estimation and adaptability to various scenes.
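For reference, the relative pose underlying these numbers is typically recovered by fitting an essential matrix to the predicted matches with RANSAC and decomposing it into a rotation and a scale-free translation; the angular errors of both are then compared against the \(5^{\circ }\)/\(10^{\circ }\)/\(20^{\circ }\) thresholds. The sketch below illustrates this standard recipe with OpenCV; the RANSAC threshold and function names are illustrative assumptions.

```python
import cv2
import numpy as np

def estimate_relative_pose(pts0, pts1, K0, K1, pix_thr=0.5, conf=0.99999):
    """Fit an essential matrix with RANSAC and decompose it into (R, t)."""
    if len(pts0) < 5:
        return None
    # work in normalised coordinates; scale the pixel threshold by the mean focal length
    thr = pix_thr / np.mean([K0[0, 0], K0[1, 1], K1[0, 0], K1[1, 1]])
    p0 = cv2.undistortPoints(pts0.reshape(-1, 1, 2), K0, None).squeeze(1)
    p1 = cv2.undistortPoints(pts1.reshape(-1, 1, 2), K1, None).squeeze(1)
    E, mask = cv2.findEssentialMat(p0, p1, np.eye(3), method=cv2.RANSAC,
                                   prob=conf, threshold=thr)
    if E is None:
        return None
    _, R, t, _ = cv2.recoverPose(E, p0, p1, np.eye(3), mask=mask)
    return R, t

def pose_errors(R, t, R_gt, t_gt):
    """Rotation and translation angular errors in degrees."""
    cos_r = np.clip((np.trace(R.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    err_R = np.rad2deg(np.arccos(cos_r))
    cos_t = np.abs(np.dot(t.ravel(), t_gt.ravel())) / (
        np.linalg.norm(t) * np.linalg.norm(t_gt))
    err_t = np.rad2deg(np.arccos(np.clip(cos_t, 0.0, 1.0)))
    return err_R, err_t
```

In the usual protocol, the pose error of an image pair is taken as the maximum of the two angular errors, and the AUC sketched earlier is computed over these per-pair values.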
Visual localization
Visual localization is a critical task in image matching, aiming to estimate the 6-DoF pose of query images based on 3D scene models and provide spatial positioning information for various applications. For a fair comparison, we employ the HLoc42 hierarchical localization framework, which has been widely used in visual localization tasks. The Aachen v1.1 dataset43 serves as a challenging large-scale outdoor dataset, featuring significant variations in viewpoint and illumination between day and night, which particularly test the robustness of matching methods. We adopt the global localization trajectory of the Aachen v1.1 dataset for benchmarking. Specifically, we conduct experiments under both daytime and nighttime conditions and report the proportion of results satisfying the following position and rotation error thresholds:
Position error thresholds: 0.25 m, 0.5 m, 1.0 m
Rotation error thresholds: \(2^{\circ }\), \(5^{\circ }\), \(10^{\circ }\)
The experimental results, shown in Table 4, demonstrate that MSpGLoFTR significantly enhances localization accuracy in both daytime and nighttime scenarios through its multi-scale fusion and dynamic gating mechanisms, performing particularly well under the (0.25 m, \(2^{\circ }\)) and (1.0 m, \(10^{\circ }\)) thresholds.
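The percentages in Table 4 correspond to the fraction of query images whose estimated pose falls within each (position, rotation) threshold pair. A minimal sketch of this bookkeeping, assuming per-query position errors in metres and rotation errors in degrees are already available, is shown below; the function name is illustrative.

```python
import numpy as np

def localization_recall(pos_err_m, rot_err_deg,
                        thresholds=((0.25, 2.0), (0.5, 5.0), (1.0, 10.0))):
    """Fraction of queries localised within each (metres, degrees) threshold pair."""
    pos = np.asarray(pos_err_m, dtype=float)
    rot = np.asarray(rot_err_deg, dtype=float)
    return [float(np.mean((pos <= t_pos) & (rot <= t_rot)))
            for t_pos, t_rot in thresholds]
```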
Discussions
The MSpGLoFTR model proposed in this paper demonstrates outstanding performance in feature matching tasks within complex scenarios. By incorporating the Multi-Scale Local Attention Module (MSLAM) and Multi-Scale Parallel Attention Module, the model significantly enhances its ability to capture fine-grained local features and global contextual information, while also improving its adaptability to significant scale variations and complex scenes. Furthermore, by integrating the Gated Convolution Mechanism (GCN), the model dynamically adjusts the weight of key features, emphasizing important region features while effectively suppressing background noise, thereby further improving matching accuracy, robustness, and computational efficiency. Experimental results show that MSpLoFTR and MSpGLoFTR significantly outperform traditional methods (e.g., SIFT) and existing models (e.g., LoFTR) in terms of matching accuracy, relative pose estimation, and visual localization. Particularly, MSpGLoFTR performs exceptionally well in scenarios with significant illumination variations, scale differences, and viewpoint changes.
While MSpGLoFTR excels in several aspects, it still has certain limitations. First, due to the introduction of the Multi-Scale Local Attention Module, Parallel Attention Mechanism, and Gated Convolution Mechanism, the model’s computational complexity is relatively high, which could become a bottleneck in real-time applications. Therefore, an important future direction will be to reduce computational overhead while maintaining accuracy and robustness. Second, although MSpGLoFTR improves adaptability to large-scale features, the current multi-scale feature fusion strategy still falls short when handling extreme scale variations or small object matching tasks. Future work could explore finer-scale adjustment mechanisms or adaptive multi-scale modeling methods to further optimize performance in these cases.
Nevertheless, MSpGLoFTR provides an efficient and robust solution for feature matching in complex visual tasks, particularly demonstrating outstanding performance in scenarios with significant illumination changes, scale differences, and viewpoint variations. In the future, we will focus on optimizing computational efficiency, enhancing the model’s adaptability under low-texture and extreme scale changes, and further improving its generalization capability.