Key insights into recommended SMS spam detection datasets

This study evaluates ten publicly available SMS spam datasets using Decision Tree (DT) and Multinomial Naïve Bayes (MNB) classifiers. This section discusses possible explanations for the results reported in Section IV. The discussion is structured around a series of questions derived during the analysis of those results.

Comparative results of Multinomial Naïve Bayes and Decision Tree across different datasets

An overall analysis of the results from both experimental groups shows the superior performance of Multinomial Naïve Bayes (MNB) over Decision Tree (DT) in detecting SMS spam, which can be attributed to the intrinsic characteristics of these algorithms.

MNB is exceptionally well-suited for text data due to its foundational assumptions of word independence and frequency40,41. This characteristic aligns seamlessly with the bag-of-words model commonly employed in text classification. The probabilistic framework of MNB, based on Bayes’ Theorem, enables it to manage word distribution in text data with high efficacy42. MNB calculates the probability of a message being spam based on word frequencies, assuming that the presence or absence of a particular word in a message is independent of any other word. While this assumption simplifies the modeling process and allows for efficient classification, it also means that MNB may struggle with capturing deeper contextual, non-linear relationships or complex dependencies between words, which could limit its effectiveness in datasets where spam messages exhibit more sophisticated linguistic structures.
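To make this concrete, the following minimal Python sketch (not the authors' exact pipeline; the toy messages and scikit-learn defaults are our own assumptions) shows how a bag-of-words representation feeds word counts into MNB:

```python
# Minimal sketch: bag-of-words counts + Multinomial Naive Bayes with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages; real experiments would use one of the ten datasets.
messages = ["WIN a free prize now", "Are we still meeting for lunch?",
            "Claim your reward, limited offer", "See you at the gym later"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)
print(model.predict(["free prize offer"]))  # likely 'spam' on this toy data
```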

Conversely, Decision Trees (DT) are more apt for structured data characterized by explicit feature-value relationships. Unlike MNB, DT can model complex decision boundaries and capture nonlinear relationships in data, making them highly interpretable and adaptable to structured classification problems43. However, in high-dimensional textual data, the model’s reliance on discrete feature splits can lead to inefficiencies, particularly when words are sparsely distributed across messages. Effective application of DT to textual data necessitates intricate feature engineering, such as feature scaling, weighting, or dimensionality reduction, to mitigate the challenges posed by the high cardinality of unique words43.
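A hedged illustration of such feature engineering is sketched below, assuming scikit-learn is available and a corpus with well over 100 distinct terms; TF-IDF weighting and truncated SVD stand in for the weighting and dimensionality-reduction steps mentioned above:

```python
# Illustrative sketch of feature engineering for DT on text:
# TF-IDF weighting followed by dimensionality reduction before tree induction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

dt_pipeline = make_pipeline(
    TfidfVectorizer(),               # weight words instead of raw counts
    TruncatedSVD(n_components=100),  # reduce the high-cardinality word space
    DecisionTreeClassifier(random_state=42),
)
# dt_pipeline.fit(train_messages, train_labels)  # hypothetical training split
```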

MNB can effectively manage high-dimensional sparse data44, a common characteristic of text classification tasks, because it can cope with the sparsity of word occurrences across documents. The model’s probabilistic approach is robust in learning from word frequencies44. DT, on the other hand, can struggle with sparse data because it must split nodes based on the presence or absence of individual words, which often leads to overfitting and reduced generalization45,46. The tree structure becomes too complex and specific to the training data when dealing with such high-dimensional sparse datasets47.

Regarding the bias-variance trade-off, MNB has high bias but low variance48, meaning that while it makes strong simplifying assumptions, it tends to generalize well across different datasets without drastic performance fluctuations. This robustness is particularly beneficial when working with datasets that contain noise or inconsistencies, such as Dataset 5, where extensive preprocessing is required. However, MNB’s inability to capture complex interactions between words may lead to performance limitations, particularly in cases where spam messages contain subtle linguistic variations or semantic patterns that require contextual understanding.

In contrast, DT tends to have lower bias but higher variance, making it more flexible in capturing complex decision boundaries but also more prone to overfitting, especially when applied to high-dimensional datasets49. Overfitting occurs when DT learns from noise or idiosyncratic patterns in the training data rather than capturing generalizable trends, leading to weaker performance on unseen data. This issue is particularly evident in datasets with high complexity, such as Dataset 5, where variations in spam message structures may cause DT to create overly specific decision rules that do not generalize well across different spam categories.
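The sketch below shows, with our own illustrative values rather than settings from this study, the kind of pruning-related hyperparameters commonly used to limit DT variance:

```python
# Sketch: common ways to curb DT variance/overfitting via scikit-learn's
# pruning-related hyperparameters (the values shown are illustrative only).
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(
    max_depth=20,        # cap tree depth so decision rules stay general
    min_samples_leaf=5,  # forbid leaves fitted to only a handful of messages
    ccp_alpha=0.001,     # cost-complexity pruning strength
    random_state=42,
)
```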

The superior performance of Multinomial Naive Bayes (MNB) over Decision Trees (DT) observed in this study aligns with findings from prior research. A comparative analysis of the model’s highest accuracy, as presented in Table 8, further corroborates these results in relation to previous studies.

Table 8 The superior accuracy of MNB over DT, as supported by prior studies.

Factors influencing performance variability in experimental runs

Upon closer inspection of the performance of DT and MNB, it can be concluded that both models exhibit varying performance across the datasets used in this study. Performance variability refers to the differences in the experimental results obtained from each dataset, which can be influenced by factors such as dataset characteristics, class distribution, and the presence of noisy data.

The inherent complexity and structure of the data can significantly influence model performance. DT may encounter difficulties with datasets that exhibit complex and nuanced patterns, which probabilistic models like MNB are better equipped to handle, because DT tends to overfit on the intricate details of the training data, leading to varied performance when tested on different datasets52. In addition, language variations across datasets can influence model effectiveness: datasets representing different languages or transliterations can affect MNB’s performance through linguistic features unique to each language52. Word frequencies, sentence structures, and common spam words vary across languages, contributing to performance variability52.

Class imbalance is another critical factor affecting performance variability. Variations in the distribution of classes within the training and testing sets can cause significant fluctuations53. The size, quality, and spam-to-nonspam ratio of each dataset profoundly impact MNB’s performance. Larger datasets generally provide a more robust learning base, while imbalanced datasets can skew results54. If a particular class is underrepresented in a split, the model may struggle to recognize it, thereby affecting overall performance.
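One common safeguard against underrepresented classes in a split is a stratified train/test split that preserves the spam-to-ham ratio; the following hedged scikit-learn sketch uses hypothetical toy data to illustrate the idea:

```python
# Sketch: a stratified split keeps the spam-to-ham ratio identical in the
# training and test sets, avoiding the underrepresentation issue described above.
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 spam and 5 ham labels standing in for a real dataset.
messages = [f"msg {i}" for i in range(10)]
labels = ["spam"] * 5 + ["ham"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, stratify=labels, random_state=42
)
print(y_test)  # one spam and one ham, mirroring the overall class ratio
```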

The presence of random noise and outliers in the data can further contribute to performance variability, leading to inconsistent outcomes52,55. When noise affects different parts of the data during each run, the model’s training process can be disrupted in varying ways55. For example, datasets with high noise levels or irrelevant features may cause DT to grow overly complex, leading to poor generalization56.

Dataset characteristics significantly influenced the performance of both models. Dataset 2, which achieved the highest accuracy for MNB (99.03%), exhibited a nearly balanced class distribution and contained minimal noise, allowing MNB’s probabilistic approach to effectively distinguish between spam and non-spam messages. Similarly, Dataset 3, which recorded the highest accuracy for DT (98.35%), contained multilingual text, suggesting that DT was better able to leverage structured word patterns across different languages. However, MNB’s word independence assumption likely contributed to its slightly lower performance in this dataset.

In contrast, Dataset 5, which recorded one of the lowest accuracies for both models (MNB: 86.10%, DT: 76.55%), demonstrated high feature diversity and real-world complexity. The dataset’s high qualitative assessment score indicates a rich set of spam characteristics, yet the presence of noise, feature redundancy, and potential label inconsistencies may have contributed to lower classification performance. DT’s tendency to overfit was particularly evident in Dataset 5 and Dataset 9, both of which have high Gini coefficients (0.4999), indicating a more balanced distribution of spam and ham messages. However, the high feature complexity and presence of transliterated text in Dataset 9 negatively impacted both models, particularly DT, which struggled to create meaningful feature splits.

Assessment of datasets

A high-quality dataset can be evaluated quantitatively or qualitatively. The quantitative evaluation of the datasets used in this research has been thoroughly discussed in previous sections. The datasets were then subjected to a qualitative assessment to determine their reliability, effectiveness, and reusability. This assessment encompasses several critical criteria: the authenticity of the source, class imbalance, the diversity of features, the availability of metadata, and the data preprocessing requirement. Each dataset is evaluated against a five-level scale for each criterion, corresponding to the five Likert points defined in the respective tables.

A dataset retrieved or downloaded from an authentic source offers various benefits, including high accuracy and reliability, which reduces the likelihood of errors and inconsistencies during data analysis20. Additionally, authentic datasets are often accompanied by clear documentation to facilitate reusability, allowing other researchers to replicate the study and verify the results57. Furthermore, data from reliable sources typically exhibit consistency in format, structure, and quality, simplifying data preprocessing and analysis and reducing the need for extensive cleaning and transformation20. To assess the authenticity of the data source, this research followed the work of58 and constructed a five-level evaluation system based on the following questions, which directly inform the qualitative assessment of each dataset’s source authenticity.

  1. Is the dataset published in a peer-reviewed journal or a conference paper, or has it been used in a competition?

  2. Is there clear documentation about how the data was collected?

  3. Is there a clear history of the data, including any transformations or processing steps it has undergone?

  4. Is the publication or release date of the dataset clearly stated?

  5. Is the dataset internally consistent, with no unexplained variations?

Table 9 shows the score assigned to each dataset for the authenticity of its source. The scores are assigned based on the evaluative points above. For instance, Dataset 1 receives the highest Likert value of 5 because it was published in a peer-reviewed journal, has clear documentation of how the data was collected, has a clear data history, clearly states its release date, and is internally consistent with no unexplained variations.

Table 9 Scores assigned to each dataset for source authenticity.

The class distribution of a dataset is important, as it can directly and indirectly influence the obtained results. A balanced class distribution improves the model’s performance and helps the model generalize better to new, unseen data, as it is less likely to be biased towards the more frequent classes59. On the other hand, an imbalanced class distribution results in poor model performance, especially because it causes low recall for minority classes and skewed accuracy60. To assess the balance of class distribution within a dataset, this research proposes a five-level evaluation system to quantify the severity of class imbalance, inspired by the imbalance ratio (IR) discussed in prior work61,62. This framework allows for a more nuanced understanding of imbalance severity and informs the selection of appropriate techniques, such as threshold adjustment63, to improve model performance. The proportions of spam and non-spam instances are computed and converted into percentages. These percentages are then evaluated using a rating scale ranging from 1 to 5, as delineated by the following criteria (a short code sketch applying these levels follows the list):

  1. (Very Poor): The dataset is highly imbalanced, with one class comprising 90% of the data.

  2. (Poor): The dataset is significantly imbalanced, with one class comprising 80% of the data.

  3. (Fair): The dataset is moderately imbalanced, with one class comprising 70% of the data.

  4. (Good): The dataset is slightly imbalanced, with one class comprising 60% of the data.

  5. (Excellent): The dataset is perfectly or nearly perfectly balanced, with classes having equal or nearly equal representation.
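The sketch below is our own illustration of how the majority-class percentage could be mapped to these five levels; the handling of exact threshold boundaries is an assumption, and the example counts are hypothetical:

```python
# Sketch: map the majority-class percentage onto the five imbalance levels above.
def imbalance_score(n_spam: int, n_ham: int) -> int:
    majority_pct = 100 * max(n_spam, n_ham) / (n_spam + n_ham)
    if majority_pct >= 90:
        return 1  # Very Poor
    if majority_pct >= 80:
        return 2  # Poor
    if majority_pct >= 70:
        return 3  # Fair
    if majority_pct >= 60:
        return 4  # Good
    return 5      # Excellent

print(imbalance_score(500, 4500))  # hypothetical 90% majority -> score 1
```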

Table 10 shows the score assigned to each dataset for class imbalance. The scores are assigned based on the five-level classification system above. For instance, Dataset 6 receives the highest Likert value of 5 because it has a perfectly balanced distribution of spam and non-spam messages.

Table 10 Scores assigned to each dataset for class imbalance.

The incorporation of diverse features in SMS spam detection research serves to enhance detection accuracy and the model’s robustness. Varied feature types capture distinct facets of spam messages, enabling the detection algorithm to make more nuanced decisions. This aligns with the findings of64, who noted that diversity in datasets helps models generalize their learnings to new and unseen cases. Similarly, incorporating diverse feature types ensures that the model can handle a wider range of spam characteristics, thus improving its robustness and adaptability. To measure feature diversity within datasets, this research proposes a five-level evaluation system that quantifies diversity purely based on the number of features present rather than the specific type of features. The rationale for this classification is as follows:

  • Datasets with fewer features (Scores 1–2) contain only basic text and labels, limiting their ability to provide meaningful distinctions between spam and ham messages.

  • Datasets with moderate features (Score 3) begin to introduce additional attributes, offering minor improvements to model learning.

  • Datasets with high feature diversity (Scores 4–5) provide richer insights by incorporating multiple attributes, which significantly enhance model performance and generalizability.

The five-level evaluation system is as follows:

  1. (Very poor): The dataset contains only two features: the raw text messages and their labels (e.g., spam/ham).

  2. (Poor): The dataset contains three features: raw text messages, labels, and one additional attribute.

  3. (Fair): The dataset contains four features: raw text messages, labels, and two additional attributes.

  4. (Good): The dataset contains five features: raw text messages, labels, and three additional attributes.

  5. (Excellent): The dataset contains six or more features, providing a diverse range of attributes that significantly improve spam detection.

Table 11 shows the score assigned to each dataset for its diversity of features. The scores are assigned based on the five-level classification system above. For instance, Dataset 5 receives the highest Likert value of 5 because it has five additional attributes in addition to the raw text messages and labels.

Table 11 Scores assigned to each dataset for feature diversity.

Metadata encompasses supplementary details associated with a text message beyond its actual content. These additional pieces of information furnish crucial contextual insights about the dataset, encompassing its origin, purpose, and structural characteristics, which plays a pivotal role in accurately interpreting the outcomes derived from the dataset22. Furthermore, robust metadata practices foster data sharing and collaborative endeavours among researchers by simplifying the comprehension and utilization of shared datasets58. The clarity inherent in meticulously documented metadata enhances communication and collaboration across disciplinary and institutional boundaries. To evaluate metadata availability, this research proposes a five-level evaluation system that assesses metadata richness based on the number of metadata fields present and their degree of exposure. The rationale for this classification is as follows:

  • Datasets with minimal metadata (Scores 1–2) provide little to no contextual information, reducing their applicability for advanced analysis.

  • Datasets with moderate metadata (Score 3) include some metadata fields but may have missing values or limited exposure.

  • Datasets with high metadata availability (Scores 4–5) provide structured, comprehensive metadata, improving interpretability and dataset usability.

The five-level evaluation system is as follows:

  1. (Very poor): The dataset contains only text messages and labels (spam/ham) with no additional metadata.

  2. (Poor): The dataset includes 1–2 metadata fields, but exposure is limited or inconsistent.

  3. (Fair): The dataset contains 3–4 metadata fields, offering some context but lacking full exposure.

  4. (Good): The dataset contains 5–6 metadata fields, with structured exposure of metadata across most records.

  5. (Excellent): The dataset contains 7 or more metadata fields, providing fully detailed and consistently structured metadata across all records.

Table 12 shows the score assigned to each dataset for metadata availability. The scores are assigned based on the five-level classification system above. For instance, Dataset 5 receives the highest Likert value of 5 because it provides fully detailed and consistently structured metadata across all records.

Table 12 Scores assigned to each dataset for metadata availability.

The assessment of data preprocessing pertains to the extent of preparatory measures needed to render the data compatible for model ingestion. This preparatory phase encompasses both data cleansing and integration procedures. The degree of preprocessing varies across datasets, with certain datasets necessitating more extensive preprocessing efforts than others. Consequently, datasets requiring extensive preprocessing impose a higher computational burden, impacting the feasibility of research workflows.

To evaluate the effort required for data preprocessing, this research proposes a five-level evaluation system based on the complexity of the preprocessing tasks required (a brief code sketch of typical “standard” preprocessing steps follows the scale below). The rationale for this classification is as follows:

  • Datasets requiring minimal preprocessing (Scores 1–2) are well-structured and nearly ready for use, with only minor cleaning needed.

  • Datasets requiring moderate preprocessing (Score 3) contain minor inconsistencies that necessitate text normalization and label standardization.

  • Datasets requiring extensive preprocessing (Scores 4–5) are highly unstructured, with significant noise, missing values, and imbalanced data, requiring multiple preprocessing steps.

The five-level evaluation system is as follows:

  1. (Minimal): The dataset requires little or no preprocessing.

  2. (Low): The dataset requires minor formatting adjustments.

  3. (Moderate): The dataset requires the application of standard preprocessing steps.

  4. (High): The dataset requires significant preprocessing steps.

  5. (Extensive): The dataset requires multiple significant preprocessing steps.
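As an illustration of what the level 3 “standard preprocessing steps” might involve for an English-language dataset, the following hedged sketch applies typical cleaning operations; the exact steps a given dataset needs will differ:

```python
# Sketch of typical text cleaning for an English-language SMS corpus
# (our own assumption of what "standard preprocessing" covers).
import re

def clean_sms(text: str) -> str:
    text = text.lower()                        # normalize case
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # strip punctuation/symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(clean_sms("FREE entry!! Visit http://example.com NOW"))
# -> "free entry visit now"
```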

Table 13 shows the score assigned to each dataset for its data preprocessing requirements. The scores are assigned based on the five-level classification system above. For instance, Dataset 2 receives a low Likert value of 2 because it requires only minor formatting adjustments.

Table 13 Scores assigned to each dataset for data preprocessing requirements.

Factors contributing to accuracy variations in dataset 4 and dataset 7 across both experimental groups

This research employs ten publicly available SMS spam detection datasets. Among them, Dataset 2, Dataset 4, Dataset 7, and Dataset 8 are presented in their natural linguistic form rather than being transliterated, unlike the other non-English datasets. An interesting observation arises when comparing the performance trends of DT and MNB between the first and second groups of experiments for Dataset 7. While the performance of both DT and MNB increased in the second group relative to the first for Dataset 2, Dataset 4, and Dataset 8, for Dataset 7 only DT showed an improvement, whereas MNB declined. To ensure a smooth and logical discussion, Dataset 4 was randomly chosen from among the other monolingual non-English datasets to be compared against Dataset 7.

Table 14 shows the accuracy achieved by DT and MNB for Dataset 4 and Dataset 7 in both groups of the experiment. For Dataset 4, the accuracy of both DT and MNB increases from the first group to the second, whereas for Dataset 7 only the accuracy of DT increases while that of MNB decreases. The differences in performance between Dataset 4 and Dataset 7 across both groups can be attributed to two factors: the relevance of stopwords in each dataset and the sensitivity of the models to their removal.

Table 14 The accuracy of DT and MNB for dataset 4 and dataset 7 in both groups of the experiment.

The improvement observed in Dataset 4 when Bengali stopwords are removed can be attributed to the specific linguistic features of Bengali. The removal of Bengali stopwords likely reduced noise and irrelevant features, improving the quality of the features available for both models65. For other non-English datasets, transliteration issues and language-specific nuances could impede the effectiveness of stopword removal. Inconsistent transliteration and variations in spelling can leave noise in the data, limiting the improvement in model performance66. Additionally, the quality of the stopword list plays a crucial role in model performance. According to64, an incomplete or inaccurate stopword list can limit the expected improvement in performance. In this study, the experimental results for both MNB and DT showed improvements after removing Bengali stopwords for Dataset 4, suggesting that the stopword list used was comprehensive. The distinct separation between spam and non-spam messages in Dataset 4 after stopword removal highlights the efficacy of this preprocessing step.

Conversely, the original language in Dataset 7 might rely heavily on stopwords to convey essential context. Removing these stopwords can disrupt the contextual integrity required for MNB to perform effectively, as this model relies on the word frequency distribution to make accurate predictions. In this case, stopwords carry significant meaning within the language structure, and their removal can negatively impact MNB’s performance. However, DT benefits from the removal of stopwords in Dataset 7, as this model can better handle a reduced feature set by focusing on the remaining words, suggesting that the stopwords in Dataset 7 were adding unnecessary complexity and noise in DT which hinders its decision-making process.

The sensitivity of MNB and DT to stopword removal also plays a crucial role in the observed accuracy variations. For MNB, performance in Dataset 4 improves with the removal of stopwords, as it reduces noise and enhances the signal, allowing the model to focus on more informative words. In contrast, for Dataset 7, the removal of stopwords disrupts the probability calculations that MNB relies on, thereby reducing its accuracy. In Dataset 4, DT similarly benefits from reduced complexity and less noise, leading to clearer decision boundaries and improved accuracy. For Dataset 7, unlike MNB, DT also shows an increase in accuracy when stopwords are removed, indicating that the stopwords in this dataset were acting as noise and that their removal helped the DT model create more accurate splits.

It is important to note that while MNB showed an accuracy trend in Dataset 7 that does not align with the trend observed for most other datasets when comparing the first and second groups of experiments (refer to Table 14), the difference in class imbalance between Dataset 4 and Dataset 7 could explain the accuracies observed. As shown in Table 10, Dataset 4, rated 4 on the Likert scale, is more imbalanced than Dataset 7, rated 5, resulting in misleadingly high accuracy for Dataset 4 due to data imbalance. This underscores the issue of skewed accuracy and highlights the critical role of stopwords in addressing such challenges.

Factors contributing to enhanced accuracy in dataset 3 relative to other datasets

Dataset 3, which consists of SMS messages in English, German, and French, exhibited a unique performance trend in which the Decision Tree (DT) model outperformed Multinomial Naïve Bayes (MNB). This performance difference can be attributed to several factors, including the impact of the language feature independence assumption, handling of class imbalance, the incomplete stopword removal process in the first group of experiments, and overfitting and variance.

One key factor influencing the model’s performance is the language feature independence assumption inherent to MNB. This model assumes that word occurrences are independent, meaning that each word’s probability is calculated separately from others. While this assumption often works well for monolingual datasets, it becomes problematic in multilingual datasets like Dataset 3, where word meanings and distributions vary across different languages. For example, common spam-related words in English are expressed differently in French and German. Since MNB aggregates word frequencies across all three languages without distinguishing between them, it fails to recognize spam indicators effectively across multiple linguistic structures. Conversely, DT does not rely on the independence assumption and instead recursively splits the dataset based on the most informative features64,65. This flexibility allows DT to adapt to language-specific spam indicators, making it more effective in handling multilingual datasets like Dataset 3.

Another crucial factor affecting the models’ performance is class imbalance. Dataset 3 contains 2,241 spam messages compared to 14,460 non-spam messages, creating a significant imbalance that influences model learning, as indicated by its rating on the Likert scale in Table 10. As a probability-based classifier, MNB struggles with imbalanced data because its probability estimates are naturally skewed in favor of the majority class (non-spam messages). In other words, MNB may not handle class imbalance as effectively unless specific techniques like class weighting or resampling are employed44. As a result, spam messages are often misclassified due to lower word frequencies. In contrast, DT is more resilient to class imbalance as it learns decision rules based on how well each feature (word) separates spam from non-spam. Instead of relying solely on word occurrence probabilities, DT dynamically adjusts its decision boundaries, allowing it to classify spam messages more effectively even when they are the minority64,65.

The stopword removal strategy in the first group of the experiment also played a significant role in model performance differences. In the first group of the experiment, only English stopwords were removed while German and French stopwords remained in the dataset. This had a disproportionate effect on MNB, as it relies heavily on word frequency distributions. The presence of frequent yet uninformative German and French stopwords had introduced noise into MNB’s probability calculations, reducing its ability to differentiate between spam and non-spam. Since MNB assigns equal importance to all words, these stopwords diluted the significance of actual spam-related terms, leading to lower classification accuracy. In contrast, DT naturally selects the most important words for classification through its recursive splitting process, meaning it was less affected by the presence of unremoved stopwords. This allowed DT to remain more robust despite the incomplete stopword removal process.
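A hedged sketch of the second-group preprocessing for a multilingual dataset such as Dataset 3 is shown below; it assumes the NLTK stopword corpus is available and combines the English, German, and French lists rather than using English alone:

```python
# Sketch: remove stopwords for all three languages in a multilingual corpus.
# Assumes the NLTK stopwords corpus can be downloaded.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
multilingual_stops = set(
    stopwords.words("english") + stopwords.words("german") + stopwords.words("french")
)

def strip_stopwords(text: str) -> str:
    return " ".join(w for w in text.split() if w.lower() not in multilingual_stops)

print(strip_stopwords("Sie haben einen Preis gewonnen and you can claim it now"))
```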

Additionally, the consideration of overfitting and variance may provide further explanation. DT, while prone to overfitting, can perform exceptionally well, capturing patterns effectively without overfitting if the dataset is not too noisy67. In contrast, MNB, generally less prone to overfitting due to its simplicity, might ignore some intricate patterns that a DT could capture44.

Dataset recommendation

In the present research, ten SMS spam detection datasets were analyzed. Each dataset is characterized by its distinct attributes, which exert influence on the performance of the employed models: Decision Tree and Multinomial Naïve Bayes. The primary objective of this investigation is formulating dataset recommendations predicated upon model performance. Specifically, these recommendations are anchored in the accuracy metrics generated by the models. Since MNB consistently outperformed DT in the experiments, dataset evaluations are based on MNB’s results to provide more reliable and accurate insights for future research.

The dataset recommendation is made based on both the quantitative results (model accuracy) and the qualitative assessment. To facilitate the quantitative recommendation process, a set of grading criteria is introduced, contingent upon the accuracy levels attained by MNB. These criteria are stratified into three distinct categories: high accuracy (≥ 95%), moderate accuracy (90–94.99%), and low accuracy (< 90%). Given the absence of established industry benchmarks and previous studies providing thresholds for SMS spam detection, the categorization in this study serves as a means to interpret model performance across datasets. This exploratory approach may offer guidance for future research in developing more definitive benchmarks for SMS spam detection. Furthermore, given the delineation of the research into two experimental groups, the recommendations encompass MNB performance from both groups. The guidelines for categorizing each dataset as most challenging, moderately challenging, or least challenging are presented in Table 15. It is important to note that these criteria do not apply to Dataset 1, Dataset 3, and Dataset 5, as they were not included in the second experiment group; their categorization is therefore determined based solely on MNB’s accuracy in the first group.
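The banding can be expressed as a small helper, sketched below under the assumption that the accuracy bands map directly onto the three challenge categories in Table 15:

```python
# Sketch: map an MNB accuracy (in percent) to the challenge categories used here.
def challenge_category(mnb_accuracy: float) -> str:
    if mnb_accuracy >= 95.0:
        return "least challenging"       # high accuracy
    if mnb_accuracy >= 90.0:
        return "moderately challenging"  # moderate accuracy
    return "most challenging"            # low accuracy

print(challenge_category(99.03))  # e.g., Dataset 2's MNB accuracy
print(challenge_category(86.10))  # e.g., Dataset 5's MNB accuracy
```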

In this research, a challenging dataset is one in which the models exhibit lower accuracy, not because of flaws, but due to its diverse spam patterns, real-world complexity, and feature richness. While noise and ambiguity may contribute to difficulty, such datasets encourage the development of more adaptable and generalizable models. Additionally, the recommended dataset is the one with the highest overall qualitative assessment score, ensuring it is well-documented, diverse, and beneficial for advancing spam detection research.

Recommending the most challenging dataset is beneficial as it highlights dataset complexity, thereby driving the development and refinement of more robust and sophisticated models. Furthermore, challenging datasets promote the advancement of algorithms capable of greater adaptability and resilience to diverse forms of noise and ambiguity. Additionally, recommending the dataset with the highest average score across qualitative factors improves research quality, model performance, and usability, while mitigating risks related to bias, data inconsistencies, and unnecessary complexity.

Table 15 The guideline of categorization criteria for each dataset.

Based on the aforementioned distinct categories, the category of accuracy for each dataset with the removal of English language stopwords and with the removal of respective non-English language stopwords is summarized in Tables 16 and 17, respectively. Table 18 shows the overall category of challenges for each dataset based on the delineated dataset criteria.

Table 16 The category of challenges of each dataset with the removal of English language stopwords.
Table 17 The category of challenges of each dataset with the removal of the respective non-English language stopwords.
Table 18 Overall category of challenges for each dataset.

Based on Table 18, Datasets 1, 2, 3, 6, and 8 are identified as the least challenging for MNB, making them high-quality datasets suitable for baseline comparison studies or testing new models due to their consistent performance. In contrast, Datasets 7 and 9 present moderate challenges, making them useful for assessing model robustness and refining algorithms or feature engineering techniques. Datasets 4, 5, and 10 are the most challenging, offering valuable testbeds for developing and evaluating novel methodologies to enhance model performance under more complex conditions. These recommendations are grounded in the observed model performance.

By averaging the Likert scores assigned to each dataset in Tables 9, 10, 11, 12 and 13, an additional recommendation emerges based on the qualitative assessments: dataset authenticity, class imbalance, feature diversity, metadata availability, and preprocessing requirement. As indicated in Table 19, Dataset 5 has the highest average score, making it the most recommended dataset. Given that Dataset 5 is also one of the most challenging for MNB, it is strongly recommended for future SMS spam detection research, followed by Dataset 6 and Dataset 8, particularly for evaluating model performance in complex scenarios.
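The averaging behind Table 19 amounts to a simple mean over the five criterion scores; the sketch below uses clearly hypothetical values rather than the actual scores from Tables 9 to 13:

```python
# Sketch: average the five qualitative criterion scores for one dataset.
criteria_scores = {"authenticity": 4, "class_balance": 3, "feature_diversity": 5,
                   "metadata": 5, "preprocessing": 4}  # hypothetical values
average_likert = sum(criteria_scores.values()) / len(criteria_scores)
print(round(average_likert, 2))  # 4.2
```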

Table 19 Average value of each dataset based on the Likert values from Tables 9 to 13.

Since the recommended dataset, Dataset 5, has the highest average Likert value in Table 19, it can serve as an ideal testbed for driving algorithm development and enhancing the adaptability and robustness of SMS spam detection models. For example, because Dataset 5 did not score the highest value for the qualitative assessment of class imbalance, it necessitates the integration of advanced resampling techniques, such as SMOTE or undersampling, during algorithm development to ensure fair model training and evaluation. Future algorithms tuned to Dataset 5 will also have to account for inconsistencies introduced by data integration from multiple sources, which requires models to be resilient to noisy, incomplete, and heterogeneous data.
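A hedged sketch of such a resampling step is given below; it assumes the third-party imbalanced-learn package is installed and applies SMOTE after TF-IDF vectorization (SMOTE operates on numeric features, not raw text), with a hypothetical toy corpus standing in for Dataset 5:

```python
# Sketch: oversample the minority (spam) class with SMOTE after vectorization.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced toy corpus (8 ham, 3 spam) standing in for a real split.
msgs = ["hi there"] * 8 + ["win cash prize now", "free prize claim", "cash offer win"]
labels = ["ham"] * 8 + ["spam"] * 3

X = TfidfVectorizer().fit_transform(msgs)  # SMOTE needs numeric features
X_res, y_res = SMOTE(k_neighbors=2, random_state=42).fit_resample(X, labels)
print(Counter(y_res))                      # classes now balanced
```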

Leveraging Dataset 5 for model training encourages the development of more adaptable algorithms capable of handling diverse spam message structures. Feature engineering techniques, such as extracting semantic patterns, contextual embeddings, and n-grams, can be integrated during algorithm development to further enhance model effectiveness in identifying spam characteristics that may not be explicitly labeled. Additionally, the dataset’s challenging characteristics encourage experimentation with hybrid and ensemble learning approaches that improve model generalization and ensure higher performance across different SMS datasets. Moreover, transfer learning can be explored by fine-tuning models trained on Dataset 5 and applying them to different datasets, reinforcing the model’s ability to generalize across various SMS spam detection tasks.
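The sketch below illustrates two of these directions under our own assumptions: word n-gram features via TF-IDF and a simple soft-voting ensemble of the two classifiers studied here. It is an illustrative combination, not a method evaluated in this paper:

```python
# Sketch: unigram+bigram TF-IDF features feeding a soft-voting ensemble of MNB and DT.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline

ensemble = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigrams + bigrams
    VotingClassifier(
        estimators=[("mnb", MultinomialNB()),
                    ("dt", DecisionTreeClassifier(max_depth=20, random_state=42))],
        voting="soft",                     # average predicted probabilities
    ),
)
# ensemble.fit(train_messages, train_labels)  # hypothetical training split
```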



