Explainable AI analysis for smog rating prediction

The processing schema presented in Fig. 1 starts with data preparation, using a dataset containing a feature vector (X-vector) and a target vector (Y-vector), where the target is the smog rating. A further preprocessing step addresses missing (NA or null) values in the dataset, which are filled with the mean value of the corresponding feature. The dataset is then divided into two subsets: training data (Train-X and Train-Y), used to build the models, and testing data (Test-X and Test-Y), used to assess them. This division is crucial for proper training and unbiased evaluation of the models. Next comes model training, where the training data is fed into two models: the Random Forest and the Explainable Boosting Classifier (EBC). Random Forest is an ensemble learning method that combines multiple decision trees to improve predictive capacity. The EBC is likewise an ensemble technique; it adds decision trees one at a time, improving performance while remaining interpretable. After training, the models are employed in the prediction and explanation stages. Given the Test-X data, the Random Forest and EBC models predict smog ratings on a scale of 1 to 8, where 1 denotes the worst smog impact and 8 the best.
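A minimal end-to-end sketch of this flow is given below. It is only illustrative: the file name, column names, and model settings are assumptions rather than details taken from the source, and categorical columns are label-encoded in a single line (this step is detailed later under "Embedding techniques").

```python
# Minimal sketch of the Fig. 1 processing flow (illustrative names, not the
# authors' exact identifiers): load data, impute, split, train, predict.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from interpret.glassbox import ExplainableBoostingClassifier

df = pd.read_csv("vehicle_emissions.csv")           # hypothetical file name

# Fill NA/null values of numeric features with the feature mean
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Quick label encoding of categorical columns (see "Embedding techniques")
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes

X = df.drop(columns=["SmogRating"])                 # feature vector (X)
y = df["SmogRating"]                                # target vector (Y)

# Split into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train the two models and predict smog ratings (1-8) for unseen data
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
ebc = ExplainableBoostingClassifier(random_state=42).fit(X_train, y_train)
rf_pred, ebc_pred = rf.predict(X_test), ebc.predict(X_test)
```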

Fig. 1 Proposed work for SmogRating prediction.

To assess the reliability of the results, Explainable AI (XAI) methods are used to analyze the models' predicted values and compare them against the actual test labels (Test-Y). This step helps in understanding the rationale behind the models and their output, exposing hidden bias or errors. Finally, model evaluation is performed using a confusion matrix, which details the accuracy of each model by counting the correctly and incorrectly classified instances in each smog rating class. The evaluation also distinguishes between model-agnostic and model-specific interpretation methods. Model-agnostic methods explain predictions without regard to the internal workings of the model, while model-specific methods exploit the particular structure and parameters of the model to give a detailed account of why a given decision was made.
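As one concrete example of a model-agnostic method, the sketch below applies LIME to a single test instance of the Random Forest from the pipeline sketch above; the explainer settings are assumptions, not the authors' configuration.

```python
# Model-agnostic explanation of one Random Forest prediction with LIME,
# reusing rf, X_train, X_test and y_train from the pipeline sketch above.
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=[str(c) for c in sorted(y_train.unique())],
    mode="classification",
)

# Explain the first test instance: which features pushed its rating up or down?
exp = explainer.explain_instance(X_test.values[0], rf.predict_proba, num_features=5)
print(exp.as_list())
```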

Dataset preparation

The dataset4 consists of 27,000 rows and 15 columns, with a CSV file size of 2673 KB. It measures the environmental performance of a vehicle, with particular emphasis on the Smog Rating, which reflects a vehicle's potential to harm respiratory health. The Smog Rating captures the relative smog impact of a vehicle, based on its emissions of nitrogen oxides and non-methane organic gases, on an index ranging from 1 to 8. A rating of 8 indicates a clean vehicle with very low emissions, while a 1 indicates high levels of pollutant emissions. The scale is therefore inverted with respect to emissions: a higher score is preferable, particularly where there is environmental degradation within cities.

From the dataset, some observations emerge. For instance, all 2017 Acura models have a consistent Smog Rating of 6, reflecting moderate air quality impact. Despite variations in engine size, fuel consumption, and CO2 emissions, their smog-forming emissions remain similar. In contrast, most 2023 Volvo models have a Smog Rating of 5, slightly worse than the Acura models. However, two exceptions, the XC60 B6 AWD and XC90 B6 AWD, achieve a Smog Rating of 7 due to advanced emissions-reduction technologies, despite their higher fuel consumption.

The Smog Rating is affected by several factors. Engine technology plays a major role, since systems such as the catalytic converter greatly reduce emissions. Fuel type is another factor: hybrid and electric vehicles produce far lower emissions than those running on conventional fuel. Engine displacement and power are also important parameters; the greater the displacement, the higher the emissions of smog-forming substances, despite advances in technology. Ratings also depend on vehicle class, since compact vehicles and hybrids are cleaner than, for instance, large SUVs and high-performance automobiles.

The Smog Rating extends the CO2 Rating, which quantifies a vehicle's climate footprint over its lifetime by accounting for greenhouse gas emissions. While the CO2 Rating addresses global warming, the Smog Rating focuses on the short-term problems of air pollution and is thus especially important for large cities.

The major challenges for the dataset involve transforming it into a machine-readable format by addressing missing values, normalizing, and scaling. The following techniques were applied to continuous, categorical, and discrete columns to prepare the dataset for machine learning: handling missing data through imputation, applying normalization to continuous features to ensure they are within a consistent scale, and encoding categorical variables to make them interpretable by machine learning algorithms. These steps were essential to ensure the dataset was clean, consistent, and ready for model training.
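The snippet below illustrates the mean-imputation step for the continuous columns; the column names follow Table 2, and the file name is a hypothetical placeholder.

```python
# Mean imputation of missing values in the continuous features (Table 2)
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("vehicle_emissions.csv")            # hypothetical file name

continuous_cols = [
    "EngineSize_L", "FuelConsCity_L100km", "FuelConsHwy_L100km",
    "Comb_L100km", "Comb_mpg", "CO2Emission_g_km",
]

imputer = SimpleImputer(strategy="mean")              # replace NA with column mean
df[continuous_cols] = imputer.fit_transform(df[continuous_cols])
```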

The fifteen features in the dataset are classified into discrete, continuous, and categorical data. The discrete variables are Model Year, Cylinders, CO2 Rating, and Smog Rating, a sample of which is shown in Table 1 below.

Table 1 Discrete features in dataset.

The continuous variables including EngineSize_L, FuelConsCity_L100km, FuelConsHwy_L100km, Comb_L100km, Comb_mpg and CO2Emission_g_km are described in Table 2 with example data.

Table 2 Continuous features in dataset.

All continuous features from Table 2 were shown to be normally distributed using appropriate statistical methods and visualizations, as presented in Fig. 2.

Fig. 2 Normal distribution of continuous features.

There are several categorical variables in the dataset, such as makes, models, vehicle classes, transmissions, and fuel types, which are crucial when studying vehicles. These features are described in Table 3, and their encoded values are given in Table 4. Among them, the Make feature contains 74 subcategories, while the Model feature contains as many as 2078 subcategories. The VehicleClass feature includes 30 categories, reflecting various types of vehicles, and Transmission accounts for 26 different categories. Lastly, FuelType has 4 categories, representing the range of fuel options used by the vehicles. This detailed categorization underscores the dataset's complexity and richness in capturing diverse vehicle attributes.

Table 3 Categorical features in dataset.
Table 4 Encoding of categorical features.

The target variable, SmogRating, is defined over 8 classes, but only 7 of them appear in the dataset. This was addressed by handling the 7 available classes within the 1–8 rating range, as shown in Fig. 3.

Fig. 3 Target variable distribution.

This dataset offers a rich resource for analyzing vehicle emissions, highlighting the relationships between vehicle features, environmental impact, and performance over nearly three decades.

Embedding techniques

Embedding techniques in machine learning refer to methods used to represent categorical variables or textual data as numerical vectors, which are suitable for training models. In this case, categorical data was transformed using Label Encoding, where each category is assigned a unique integer. This method allows models to interpret categorical features effectively. For continuous numerical data, Standard Scaling is often employed to normalize values, ensuring that features are on a comparable scale, improving model performance. SMOTE (Synthetic Minority Over-sampling Technique) was applied to balance class distributions, generating synthetic examples to prevent model bias towards the majority class.
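A short sketch of the encoding and scaling steps is shown below, assuming the imputed DataFrame df from the previous snippet; the categorical column names follow Table 3. The SMOTE step itself is sketched with the balanced-data experiments further down.

```python
# Label encoding of categorical features and standard scaling of continuous ones
from sklearn.preprocessing import LabelEncoder, StandardScaler

categorical_cols = ["Make", "Model", "VehicleClass", "Transmission", "FuelType"]
continuous_cols = [
    "EngineSize_L", "FuelConsCity_L100km", "FuelConsHwy_L100km",
    "Comb_L100km", "Comb_mpg", "CO2Emission_g_km",
]

for col in categorical_cols:                         # one integer per category
    df[col] = LabelEncoder().fit_transform(df[col])

df[continuous_cols] = StandardScaler().fit_transform(df[continuous_cols])

X = df.drop(columns=["SmogRating"])                  # encoded feature matrix
y = df["SmogRating"]                                 # target classes 1-8
```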

Parameters for fine-tuning of the model

In fine-tuning the model, various steps and parameters were leveraged to enhance the model’s accuracy and generalization. Random Forest Regressor was employed as the base model, with the data being split into training and test sets using the train_test_split function (test_size = 0.3, random_state = 42) to ensure that the model could be evaluated on unseen data. The model was initialized with RandomForestRegressor(random_state = 42) to ensure reproducibility. Hyperparameters such as the number of trees (n_estimators), maximum depth (max_depth), and minimum samples required to split a node (min_samples_split) were tuned to prevent overfitting and improve model performance. The model was trained on the balanced dataset, which was achieved using SMOTE (Synthetic Minority Over-sampling Technique) to address any class imbalance. After training, the model’s performance was evaluated using metrics like Mean Squared Error (MSE), R-Squared (R2 Score), Mean Absolute Error (MAE), Explained Variance Score (EVS), and Max Error, which provide insights into the model’s accuracy and prediction capabilities. Finally, Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) were employed to explain the model’s predictions, helping interpret which features contributed the most to the model’s output, thereby enhancing model transparency and trust.
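The sketch below mirrors this workflow using the encoded X and y from the embedding step. The hyperparameter values shown are illustrative placeholders rather than the authors' tuned settings, and the SHAP call is a minimal example of the explanation step.

```python
# Fine-tuning sketch: split, balance the training set with SMOTE, fit a tuned
# Random Forest regressor, compute the regression metrics, and explain with SHAP.
import shap
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    mean_squared_error, r2_score, mean_absolute_error,
    explained_variance_score, max_error,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = RandomForestRegressor(
    n_estimators=200,        # number of trees (illustrative value)
    max_depth=12,            # maximum tree depth (illustrative value)
    min_samples_split=4,     # min samples to split a node (illustrative value)
    random_state=42,
)
model.fit(X_res, y_res)
y_pred = model.predict(X_test)

print("MSE :", mean_squared_error(y_test, y_pred))
print("R2  :", r2_score(y_test, y_pred))
print("MAE :", mean_absolute_error(y_test, y_pred))
print("EVS :", explained_variance_score(y_test, y_pred))
print("MaxE:", max_error(y_test, y_pred))

# Model-agnostic feature attributions for the test predictions
shap_values = shap.TreeExplainer(model).shap_values(X_test)
```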

Prediction using machine learning model on unbalanced data

A complete flow chart for the Machine Learning Model on Unbalanced Data is provided in Table 5.

Table 5 Flow chart for machine learning model on unbalanced data.

When the preprocessed dataset was applied to the Random Forest model, the results were not satisfactory, as highlighted in the evaluation metrics. The overall accuracy achieved was 75%, as shown in Table 6, which falls short of expectations for reliable model performance. Furthermore, the confusion matrix in Fig. 4 illustrates that the diagonal, representing correctly classified instances, is not strong, indicating significant misclassifications across several classes.

Table 6 Random-forest results based on unbalanced dataset.
Fig. 4 Random-forest confusion matrix based on unbalanced dataset.

Additionally, the comparison of actual and predicted values in Fig. 5 reveals poor overlap, further emphasizing the model's inability to make accurate predictions consistently. These observations suggest that the model struggled to capture the underlying patterns in the dataset, and further refinement in data preprocessing, feature engineering, or model selection might be necessary to improve performance.
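A sketch of this evaluation is given below, reusing the fitted rf classifier and the unbalanced test split from the pipeline sketch; the 75% accuracy is the figure reported above, not something the snippet guarantees to reproduce.

```python
# Evaluate the Random Forest on the unbalanced test split: accuracy,
# confusion matrix heatmap, and the first 40 actual vs. predicted ratings.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)                 # strong diagonal = good model
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted smog rating")
plt.ylabel("Actual smog rating")
plt.show()

# Overlay the first 40 actual and predicted values (as in Fig. 5)
plt.plot(range(40), y_test.values[:40], marker="o", label="Actual")
plt.plot(range(40), y_pred[:40], marker="x", label="Predicted")
plt.legend()
plt.show()
```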

Fig. 5 First 40 actual and predicted values from RF on unbalanced dataset.

When the preprocessed dataset was applied to the Explainable Boosting Classifier, the results were likewise unsatisfactory: as shown in Table 7, the accuracy achieved was 71%. The confusion matrix in Fig. 6 again reveals a weak diagonal, showing that the classifier had difficulty identifying the exact class of many instances. Moreover, Fig. 7 shows a low coefficient of determination and weak correlation between the actual and predicted values, indicating that the model cannot reliably reproduce the true ratings. These results imply that the model delivers suboptimal performance on the unbalanced data, and further improvement may be needed in preprocessing, feature extraction or selection, or model choice.
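The corresponding EBC evaluation, together with its built-in glass-box (model-specific) explanations, can be sketched as below, again reusing the ebc model and test split from the pipeline sketch above.

```python
# Evaluate the Explainable Boosting Classifier and inspect its own explanations
from sklearn.metrics import accuracy_score, confusion_matrix
from interpret import show

ebc_pred = ebc.predict(X_test)
print("EBC accuracy:", accuracy_score(y_test, ebc_pred))
print(confusion_matrix(y_test, ebc_pred))

# Model-specific explanations: global term importances and per-instance breakdowns
show(ebc.explain_global())
show(ebc.explain_local(X_test.iloc[:40], y_test.iloc[:40]))
```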

Table 7 Explainable-boosting-classifier results based on unbalanced dataset.
Fig. 6 Explainable-boosting-classifier confusion matrix based on unbalanced dataset.

Fig. 7 First 40 actual and predicted values from EBC on unbalanced dataset.

Prediction using machine learning model on balanced dataset

A complete flow chart for the machine learning model on balanced data is provided in Table 8.

Table 8 Flow chart for machine learning model on balanced data.

SMOTE (Synthetic Minority Over-sampling Technique) was applied to balance the class distribution in the dataset. Before balancing, classes such as SmogRating 5 had 2106 instances, while others such as SmogRating 2 and SmogRating 8 were severely underrepresented, with only 7 and 117 instances, respectively. After SMOTE was applied, synthetic samples were generated for the underrepresented classes, resulting in a balanced distribution in which each class had 2106 instances. Figure 8 shows the class distribution before and after balancing, highlighting the shift from imbalance to equal representation across all classes.
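A compact sketch of this balancing step, using the encoded X and y from the embedding step, is shown below; the counts in the comments are the figures reported above, not guaranteed output of the snippet.

```python
# Balance the SmogRating classes with SMOTE and compare class counts
from collections import Counter
from imblearn.over_sampling import SMOTE

print("Before:", Counter(y))        # e.g. rating 5: 2106, rating 8: 117, rating 2: 7
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
print("After :", Counter(y_bal))    # every class resampled to 2106 instances
```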

Table 9 Random-forest results based on balanced dataset.
Fig. 8 Unbalanced and balanced classes.

After applying the balanced dataset to the Random Forest model, the results improved markedly, with an accuracy of 86%, as shown in Table 9. The confusion matrix in Fig. 9 exhibits a strong diagonal, indicating a high level of correct classifications across all classes. Additionally, Fig. 10 displays the first 40 actual and predicted values, with a near-perfect overlap, highlighting the model's accurate predictions for the sample dataset. These results demonstrate the effectiveness of using a balanced dataset to improve model performance. Table 10 shows the EBC results based on the balanced dataset, Fig. 11 illustrates the EBC confusion matrix based on the balanced dataset, and Fig. 12 shows the first 40 actual and predicted values from EBC on the balanced dataset.

Fig. 9 Random-forest confusion matrix based on balanced dataset.

Fig. 10 First 40 actual and predicted values from random-forest on balanced dataset.

Fig. 11 EBC confusion matrix based on balanced dataset.

Fig. 12 First 40 actual and predicted values from EBC on balanced dataset.

After applying the balanced dataset to the Explainable Boosting Classifier (EBC), the results are similarly strong, with an accuracy of 86%, as shown in Table 10, and the diagonal of the confusion matrix in Fig. 11 is correspondingly strong. Figure 12 shows the first 40 actual and predicted values, with close overlap between them across the sampled instances.

Table 10 EBC results based on balanced dataset.



