A CE-GAN-based approach to addressing data imbalance in network intrusion detection systems

Based on the distribution analysis of the NSL-KDD and UNSW-NB15 datasets and the limitations of classification algorithms, this study utilizes a Generative Adversarial Network (GAN) as the primary method to augment the minority-class samples in both datasets. The process flow is illustrated in Fig. 4. After normalizing the dataset, it is divided into majority and minority classes. The minority-class samples, along with normal samples, are fed into the GAN to generate new samples. Once validated, these new samples are merged with the majority class, completing the dataset augmentation. The augmented dataset is then fed into the classification algorithm; an improvement in classification performance demonstrates the effectiveness of the GAN-based augmentation proposed in this study.

Fig. 4 Data augmentation flowchart of the generative adversarial network.

Traditional machine learning and deep learning models rely on large amounts of high-quality, balanced data for training, so while performing data augmentation it is crucial to ensure that the augmented data does not reintroduce imbalance. In addition, network traffic data is highly complex and diverse, which makes it difficult for Generative Adversarial Networks (GANs) to capture meaningful features directly from the raw, high-dimensional data. To address this, a dimensionality-reduction encoding method is required, combined with conditional constraints that distinguish normal from anomalous data.

As illustrated in Fig. 5, we propose CE-GAN, a GAN that integrates conditional encoding. The model comprises four networks: an Encoder, a Decoder, a Generator, and a Discriminator. The Encoder and Decoder convert between high- and low-dimensional feature representations while leveraging the condition labels to further separate normal from anomalous samples, mitigating the potential impact of data imbalance from the input stage onward: the Encoder reduces the data's dimensionality, and the Decoder remaps it back into the high-dimensional space. The Generator and Discriminator are mainly responsible for generating and evaluating data: the Generator produces samples that satisfy the specified conditions, and the Discriminator evaluates both the authenticity of the generated data and whether it meets the conditional constraints. By incorporating this process of conditional dimensionality reduction and restoration, the model effectively preserves key data features and reduces the impact of data imbalance on model training.

Concretely, after receiving the low-dimensional features produced by the Encoder together with the conditional information, the Generator generates new data samples that meet the given conditions, with adversarial training continuously optimizing its generation capability to ensure the diversity and authenticity of the generated samples. The Discriminator distinguishes real from fake input samples while also determining whether they meet the given conditions. In this way, the Discriminator not only improves its ability to identify generated samples but also ensures that the generated samples remain consistent with the feature distribution of the real data.

Fig. 5 CE-GAN consists of four key components: an encoder, a decoder, a generator, and a discriminator. The encoder-decoder pair handles dimensional transformation, the generator creates samples from low-dimensional features, and the discriminator evaluates both sample authenticity and conditional constraints. The system is trained through multiple loss functions and incorporates modules such as normalization and classifier evaluation to achieve conditional generation.
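To make the architecture concrete, the following is a minimal PyTorch sketch of the four networks and the conditional concatenation they share. The layer sizes, latent dimension, and one-hot condition width are illustrative assumptions rather than the paper's reported hyperparameters, and plain MLPs stand in here for the Transformer blocks detailed later in Fig. 6.

```python
import torch
import torch.nn as nn

# Assumed sizes: e.g. 122 one-hot-encoded NSL-KDD features,
# a 32-dimensional latent code, and a normal/anomalous condition.
FEAT_DIM, LATENT_DIM, COND_DIM = 122, 32, 2

def mlp(in_dim, hidden, out_dim, out_act=None):
    """Two-layer perceptron used as a stand-in for each network's body."""
    layers = [nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

def with_cond(t, c):
    """Concatenate a batch of samples (or codes) with their one-hot conditions."""
    return torch.cat([t, c], dim=1)

# Encoder E(x, c): high-dimensional sample + condition -> low-dimensional code
encoder = mlp(FEAT_DIM + COND_DIM, 256, LATENT_DIM)
# Decoder(h, c): low-dimensional code + condition -> reconstructed sample
decoder = mlp(LATENT_DIM + COND_DIM, 256, FEAT_DIM, nn.Sigmoid())  # data normalized to [0, 1]
# Generator G(z, c): noise + condition -> synthetic sample
generator = mlp(LATENT_DIM + COND_DIM, 256, FEAT_DIM, nn.Sigmoid())
# Discriminator D(x, c): sample + condition -> probability that the sample is real
discriminator = mlp(FEAT_DIM + COND_DIM, 256, 1, nn.Sigmoid())
```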

In data augmentation tasks, enhancing the diversity of the augmented data should be the primary goal. We therefore conducted a comprehensive comparative analysis of the improvement in classification performance against the similarity between the generated and original samples. The pseudocode of the CE-GAN model is shown in Table 5, and the improvement in classification metrics serves as evidence of the effectiveness of the data augmentation algorithm. Additionally, we employed three evaluation metrics, Precision-Recall Distance (PRD), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE), to validate and quantify the data augmentation effect of CE-GAN. These generation metrics must remain within a reasonable range: neither too high, which would compromise authenticity, nor too low, which would sacrifice diversity. This ensures that the generated samples strike an optimal balance between diversity and authenticity.

Table 5 Pseudocode for CE-GAN implementation.
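As a reference for how the RMSE and MAE generation metrics can be computed, here is a small NumPy sketch. The inputs are assumed to be equally shaped feature matrices (for example, matched real and generated batches, or class-wise feature means); PRD is omitted because it requires the full precision-recall-curve machinery.

```python
import numpy as np

def rmse(real: np.ndarray, generated: np.ndarray) -> float:
    """Root Mean Square Error between real and generated feature matrices."""
    return float(np.sqrt(np.mean((real - generated) ** 2)))

def mae(real: np.ndarray, generated: np.ndarray) -> float:
    """Mean Absolute Error between real and generated feature matrices."""
    return float(np.mean(np.abs(real - generated)))
```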

During model training, we introduced multiple loss functions to constrain the training outcomes, including adversarial loss, reconstruction loss, diversity loss, and temporal loss. Each loss function plays a crucial role in different parts of the model, contributing to the overall effectiveness of the training process.

Adversarial loss is the core of the Generative Adversarial Network (GAN). Through the adversarial game between the Generator and the Discriminator, the Generator continuously improves the authenticity of its generated samples, while the Discriminator enhances its ability to distinguish real from fake samples and to verify condition matching. Specifically, the conditional adversarial loss is given in Eq. (6):

$$\mathcal{L}_{\text{adv}} = \mathbb{E}_{x, c \sim p_{\text{data}}(x, c)} \left[ \log D(x, c) \right] + \mathbb{E}_{z \sim p_{z}(z),\, c \sim p_{\text{data}}(c)} \left[ \log \left( 1 - D(G(z, c), c) \right) \right]$$

(6)

Here, x represents the real data, z the random noise vector, c the conditional information, G the Generator, and D the Discriminator.
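As a minimal sketch, the discriminator-side objective of Eq. (6) can be expressed in PyTorch as the familiar pair of binary cross-entropy terms, reusing the networks and `with_cond` helper from the sketch above. The Discriminator maximizes Eq. (6), which is equivalent to minimizing this BCE.

```python
bce = nn.BCELoss()

def discriminator_adv_loss(x_real, c_real, z, c_gen):
    """Eq. (6) for the D step: -E[log D(x,c)] - E[log(1 - D(G(z,c),c))]."""
    d_real = discriminator(with_cond(x_real, c_real))
    x_fake = generator(with_cond(z, c_gen)).detach()  # G is held fixed during the D step
    d_fake = discriminator(with_cond(x_fake, c_gen))
    return bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
```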

Reconstruction loss ensures effective data dimensionality reduction and restoration between the Encoder and Decoder. By combining the data with conditional information, it enlarges the distribution gap between normal and anomalous samples during the dimensionality-reduction process. The reconstruction loss is given in Eq. (7), where E denotes the Encoder and, overloading the notation of Eq. (6), D denotes the Decoder:

$$\mathcal{L}_{\text{rec}} = \mathbb{E}_{x, c \sim p_{\text{data}}(x, c)} \left[ \Vert x - D(E(x, c), c) \Vert_{1} \right]$$

(7)
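A direct PyTorch rendering of Eq. (7), reusing the encoder and decoder from the earlier sketch:

```python
def reconstruction_loss(x, c):
    """Eq. (7): L1 distance between x and its conditional reconstruction."""
    h = encoder(with_cond(x, c))       # conditional dimensionality reduction
    x_hat = decoder(with_cond(h, c))   # remap back to the high-dimensional space
    return torch.mean(torch.abs(x - x_hat))
```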

Diversity loss ensures the diversity of the generated samples, preventing the Generator from overfitting to a narrow set of outputs, which would undermine the purpose of using a GAN, while still keeping the generated samples as realistic as possible. The diversity loss takes the form given in Eq. (8):

$$\mathcal{L}_{\text{cst}} = -\frac{1}{n} \sum_{i=1}^{n} \min_{j \neq i} \left( \sqrt{\sum_{k=1}^{d} \left( X_{i,k} - X_{j,k} \right)^{2}} \right)$$

(8)

Here, \(\sqrt{\sum_{k=1}^{d} \left( X_{i,k} - X_{j,k} \right)^{2}}\) is the Euclidean distance between samples \(X_i\) and \(X_j\), and \(\min_{j \neq i}(\cdot)\) selects the minimum distance from each sample \(X_i\) to every other sample \(X_j\). Because \(\mathcal{L}_{\text{cst}}\) enters the overall training objective as a small negative mean, minimizing it increases the nearest-neighbour distances and thus encourages diversity, while its small magnitude prevents the generated samples from losing authenticity in the pursuit of diversity.
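Eq. (8) can be sketched in PyTorch with a pairwise-distance matrix; the diagonal is masked so that each sample's nearest neighbour excludes itself, matching the \(j \neq i\) restriction:

```python
def diversity_loss(x_gen):
    """Eq. (8): negative mean nearest-neighbour distance within a generated batch."""
    n = x_gen.size(0)
    dist = torch.cdist(x_gen, x_gen)            # pairwise Euclidean distances
    eye = torch.eye(n, dtype=torch.bool, device=x_gen.device)
    dist = dist.masked_fill(eye, float('inf'))  # exclude the trivial j = i distance
    return -dist.min(dim=1).values.mean()       # minimizing this pushes samples apart
```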

Temporal loss ensures that, in the low-dimensional space, the generated samples remain similar overall to the original data, enhancing the reliability of the generated data. This loss takes the form given in Eq. (9):

$$\mathcal{L}_{\text{mmt}} = \mathbb{E}_{x \sim p_{\text{data}}(x),\, z \sim p_{z}(z),\, c \sim p_{\text{data}}(c)} \left[ \Vert E(x, c) - E(G(z, c), c) \Vert_{1} \right]$$

(9)
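A direct PyTorch rendering of Eq. (9), again reusing the networks from the earlier sketch:

```python
def temporal_loss(x_real, z, c):
    """Eq. (9): L1 distance between encoder codes of real and generated samples."""
    h_real = encoder(with_cond(x_real, c))
    h_fake = encoder(with_cond(generator(with_cond(z, c)), c))
    return torch.mean(torch.abs(h_real - h_fake))
```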

By combining the four loss functions, the model’s training process is constrained from multiple aspects, accelerating the convergence speed and ensuring both the authenticity and diversity of the generated samples. This balance between authenticity and diversity enhances the quality of the samples generated by the CE-GAN.
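One schematic way to combine the terms on the generator side is a weighted sum, as sketched below. The weights are illustrative assumptions, the generator's adversarial term uses the common non-saturating form rather than the minimax form of Eq. (6), and in practice the Discriminator is updated in an alternating step with its own objective.

```python
LAMBDA_REC, LAMBDA_CST, LAMBDA_MMT = 1.0, 0.1, 1.0  # assumed weights, not reported values

def generator_total_loss(x, c, z):
    """Schematic combined objective for one generator update step."""
    x_fake = generator(with_cond(z, c))
    d_fake = discriminator(with_cond(x_fake, c))
    adv = bce(d_fake, torch.ones_like(d_fake))  # G tries to make D label fakes as real
    return (adv
            + LAMBDA_REC * reconstruction_loss(x, c)
            + LAMBDA_CST * diversity_loss(x_fake)
            + LAMBDA_MMT * temporal_loss(x, z, c))
```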

Fig. 6 Detailed structure of the CE-GAN model.

As shown in Fig. 6, we have outlined the model architectures used by each network in CE-GAN. Given the high complexity and diversity inherent in network attacks, the multi-head attention mechanism of the Transformer allows for flexible extraction of rich feature representations from the input data. To ensure that both the Generator and Discriminator in the CE-GAN model have equal capabilities, we utilized the Transformer architecture for both, enabling them to handle complex data with consistent feature extraction and representation abilities. Through the multi-head attention mechanism, the model can focus on different parts of the data simultaneously, identifying hidden attack features within a broader feature space. This capability is crucial for recognizing complex network attack behaviors.
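A compact sketch of how such a Transformer backbone might be shared by the Generator and Discriminator is given below. The token count, embedding width, head count, and layer depth are illustrative assumptions; the flat feature vector is reshaped into a short token sequence so that multi-head attention can attend to different parts of the features.

```python
import torch
import torch.nn as nn

class TransformerBackbone(nn.Module):
    """Projects a flat input into a token sequence and applies self-attention."""
    def __init__(self, in_dim, d_model=64, n_heads=4, n_layers=2, n_tokens=8):
        super().__init__()
        self.n_tokens, self.d_model = n_tokens, d_model
        self.embed = nn.Linear(in_dim, n_tokens * d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        tokens = self.embed(x).view(-1, self.n_tokens, self.d_model)
        return self.body(tokens).mean(dim=1)  # pooled feature representation

class TransformerGenerator(nn.Module):
    """Generator head on the shared backbone design."""
    def __init__(self, latent_dim, cond_dim, out_dim):
        super().__init__()
        self.backbone = TransformerBackbone(latent_dim + cond_dim)
        self.head = nn.Sequential(nn.Linear(64, out_dim), nn.Sigmoid())  # 64 = backbone d_model

    def forward(self, z, c):
        return self.head(self.backbone(torch.cat([z, c], dim=1)))

# The Discriminator mirrors this design with a single sigmoid output unit that
# judges the authenticity (and condition consistency) of each sample.
```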



