A Python implementation of clustering techniques applied to the KDD Cup 1999 Intrusion Detection System (IDS) dataset, demonstrating both 2D and 3D visualizations with and without Principal Component Analysis (PCA).
This repository contains code for analyzing and visualizing network intrusion detection data using K-means clustering. It shows how clustering can surface patterns in network traffic that may indicate different types of attacks. The implementation covers:
- Data preprocessing and cleaning techniques
- Attack categorization and classification
- Dimensionality reduction using PCA
- 2D and 3D visualization of clustering results
The project uses the KDD Cup 1999 Intrusion Detection System dataset, which contains a wide variety of simulated intrusions in a military network environment. The dataset includes:
- Normal connections
- Four main categories of attacks:
  - Denial of Service (DoS)
  - User to Root (U2R)
  - Remote to Local (R2L)
  - Probing
Each connection in the dataset is represented by 41 features and labeled as either normal or a specific type of attack.
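As a rough illustration, the grouping of specific attack labels into the broad categories could be expressed as a lookup table. This is a minimal sketch, not the repository's actual code: the names `ATTACK_CATEGORIES`, `LABEL_TO_CATEGORY`, and `categorize` are hypothetical, and the label lists follow the commonly used grouping of the KDD Cup 1999 training labels.

```python
# Commonly used grouping of the raw KDD Cup 1999 training labels into broad
# categories. All names in this sketch are illustrative, not from main.py.
ATTACK_CATEGORIES = {
    "dos": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "u2r": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "r2l": ["ftp_write", "guess_passwd", "imap", "multihop", "phf",
            "spy", "warezclient", "warezmaster"],
    "probe": ["ipsweep", "nmap", "portsweep", "satan"],
}

# Invert into a flat lookup: raw label -> category.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ATTACK_CATEGORIES.items()
    for label in labels
}
LABEL_TO_CATEGORY["normal"] = "normal"


def categorize(raw_label: str) -> str:
    """Map a raw label such as 'smurf.' to one of the five categories."""
    # Labels in the raw file end with a period, so strip it first.
    cleaned = raw_label.strip().rstrip(".")
    # Unrecognized labels stay visible instead of being silently mislabeled.
    return LABEL_TO_CATEGORY.get(cleaned, "unknown")
```

Returning `"unknown"` for unrecognized labels is a design choice made here so that unexpected values show up during inspection rather than being folded into an existing category.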
To run this code, you'll need:
- Python 3.x
- The following Python libraries:
  - pandas
  - numpy
  - scikit-learn
  - matplotlib
- Clone this repository:
      git clone https://github.com/username/network-intrusion-clustering.git
      cd network-intrusion-clustering
- Install the required dependencies:
      pip install pandas numpy scikit-learn matplotlib
- Ensure the KDD Cup dataset file (kddcup.data_10_percent) is in the root directory of the project.
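With the dataset file in place, a quick way to check that it loads correctly might look like the sketch below. It assumes the file is headerless CSV with 41 feature columns followed by one label column; the function name `load_kdd` and the placeholder column names `f0`..`f40` are hypothetical and not taken from the repository.

```python
import pandas as pd

# Assumption: headerless CSV, 41 feature columns followed by one label column.
N_FEATURES = 41


def load_kdd(source):
    """Read the dataset from a path or file-like object into a DataFrame."""
    df = pd.read_csv(source, header=None)
    if df.shape[1] != N_FEATURES + 1:
        raise ValueError(
            f"expected {N_FEATURES + 1} columns, found {df.shape[1]}"
        )
    # Placeholder names f0..f40 for the features; the label comes last.
    df.columns = [f"f{i}" for i in range(N_FEATURES)] + ["label"]
    # Raw labels carry a trailing period ("normal.", "smurf."); drop it.
    df["label"] = df["label"].str.rstrip(".")
    return df
```

Calling `load_kdd("kddcup.data_10_percent")` from the project root would then return one row per connection, and the column check fails early if a different file was downloaded by mistake.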
Run the main script to perform clustering and visualization:
python main.py
This will:
- Load and preprocess the KDD Cup dataset
- Apply K-means clustering with 5 clusters
- Generate four visualization plots:
  - 2D clustering without PCA
  - 2D clustering with PCA
  - 3D clustering without PCA
  - 3D clustering with PCA
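The clustering and projection steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the contents of main.py: the helper names `cluster_and_project` and `plot_clusters` are hypothetical, standardizing the features before K-means is an assumption, and the "without PCA" variant simply plots the first two or three scaled features, which may differ from the columns the script actually uses. Synthetic data stands in for the preprocessed feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_and_project(X, n_clusters=5, n_components=2, use_pca=True, seed=0):
    """Cluster X with K-means and return (labels, coordinates for plotting)."""
    # Standardize so features with large ranges do not dominate distances.
    # (Assumption: the original script may scale differently or not at all.)
    X_scaled = StandardScaler().fit_transform(X)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(X_scaled)
    if use_pca:
        coords = PCA(n_components=n_components).fit_transform(X_scaled)
    else:
        # Without PCA, use the first few scaled features as plot axes.
        coords = X_scaled[:, :n_components]
    return labels, coords


def plot_clusters(coords, labels, title):
    """Scatter-plot 2D or 3D coordinates, colored by cluster label."""
    # Imported here so the clustering helper works without matplotlib.
    import matplotlib.pyplot as plt

    fig = plt.figure()
    is_3d = coords.shape[1] == 3
    ax = fig.add_subplot(projection="3d" if is_3d else None)
    ax.scatter(*coords.T, c=labels, s=8)
    ax.set_title(title)
    return fig


# Synthetic stand-in for the preprocessed feature matrix (not real KDD data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
labels_2d, coords_2d = cluster_and_project(X_demo, n_components=2)
labels_3d, coords_3d = cluster_and_project(X_demo, n_components=3, use_pca=False)
```

Passing `coords_2d` or `coords_3d` with the matching labels to `plot_clusters` produces one figure per call, so the four plots correspond to the four combinations of `n_components` (2 or 3) and `use_pca` (True or False).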
- Data Preprocessing: Handles duplicates and outliers, and encodes categorical features
- Attack Categorization: Labels each connection as one of five categories (normal, DoS, U2R, R2L, probe)
- Flexible Sampling: Supports adjustable dataset sampling for testing and development
- Multiple Visualization Options: Provides both 2D and 3D visualizations with and without dimensionality reduction
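For orientation, the preprocessing and sampling features might be approximated as in the sketch below. It is a simplified illustration, not the repository's implementation: the function name `preprocess` and its parameters are hypothetical, outlier handling is left out because the method used is not described here, and integer encoding via `pd.factorize` is one possible choice among several. The column names are the three categorical features of the KDD Cup 1999 dataset.

```python
import pandas as pd

# The three categorical features in the KDD Cup 1999 dataset.
CATEGORICAL_COLUMNS = ["protocol_type", "service", "flag"]


def preprocess(df, sample_frac=1.0, seed=0):
    """Drop duplicate rows, optionally subsample, and encode categoricals."""
    out = df.drop_duplicates()
    if sample_frac < 1.0:
        # A fixed seed keeps development runs reproducible.
        out = out.sample(frac=sample_frac, random_state=seed)
    out = out.reset_index(drop=True)
    for col in CATEGORICAL_COLUMNS:
        # factorize assigns 0, 1, 2, ... in order of first appearance.
        out[col] = pd.factorize(out[col])[0]
    return out


# Tiny hand-made frame standing in for the real data (values are made up).
demo = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp", "tcp"],
    "service": ["http", "domain_u", "http", "ecr_i", "smtp"],
    "flag": ["SF", "SF", "SF", "SF", "S0"],
    "src_bytes": [181, 45, 181, 1032, 0],
})
clean = preprocess(demo)
```

A smaller `sample_frac`, for example `preprocess(demo, sample_frac=0.5)`, keeps only that fraction of the deduplicated rows, which is useful for quick test runs before clustering the full dataset.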
This project is licensed under the MIT License - see the LICENSE file for details.