A robust machine learning system for identifying AI-generated (deepfake) voices using MFCC features, spectral audio statistics, and a combination of Deep Learning (CNN) and Classical ML (Random Forest, KNN) models.
- Detects real vs fake speech with 98%+ accuracy
- Uses MFCCs, Mel-spectrogram statistics, and spectral features
- Implements 1D CNN, Random Forest, and KNN
- Includes full evaluation metrics with placeholders for all plots
- Built to support future deployment for fraud and security applications
- Audio format: 16 kHz, mono WAV
- Classes: Real, Fake
- Includes various SNR levels and noise-reduction methods
- Dataset contains class imbalance, handled using SMOTE
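The project uses imbalanced-learn's `SMOTE` for oversampling; the core idea can be sketched with a minimal NumPy helper (the function name and parameters below are hypothetical, for illustration only):

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between a
    sample and one of its k nearest neighbours (the core SMOTE idea)."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        d = np.linalg.norm(X_minority - x, axis=1)   # distances to all samples
        neighbours = np.argsort(d)[1:k + 1]          # skip the sample itself
        x_nb = X_minority[rng.choice(neighbours)]
        gap = rng.random()                           # interpolation factor in [0, 1)
        synthetic.append(x + gap * (x_nb - x))
    return np.vstack(synthetic)

fake = np.random.rand(20, 318)                       # minority-class feature vectors
new_samples = smote_like_oversample(fake, n_new=80)
print(new_samples.shape)                             # (80, 318)
```

In practice, `imblearn.over_sampling.SMOTE().fit_resample(X, y)` replaces this helper.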
- Extracted using Librosa
- Mean-pooled across time
- Used as input to the 1D CNN
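The mean-pooling step above can be sketched as follows; a synthetic array stands in for the Librosa MFCC output (the frame count of 120 is arbitrary):

```python
import numpy as np

# With librosa the per-file extraction would look like:
#   y, sr = librosa.load(path, sr=16000, mono=True)
#   mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)   # shape (40, T)
# Below, a synthetic MFCC matrix stands in for the librosa output.
mfcc = np.random.randn(40, 120)          # 40 coefficients over 120 frames

# Mean-pool across time to get a fixed-length 40-D vector,
# then add a channel axis for the 1D CNN input shape (40, 1).
mfcc_mean = mfcc.mean(axis=1)
cnn_input = mfcc_mean[:, np.newaxis]
print(cnn_input.shape)                   # (40, 1)
```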
- MFCC means
- Mel-Spectrogram means
- Log-Spectrogram (STFT) means
- Used for Random Forest and KNN models
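The engineered vector is the concatenation of the three groups of means. A minimal sketch, noting that only the 318-D total is stated here; the per-group dimensions below are assumptions:

```python
import numpy as np

# Stand-ins for the three per-file feature groups (real code would take
# time-axis means of librosa's mfcc, melspectrogram and log-STFT outputs).
mfcc_means = np.random.randn(40)     # MFCC means
mel_means  = np.random.randn(128)    # Mel-spectrogram means
stft_means = np.random.randn(150)    # log-STFT means (dim chosen so the
                                     # total is 318; the actual split may differ)

features = np.concatenate([mfcc_means, mel_means, stft_means])
print(features.shape)                # (318,)
```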
```
┌──────────────────── Preprocessing ────────────────────┐
│                                                       │
│  Audio → Load → Normalize → MFCC / Spectrogram        │
│                 Extraction → Features                 │
└───────────────────────────┬───────────────────────────┘
                            │
          ┌─────────────────┴─────────────────┐
          │                                   │
318-D Engineered Features            MFCC Map (40×1)
          │                                   │
  Random Forest / KNN                   1D CNN Model
          │                                   │
          └─────────────────┬─────────────────┘
                            │
                      Real / Fake
```
- Conv1D → Dropout
- MaxPooling
- Conv1D → Dropout
- Dense + Softmax
- Trained for 40 epochs
- Dev Accuracy: ~86%
- Eval Accuracy: ~88%
- Strong on detecting fake audio
- Mild overfitting
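The layer stack above can be sketched in Keras as follows; filter counts, kernel sizes, and dropout rates are assumptions (not stated here), so treat this as a shape-compatible sketch rather than the exact trained model:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(40, 1)),            # mean-pooled MFCC map
    layers.Conv1D(32, 3, activation="relu"),
    layers.Dropout(0.3),
    layers.MaxPooling1D(2),
    layers.Conv1D(64, 3, activation="relu"),
    layers.Dropout(0.3),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),  # Real / Fake
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=40, validation_data=(X_dev, y_dev))
print(model.output_shape)                   # (None, 2)
```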
- Accuracy: 98.82%
- High precision & recall on both classes
- Extremely robust to noise & dataset variance
- Accuracy: 98.29%
- Very stable across different samples
- k = 7 chosen for optimal performance
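Both classical models follow the standard scikit-learn pattern; the sketch below uses synthetic 318-D features in place of the real dataset, and `n_estimators=100` is an assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 318-D engineered features with real/fake labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 318))
y = rng.integers(0, 2, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
knn = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)   # k = 7 as above

print(rf.predict(X_te).shape, knn.predict(X_te).shape)      # (80,) (80,)
```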
| Model | Accuracy | Strengths | Weaknesses |
|---|---|---|---|
| Random Forest | **98.82%** | Best overall, robust to noise | Slow to train on huge datasets |
| KNN (k=7) | 98.29% | Simple & competitive | Slow inference on large data |
| CNN (MFCCs) | ~88% | Learns temporal patterns | Overfitting risk |
- Dataset imbalance required oversampling
- CNN performance limited by MFCC-only representation
- Needs evaluation on unseen deepfake generators
- Real-world recordings with background noise not fully tested
- Use 2D CNNs on spectrogram images
- Add transformer-based encoders (wav2vec 2.0, HuBERT, Whisper)
- Deploy as a web or mobile app for live detection
- Add adversarial robustness
- Add explainable AI for forensic usage
- Python
- Librosa β audio processing
- TensorFlow / Keras β CNN model
- scikit-learn β RF, KNN, SMOTE
- NumPy / pandas β preprocessing
Srujan Rana







