State Space Models (Griffin/Mamba) vs. Transformers for Sequential Anomaly Detection
This repository contains a comprehensive Proof-of-Concept demonstrating the advantages of State Space Models (SSMs) over standard Transformer architectures for sequential anomaly detection. By leveraging the Griffin architecture (the foundation of Google's RecurrentGemma) and Mamba, we achieve comparable accuracy to Transformers while maintaining O(1) memory complexity and handling significantly longer sequences.
The repository is organized to facilitate academic review and further research into the application of SSMs in financial fraud, ECG anomaly detection, and market surveillance.
The core hypothesis of this research is that State Space Models provide a decisive advantage in domains where long-range temporal dependencies are critical. While Transformers suffer from quadratic O(n²) memory scaling with sequence length, SSMs process sequences at O(1) cost per event, enabling effectively unbounded context windows. Key findings:
- Memory Scaling: Griffin handles sequences up to 4,096 tokens on a standard Tesla T4 GPU, whereas a comparable Transformer is limited to 512 tokens and runs out of memory (OOM) at 1,024. This is an 8x larger context window on identical hardware.
- Accuracy Parity: On real-world datasets, SSMs achieve parity with Transformers. In the ECG anomaly detection task, Griffin achieved 95.7% accuracy, compared to the Transformer's 95.8%.
- Latency and Throughput: SSMs offer superior streaming capabilities. In the Bitcoin anomaly detection benchmark, Griffin demonstrated a processing speed of 534 ticks/second with O(1) memory, allowing for continuous 24/7 streaming without windowing constraints.
- Early Warning Capabilities: The Griffin model predicted 96% of premature ventricular contraction (PVC) clusters in the ECG data before onset, using only its current recurrent hidden state.
ssm-sequential-anomaly-detection/
├── README.md # This document
├── requirements.txt # Python dependencies
├── notebooks/
│ └── SSM_Fraud_Detection_POC.ipynb # Main Colab notebook
├── src/ # Core model implementations and utilities
│ ├── griffin_model.py # RG-LRU Griffin implementation (PyTorch)
│ ├── train.py # Training loops and utilities
│ ├── future_predictor.py # Next-step prediction architecture
│ ├── data.py # General data loading utilities
│ ├── btc_*.py # Bitcoin anomaly detection code
│ └── ecg_*.py # ECG anomaly detection code
├── results/ # Raw JSON output from benchmark runs
│ ├── ecg_results.json # Detailed ECG metrics and memory scaling
│ └── btc_results.json # Bitcoin anomaly detection metrics
└── figures/ # Generated visualizations and dashboards
├── summary_dashboard.png
├── btc_anomaly_dashboard.png
├── memory_scaling.png
├── memory_complexity.png
└── latency_comparison.png
The main notebook focuses on the IBM Credit Card Transactions dataset, which provides a realistic environment for sequential modeling; a sketch of the per-user sequence construction follows the table below.
| Property | Value |
|---|---|
| Transactions | 24.4 million |
| Consumers | 2,000 synthetic users |
| Avg. transactions per user | ~12,000 |
| Fraud rate | 0.122% |
| Objective | Show that SSMs can replace hand-engineered features (e.g., rolling transaction velocity) by learning long-range patterns directly from raw sequential data |
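To illustrate what "raw sequential data" means here, the snippet below is a minimal sketch of grouping transactions into per-user, time-ordered sequences. It is not the repository's `src/data.py`, and the column names (`user_id`, `timestamp`, `amount`, `is_fraud`) are assumptions that must be adapted to the actual Kaggle schema.

```python
import numpy as np
import pandas as pd

def build_user_sequences(df: pd.DataFrame, max_len: int = 4096):
    """Group raw transactions into per-user, time-ordered sequences.

    No rolling-velocity or other hand-crafted features are computed; the
    sequence model is expected to learn temporal patterns from the raw events.
    Column names are hypothetical placeholders.
    """
    df = df.sort_values(["user_id", "timestamp"])
    sequences, labels = [], []
    for _, user_txns in df.groupby("user_id"):
        feats = user_txns[["amount"]].to_numpy(dtype=np.float32)
        sequences.append(feats[-max_len:])                       # most recent events only
        labels.append(user_txns["is_fraud"].to_numpy()[-max_len:])
    return sequences, labels
```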
Using the MIT-BIH Arrhythmia Database, we compared Griffin and Transformer architectures on continuous time-series data.
| Property | Value |
|---|---|
| Total beats | 109,451 across 48 recordings |
| Griffin F1 | 0.89 |
| Transformer F1 | 0.89 |
| Griffin max sequence | 4,096 tokens |
| Transformer max sequence | 512 tokens (OOM at 1,024) |
This benchmark demonstrates the streaming capabilities of SSMs on real-time financial data; a sketch of the streaming pattern follows the table below.
| Metric | Griffin (RG-LRU) | Transformer |
|---|---|---|
| Events detected | 67/67 (100%) | 67/67 (100%) |
| Avg. early warning | 114s | 111s |
| Memory per tick | O(1) | O(n²) |
| Inference speed | 534 ticks/s | 459 ticks/s |
| Can stream 24/7 | Yes | Window only |
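As a rough illustration of the "can stream 24/7" row, the loop below assumes a stateful recurrent model exposing a hypothetical `step(x_t, h)` method that updates a fixed-size hidden state; the actual `btc_*.py` scripts may structure this differently. A Transformer, by contrast, must re-attend over a sliding window of recent ticks at every step.

```python
import torch

def stream_anomaly_scores(model, tick_stream, threshold: float = 0.99):
    """Score a stream of ticks with a stateful recurrent model.

    `model.step(x_t, h)` is a hypothetical interface that returns an anomaly
    score and the updated fixed-size hidden state, so memory stays O(1) no
    matter how long the stream runs.
    """
    h = None                                      # recurrent state, size independent of stream length
    alerts = []
    with torch.no_grad():
        for t, x_t in enumerate(tick_stream):     # x_t: (1, feature_dim) tensor
            score, h = model.step(x_t, h)         # O(1) work and memory per tick
            if score > threshold:
                alerts.append((t, float(score)))  # flag the tick as anomalous
    return alerts
```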
The Griffin architecture combines two key components to achieve efficient long-context modeling (a minimal sketch of the recurrence follows this list):
- RG-LRU (Real-Gated Linear Recurrent Unit): A gated linear recurrence with a diagonal state matrix, providing O(n) memory and compute complexity.
- Local Sliding Window Attention: Attention limited to a fixed window size, capturing local patterns without quadratic scaling.
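The snippet below is a minimal, illustrative PyTorch implementation of the RG-LRU recurrence as described in the Griffin paper (De et al., 2024). It is a sketch rather than the repository's `src/griffin_model.py`; details such as the gating projections, initialization, and the constant `c` are assumptions.

```python
import torch
import torch.nn as nn

class RGLRUSketch(nn.Module):
    """Minimal RG-LRU recurrence over channels (illustrative sketch only)."""

    def __init__(self, dim: int, c: float = 8.0):
        super().__init__()
        self.c = c                                    # exponent scale from the paper
        self.input_gate = nn.Linear(dim, dim)         # produces i_t
        self.recurrence_gate = nn.Linear(dim, dim)    # produces r_t
        self.Lambda = nn.Parameter(torch.randn(dim))  # a = sigmoid(Lambda): diagonal state matrix

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); the recurrent state h is O(1) in sequence length.
        batch, seq_len, dim = x.shape
        h = torch.zeros(batch, dim, device=x.device, dtype=x.dtype)
        a = torch.sigmoid(self.Lambda)                # per-channel decay in (0, 1)
        outputs = []
        for t in range(seq_len):
            x_t = x[:, t]
            r_t = torch.sigmoid(self.recurrence_gate(x_t))   # recurrence gate
            i_t = torch.sigmoid(self.input_gate(x_t))        # input gate
            a_t = a.pow(self.c * r_t)                        # a_t = a^(c * r_t)
            h = a_t * h + torch.sqrt(1.0 - a_t ** 2) * (i_t * x_t)
            outputs.append(h)
        return torch.stack(outputs, dim=1)            # (batch, seq_len, dim)
```

In the full Griffin block this recurrence is interleaved with local sliding-window attention and MLP layers; only the recurrence is shown here.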
The notebook also evaluates the Mamba architecture, which pairs a selective state space layer, whose state-transition parameters are computed from the input, with a 1D convolution, processing sequences in linear time while letting the state dynamics adapt to the content of the sequence.
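For comparison, here is an equally rough sketch of the selective state-space recurrence at the core of Mamba (Gu & Dao, 2023), written as an explicit per-step loop rather than the parallel scan and fused kernels a real implementation uses; class and projection names are placeholders, and the depthwise 1D convolution that precedes the SSM in a full Mamba block is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSMSketch(nn.Module):
    """Illustrative selective SSM recurrence (Mamba-style); not the repo's code."""

    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.state_dim = state_dim
        self.log_A = nn.Parameter(torch.zeros(dim, state_dim))  # A = -exp(log_A) stays negative
        # Delta, B and C are computed from the input -- the "selective" part.
        self.delta_proj = nn.Linear(dim, dim)
        self.B_proj = nn.Linear(dim, state_dim)
        self.C_proj = nn.Linear(dim, state_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); hidden state h has a fixed size regardless of seq_len.
        b, L, d = x.shape
        A = -torch.exp(self.log_A)                          # (dim, state_dim)
        h = torch.zeros(b, d, self.state_dim, device=x.device, dtype=x.dtype)
        ys = []
        for t in range(L):
            x_t = x[:, t]                                   # (b, d)
            delta = F.softplus(self.delta_proj(x_t))        # input-dependent step size
            B_t, C_t = self.B_proj(x_t), self.C_proj(x_t)   # (b, state_dim) each
            A_bar = torch.exp(delta.unsqueeze(-1) * A)      # discretized state transition
            h = A_bar * h + (delta * x_t).unsqueeze(-1) * B_t.unsqueeze(1)
            ys.append((h * C_t.unsqueeze(1)).sum(-1))       # y_t = C_t . h_t, per channel
        return torch.stack(ys, dim=1)                       # (b, L, d)
```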
| Sequence Length | Griffin (MB) | Transformer (MB) |
|---|---|---|
| 256 | 1.5 | 2.9 |
| 512 | 1.7 | 6.3 |
| 1,024 | 2.1 | 19.3 |
| 2,048 | 2.9 | 70.4 |
| 4,096 | 4.4 | 273.3 |
| 8,192 | 7.6 | 1,081.7 |
| 16,384 | 13.9 | 4,309.3 |
| 32,768 | 26.5 | 17,206.7 |
Griffin's memory grows linearly with sequence length, O(n), while the Transformer's grows quadratically, O(n²), and runs out of memory at 8,192+ tokens on a T4 GPU; the sketch below shows one way such peak-memory measurements can be collected.
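As a rough illustration of how numbers like those above can be gathered (the repository's actual benchmark code may differ), one can reset and read PyTorch's peak-memory counters around a forward/backward pass at each sequence length; the `griffin` and `transformer` variables in the usage comment are hypothetical model instances.

```python
import torch

def peak_memory_mb(model: torch.nn.Module, seq_len: int, dim: int, device: str = "cuda") -> float:
    """Peak GPU memory (MB) for one forward/backward pass at a given sequence length."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.randn(1, seq_len, dim, device=device)
    model(x).sum().backward()                  # backward keeps activations, as training would
    return torch.cuda.max_memory_allocated(device) / 1024 ** 2

# Hypothetical usage:
# for n in (256, 512, 1024, 2048, 4096):
#     print(n, peak_memory_mb(griffin, n, dim=64), peak_memory_mb(transformer, n, dim=64))
```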
To reproduce the results, we recommend using Google Colab with a T4 GPU.
- Open `notebooks/SSM_Fraud_Detection_POC.ipynb` in Google Colab.
- Ensure a GPU runtime is selected (Runtime -> Change runtime type -> T4 GPU).
- Follow the instructions in the notebook to provide Kaggle credentials for dataset downloading (one common credential-setup pattern is sketched after this list).
- Execute the cells sequentially to train the models and generate the evaluation metrics.
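The notebook handles credentials itself, but for reference, a common Colab pattern is to upload the `kaggle.json` API token and place it where the Kaggle CLI expects it. This is a generic sketch and may differ from what the notebook actually does.

```python
import os
from google.colab import files  # available only inside Google Colab

# Upload the kaggle.json API token downloaded from your Kaggle account settings,
# then place it where the kaggle CLI expects it.
uploaded = files.upload()
kaggle_dir = os.path.expanduser("~/.kaggle")
os.makedirs(kaggle_dir, exist_ok=True)
with open(os.path.join(kaggle_dir, "kaggle.json"), "wb") as f:
    f.write(uploaded["kaggle.json"])
os.chmod(os.path.join(kaggle_dir, "kaggle.json"), 0o600)  # the CLI warns if permissions are looser
```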
For local execution of the Python scripts, install the required dependencies:
`pip install -r requirements.txt`

References:

- Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752
- De, S., et al. (2024). Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. arXiv:2402.19427
- Moody, G.B., & Mark, R.G. (2001). The impact of the MIT-BIH Arrhythmia Database. IEEE Engineering in Medicine and Biology Magazine, 20(3), 45-50.