A comprehensive pipeline for clustering videos from the Something-Something dataset based on visual content using deep learning embeddings.
- Frame-level Embedding Extraction: Uses pre-trained ResNet-50 to extract visual features
- Dimensionality Reduction: Optional PCA for noise reduction and faster processing
- Clustering: K-Means clustering for grouping similar videos
- Cluster Quality Evaluation: Silhouette Score, Calinski-Harabasz Index, Davies-Bouldin Index
- Cluster Representatives: Automatically finds representative videos for each cluster
- Similarity Search: FAISS-based fast similarity search with query interface
- Visualization: UMAP-based 2D visualization of clusters
- Install required packages:

```bash
pip install -r requirements.txt
```

Note: For GPU acceleration, install faiss-gpu instead of faiss-cpu:

```bash
pip install faiss-gpu
```

Process videos in the current directory and cluster them:

```bash
python video_clustering.py
```

Or with full options:

```bash
python video_clustering.py \
  --data-dir . \
  --output-dir output \
  --n-clusters 10 \
  --max-videos 1000 \
  --skip-frames 4 \
  --pca-components 128 \
  --extract-archives
```

- `--data-dir`: Directory containing video files or archives (default: current directory)
- `--output-dir`: Output directory for results (default: `output`)
- `--n-clusters`: Number of clusters for K-Means (default: 10)
- `--max-videos`: Maximum number of videos to process (default: all)
- `--skip-frames`: Number of frames to skip between samples (default: 4)
- `--pca-components`: Number of PCA components for dimensionality reduction (default: 128)
- `--no-pca`: Disable PCA dimensionality reduction
- `--extract-archives`: Extract videos from tar archives (if archives are detected)
- `--load-embeddings`: Load pre-computed embeddings from file (saves time on re-runs)
- `--normalize`: L2-normalize embeddings before clustering (recommended for better results)
- `--clustering-method`: Clustering algorithm - `kmeans`, `dbscan`, `hdbscan`, or `agglomerative` (default: `kmeans`)
- `--dbscan-eps`: DBSCAN eps parameter (default: 0.5)
- `--dbscan-min-samples`: DBSCAN min_samples parameter (default: 5)
- `--hdbscan-min-cluster-size`: HDBSCAN min_cluster_size parameter (default: 5)
- `--use-cosine`: Use cosine similarity for the FAISS index (requires `--normalize`)
A quick test run:

```bash
python video_clustering.py --max-videos 100 --n-clusters 5
```

A larger run with custom settings:

```bash
python video_clustering.py \
  --n-clusters 20 \
  --skip-frames 8 \
  --pca-components 256 \
  --output-dir results
```

Reusing saved embeddings:

```bash
# First run: extract embeddings
python video_clustering.py --max-videos 500

# Second run: use saved embeddings with different cluster count
python video_clustering.py \
  --load-embeddings output/video_embeddings.npz \
  --n-clusters 15
```

The pipeline automatically evaluates cluster quality using multiple metrics:
- Silhouette Score: Measures how well videos are separated into clusters (range: -1 to 1, higher is better)
- Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance (higher is better)
- Davies-Bouldin Index: Average similarity ratio of clusters (lower is better)
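All three metrics are available in scikit-learn given the embeddings and cluster labels; a minimal sketch on synthetic blobs standing in for video embeddings:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Synthetic stand-in for per-video embeddings.
X, _ = make_blobs(n_samples=200, centers=4, n_features=32, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print(f"Silhouette:        {silhouette_score(X, labels):.3f}")         # higher is better
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, labels):.1f}")  # higher is better
print(f"Davies-Bouldin:    {davies_bouldin_score(X, labels):.3f}")     # lower is better
```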
You can also evaluate clusters separately:
```bash
python evaluate_clusters.py --results-file output/clustering_results.npz
```

Query for similar videos using the pre-built FAISS index:

```bash
python query_similar_videos.py --query-index 0 --k 5
```

Or query by file:

```bash
python query_similar_videos.py --query-video path/to/video.mp4 --k 5
```

This is useful for:
- Content-based recommendation: Find videos similar to a given video
- Duplicate detection: Identify near-duplicate videos
- Cluster exploration: Understand what makes videos similar
The pipeline generates several output files in the output directory:
- `video_embeddings.npz`: Saved video embeddings and file paths
- `clustering_results.npz`: Cluster labels, reduced embeddings, and quality metrics
- `faiss_index.bin`: FAISS index for fast similarity search
- `cluster_visualization.png`: 2D UMAP visualization of clusters
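The `.npz` artifacts can be reloaded with plain NumPy for downstream analysis. A sketch of the round trip (the key names `embeddings` and `paths` are assumptions, not confirmed from the script; a stand-in file is written first so the example is self-contained):

```python
import os
import tempfile
import numpy as np

# Stand-in for the pipeline's saved output; key names are assumed.
emb = np.random.rand(10, 128).astype(np.float32)
paths = np.array([f"video_{i}.mp4" for i in range(10)])

out = os.path.join(tempfile.mkdtemp(), "video_embeddings.npz")
np.savez(out, embeddings=emb, paths=paths)

data = np.load(out)
print(data["embeddings"].shape, data["paths"][0])  # (10, 128) video_0.mp4
```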
- Video Loading: Finds video files in the specified directory or extracts from archives
- Feature Extraction:
  - Samples frames from each video (with configurable skip rate)
  - Extracts embeddings using ResNet-50 (pre-trained on ImageNet)
  - Averages frame embeddings to get a single vector per video
- Dimensionality Reduction: Optional PCA to reduce noise and speed up clustering
- Clustering: K-Means clustering groups similar videos
- Quality Evaluation: Computes multiple metrics to assess cluster quality
- Representative Selection: Finds videos closest to cluster centroids
- Similarity Search: FAISS index enables fast nearest neighbor queries
- Visualization: UMAP projects high-dimensional embeddings to 2D for visualization
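The steps above, minus video decoding and the real CNN, can be sketched end to end with scikit-learn (random vectors stand in for frame embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Steps 1-2 (stand-in): per-video frame embeddings, mean-pooled to one vector.
videos = [rng.random((int(rng.integers(8, 16)), 2048)) for _ in range(50)]
emb = np.stack([v.mean(axis=0) for v in videos])            # (50, 2048)

# Step 3: optional PCA for noise reduction and speed.
reduced = PCA(n_components=16, random_state=0).fit_transform(emb)

# Step 4: K-Means clustering.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(reduced)

# Representative selection: the video closest to each cluster centroid.
reps = []
for c in range(km.n_clusters):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(reduced[members] - km.cluster_centers_[c], axis=1)
    reps.append(int(members[np.argmin(dists)]))
print(reps)  # one representative video index per cluster
```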
- GPU Acceleration: The pipeline automatically uses GPU if available for faster embedding extraction
- Batch Processing: For large datasets, process in batches using `--max-videos`
- Skip Frames: Increase `--skip-frames` for faster processing (at the cost of some temporal information)
- PCA: Use PCA to reduce dimensionality and speed up clustering (recommended for >1000 videos)
For larger datasets:
- Distributed Processing: Process videos in parallel across multiple machines
- Vector Database: Use Milvus or Pinecone for millions of vectors
- Video Models: Consider I3D, C3D, or TimeSformer for temporal embeddings (better for action recognition)
- Cloud Storage: Store videos in S3/GCS and process in batches
- Silhouette Score > 0.3: Generally indicates reasonable cluster separation
- Calinski-Harabasz Index: Higher values indicate better-defined clusters
- Davies-Bouldin Index < 1.0: Lower values indicate better cluster separation
If clusters are not meaningful:

- Enable normalization: Use the `--normalize` flag to L2-normalize embeddings (often improves results)
- Use cosine similarity: Combine `--normalize` and `--use-cosine` for better similarity search
- Try alternative clustering methods:
  - `--clustering-method dbscan`: Good for uneven cluster sizes; automatically finds the number of clusters
  - `--clustering-method hdbscan`: Hierarchical DBSCAN; handles varying density
  - `--clustering-method agglomerative`: Hierarchical clustering with a specified number of clusters
- Adjust the number of clusters: Try different `--n-clusters` values
- Increase embedding quality:
  - Reduce `--skip-frames` to capture more temporal information
  - Use video-based models (I3D, TimeSformer) instead of frame-level features (future enhancement)
- Manual inspection: Review cluster representatives to understand what's being grouped
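The normalize-then-cluster alternative can be sketched with scikit-learn's DBSCAN (on L2-normalized vectors, `eps` acts as an angular threshold; the data here is synthetic):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Two tight synthetic groups of embeddings plus a few outliers.
a = rng.normal(scale=0.05, size=(30, 64)) + 1.0
b = rng.normal(scale=0.05, size=(30, 64)) - 1.0
outliers = rng.normal(size=(3, 64)) * 5
X = normalize(np.vstack([a, b, outliers]))  # L2-normalize, as with --normalize

labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(X)
print(set(labels))  # cluster ids; -1 marks noise/outliers
```

Unlike K-Means, the number of clusters is not fixed in advance, and isolated points are labeled `-1` rather than forced into a cluster.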
For improved cluster quality, try:

```bash
python video_clustering.py \
  --max-videos 100 \
  --n-clusters 10 \
  --normalize \
  --use-cosine \
  --pca-components 128
```

Or for uneven cluster sizes:

```bash
python video_clustering.py \
  --max-videos 100 \
  --clustering-method dbscan \
  --dbscan-eps 0.3 \
  --dbscan-min-samples 3 \
  --normalize \
  --use-cosine
```

- Out of Memory: Reduce `--max-videos` or increase `--skip-frames`
- Slow Processing: Enable GPU or reduce the number of videos
- No Videos Found: Check the `--data-dir` path and ensure videos are in supported formats (.mp4, .avi, .mov, .webm)
- Split Archive Errors: If you have files like `20bn-something-something-v2-00` and `20bn-something-something-v2-01`, these are split archives. The script will automatically detect and concatenate them when using `--extract-archives`. If extraction fails, the files might be:
  - Raw video files (try without `--extract-archives`)
  - A different archive format (may need manual extraction)
  - Corrupted or incomplete downloads
- Low Cluster Quality Scores:
  - Enable `--normalize` and `--use-cosine` for better similarity
  - Try `--clustering-method dbscan` or `hdbscan` for uneven cluster sizes
  - Adjust `--n-clusters`, reduce `--skip-frames`, or disable PCA with `--no-pca`
- Dimension Mismatch in Queries: The query script automatically handles the PCA transformation. Make sure you're using the same embedding model as during clustering.
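The transformation the query script has to apply can be sketched: the PCA fitted during clustering must reduce the raw query embedding before comparing it against the reduced index vectors (a brute-force NumPy search stands in for FAISS here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
db_raw = rng.random((200, 2048)).astype(np.float32)  # stand-in ResNet embeddings

pca = PCA(n_components=64, random_state=0).fit(db_raw)
db = pca.transform(db_raw)                 # what goes into the FAISS index

query_raw = db_raw[7]                      # a raw 2048-d query embedding
query = pca.transform(query_raw[None, :])  # must match the index's 64 dims
assert query.shape[1] == db.shape[1]

# Brute-force nearest neighbor (FAISS does this faster at scale).
nearest = int(np.argmin(np.linalg.norm(db - query, axis=1)))
print(nearest)  # 7: the query finds itself
```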
This project is provided as-is for research and educational purposes.