This project uses text, audio, and visual cues extracted from a video to classify sentiment as Positive, Negative, or Neutral. It uses BERT, an LSTM, and a CNN to encode features from each modality, then fuses them to predict the sentiment.
- 🔤 Text encoding with BERT (`transformers`)
- 🎧 Audio encoding with MFCC + LSTM
- 🎞️ Visual encoding with CNN (OpenCV)
- 🤖 Multimodal fusion for final sentiment prediction
- 🛠 Extracts and processes audio/video automatically
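The fusion step above can be sketched as a small late-fusion classifier. This is a minimal illustration, not the repository's actual code: it assumes each modality has already been encoded into a fixed-size vector (dimensions here are assumptions, e.g. 768 for a BERT `[CLS]` embedding, 128 for the final LSTM audio state, 256 for pooled CNN visual features), and the `FusionClassifier` class name is hypothetical.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenates per-modality feature vectors and classifies the result.

    Dimensions and architecture are illustrative assumptions, not the
    repository's actual implementation.
    """

    def __init__(self, text_dim=768, audio_dim=128, visual_dim=256, num_classes=3):
        super().__init__()
        # Simple MLP head over the concatenated modality features.
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + audio_dim + visual_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),  # Positive / Negative / Neutral logits
        )

    def forward(self, text_feat, audio_feat, visual_feat):
        # Late fusion: concatenate along the feature dimension, then classify.
        fused = torch.cat([text_feat, audio_feat, visual_feat], dim=-1)
        return self.classifier(fused)

model = FusionClassifier()
# A batch of 4 videos, each represented by its three modality vectors.
logits = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 256))
print(logits.shape)  # one row of 3 sentiment logits per video
```

Concatenation-based late fusion is the simplest option; attention-based or tensor-fusion schemes are common alternatives when modalities interact more strongly.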
- Clone the repository

  ```bash
  git clone https://github.com/your-username/multimodal-sentiment-analysis.git
  cd multimodal-sentiment-analysis
  ```

- Create and activate a virtual environment

  ```bash
  python -m venv venv
  source venv/bin/activate  # or venv\Scripts\activate on Windows
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

To run the code in this repository, `requirements.txt` should contain the following libraries:
- torch
- transformers
- librosa
- opencv-python
- torchvision
- numpy

(`subprocess` and `os` are part of the Python standard library and do not need to be listed in `requirements.txt`.)
After uploading a video, the pipeline automatically extracts the audio and frames and outputs the predicted sentiment.
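The final prediction step can be sketched as follows: the fused model emits one logit per class, which are converted to probabilities with a softmax and mapped to a label. The label order and the `decode_sentiment` helper below are illustrative assumptions, not the repository's actual code.

```python
import numpy as np

# Assumed class order; the repository may use a different mapping.
LABELS = ["Negative", "Neutral", "Positive"]

def decode_sentiment(logits):
    """Map raw classifier logits to a sentiment label and a confidence score."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable softmax: subtract the max before exponentiating.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    idx = int(probs.argmax())
    return LABELS[idx], float(probs[idx])

label, score = decode_sentiment([0.2, -1.1, 2.4])
print(label)  # the class with the largest logit wins
```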
- 👤 Vikash Kumar
