Voice Emotion Detection
Overview:
The Voice Emotion Detection project is an AI-based system that identifies human emotions (such as happiness, anger, sadness, fear, surprise, or neutrality) from voice recordings or real-time audio input.
Using speech signal processing, feature extraction (MFCC, Chroma, Spectral Contrast), and Machine Learning/Deep Learning models, the system analyzes tone, pitch, intensity, and rhythm of speech to determine the speaker’s emotional state.
This technology has wide applications in call centers, healthcare, human-computer interaction, sentiment analysis, and virtual assistants — making machines capable of understanding human emotions through voice.
Objectives:
- To develop an AI system that can detect and classify emotions from speech.
- To extract and analyze audio features for emotion recognition.
- To build and train a Machine Learning or Deep Learning model for classifying emotions.
- To demonstrate real-time voice emotion detection with an intuitive interface.
Key Features:
- Emotion Classification: Detects multiple emotions such as happy, sad, angry, fearful, calm, or neutral.
- Audio Input Support: Accepts recorded voice clips or real-time microphone input.
- Feature Extraction: Uses Mel Frequency Cepstral Coefficients (MFCC), Chroma, and Spectral features (a short extraction sketch follows this list).
- AI-Powered Model: Employs CNN, RNN, or LSTM networks for emotion classification.
- Graphical Output: Displays the predicted emotion with probability percentages.
- Dataset Training: Trained on emotion datasets such as RAVDESS, TESS, or SAVEE.
- Interactive Interface: Simple web interface to record and analyze user speech.
- Model Visualization: Shows accuracy and loss graphs during training.
- Real-Time Processing: Instant prediction after the user speaks or uploads audio.
- Offline Capability: Works without an internet connection once the model is trained.
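The sketch below shows how the MFCC, Chroma, and Spectral Contrast features mentioned above can be pooled into a single fixed-length vector per clip using Librosa. It is a minimal illustration rather than the project's final pipeline: the function name extract_features, the 16 kHz sample rate, and the 40-coefficient MFCC setting are assumptions chosen for the example.

```python
import numpy as np
import librosa

def extract_features(path, sr=16000):
    """Return one pooled feature vector (MFCC + Chroma + Spectral Contrast) for a clip."""
    # Load as mono at a fixed sample rate so every clip is comparable.
    y, sr = librosa.load(path, sr=sr, mono=True)

    # 40 MFCCs summarize the spectral envelope (timbre / tone quality).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

    # Chroma captures pitch-class energy; spectral contrast captures peak/valley energy per band.
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)

    # Average each feature over time to get a single vector per clip (40 + 12 + 7 = 59 values).
    return np.concatenate([mfcc.mean(axis=1), chroma.mean(axis=1), contrast.mean(axis=1)])

# Example: features = extract_features("path/to/clip.wav")  # -> NumPy array of shape (59,)
```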
Tech Stack:
- Frontend: HTML, CSS, Bootstrap, JavaScript (with the Web Audio API for recording)
- Backend: Python (Flask / Django) / Node.js
- Machine Learning / Deep Learning:
  - Libraries: TensorFlow, Keras, Librosa, scikit-learn, NumPy, Pandas, Matplotlib
  - Techniques: Audio Feature Extraction, Classification
  - Models: CNN, RNN, LSTM, or Hybrid Deep Learning models (a model-definition sketch follows this list)
- Datasets:
  - RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)
  - TESS (Toronto Emotional Speech Set)
  - SAVEE (Surrey Audio-Visual Expressed Emotion)
- Database (optional): MySQL / Firebase (for storing results or user data)
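As referenced in the Models entry above, here is a minimal Keras sketch of a 1D-CNN classifier over the pooled feature vectors. The layer sizes, the 59-dimensional input (matching the extraction sketch above), and the six emotion classes are illustrative assumptions, not a fixed architecture; an RNN/LSTM variant would consume frame-wise features instead of a pooled vector.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES = 59   # 40 MFCC + 12 Chroma + 7 Spectral Contrast (see the extraction sketch)
NUM_CLASSES = 6     # e.g., happy, sad, angry, fearful, calm, neutral

def build_model():
    """A small 1D-CNN over the pooled feature vector; purely an illustrative architecture."""
    model = keras.Sequential([
        layers.Input(shape=(NUM_FEATURES, 1)),
        layers.Conv1D(64, kernel_size=5, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, activation="relu"),
        layers.GlobalAveragePooling1D(),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),  # one probability per emotion
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",  # integer emotion labels
                  metrics=["accuracy"])
    return model

# Training (X has shape (num_clips, 59), y holds integer labels 0..5):
# history = build_model().fit(X[..., None], y, validation_split=0.2, epochs=50, batch_size=32)
# history.history["accuracy"] and history.history["loss"] feed the Matplotlib training graphs.
```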
Workflow:
- Data Collection:
  - Collect pre-labeled emotional voice samples from datasets (e.g., RAVDESS).
- Preprocessing:
  - Convert audio to mono at a fixed sample rate.
  - Extract features such as MFCC, Chroma, Spectral Centroid, and Zero-Crossing Rate.
- Model Training:
  - Train ML/DL models on the extracted features and their emotion labels.
  - Use algorithms such as CNN, RNN, or SVM for classification.
- Prediction Phase:
  - Record or upload a new audio sample.
  - Extract features and classify the emotion with the trained model (see the prediction sketch after this list).
- Output Visualization:
  - Display the detected emotion (e.g., “Happy”) with a confidence score.
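The prediction sketch referenced above ties the workflow together: it reuses the same pooled features as training (repeated here so the snippet runs standalone), loads a saved model, and returns the top emotion with its confidence. The file name emotion_model.keras, the label ordering, and the feature settings are assumptions for illustration and must match whatever was actually used during training.

```python
import numpy as np
import librosa
from tensorflow import keras

# Label order must match the one used during training; this ordering is illustrative.
EMOTIONS = ["angry", "calm", "fearful", "happy", "neutral", "sad"]

def extract_features(path, sr=16000):
    """Same pooled MFCC + Chroma + Spectral Contrast vector used at training time."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    return np.concatenate([mfcc.mean(axis=1), chroma.mean(axis=1), contrast.mean(axis=1)])

def predict_emotion(audio_path, model_path="emotion_model.keras"):
    """Classify one clip and return (label, confidence) from the trained model."""
    model = keras.models.load_model(model_path)        # model saved after training
    features = extract_features(audio_path)            # shape: (59,)
    x = features[np.newaxis, :, np.newaxis]            # shape: (1, 59, 1) for the 1D-CNN sketch
    probs = model.predict(x, verbose=0)[0]             # softmax probabilities per emotion
    best = int(np.argmax(probs))
    return EMOTIONS[best], float(probs[best])

# Example (e.g., inside a Flask route after the browser uploads a recording):
# label, confidence = predict_emotion("recording.wav")
# print(f"Detected emotion: {label} ({confidence:.0%})")
```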