Speech Emotion Recognition in Python Using Machine Learning

Speech emotion recognition, often abbreviated as SER, is the task of recognizing human emotions and affective states from speech. It works by picking up cues such as tone and pitch that carry the speaker’s feelings. With such a system we can predict emotions such as sad, angry, surprised, calm, fearful, neutral, regret, and many more from audio files.

First, we will load the dataset, extract audio features from it, and split it into training and testing sets. Then we will initialize an ML model as a classifier and train it. Finally, we will calculate the accuracy.

In this project, I used a Jupyter notebook for the implementation (install Anaconda or Miniconda to get it).

We are going to need some packages and libraries:

1) NumPy - for linear algebra operations.

2) Scikit-learn - provides many machine learning models, including the classifier used here.

3) Librosa - to extract audio features.

4) SoundFile - to read and write sound files as well as to represent audio data as NumPy arrays.

5) PyAudio - to play or record audio.

So, let’s start with the step-by-step implementation.

Step 1- Installing and Importing packages

Open the Anaconda prompt and type the following commands:

conda install -c numba numba
conda install -c conda-forge librosa
conda install numpy pyaudio scikit-learn==0.19
conda install -c conda-forge pysoundfile

(Try to install scikit-learn version 0.19, or else you will face issues at a later stage.)

Let us import them:

import soundfile
import numpy as np 
import librosa  
import glob 
import os # to use operating system dependent functionality
from sklearn.model_selection import train_test_split # for splitting training and testing 
from sklearn.neural_network import MLPClassifier # multi-layer perceptron model 
from sklearn.metrics import accuracy_score # to measure how good we are
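
Since the later steps are sensitive to library versions, it can help to check what is actually installed before going further. Here is a minimal sketch that uses only the standard library. The helper name installed_versions and the distribution names in REQUIRED are assumptions for illustration; conda and pip sometimes register the same package under different names.

```python
from importlib import metadata

# Distribution names as they are usually registered on PyPI (assumed here).
REQUIRED = ["numpy", "scikit-learn", "librosa", "soundfile", "pyaudio"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as "not installed" should be installed before moving on to the imports.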

Now we need a dataset to train on. There are many datasets, but one of the most commonly used is the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Let’s download it.

After downloading, we need to extract features from the sound files.

Step 2- Extract features from the sound file

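Define a function get_feature that extracts four features from a sound file: MFCC, mel spectrogram, chroma, and spectral contrast. Each feature is averaged over time, so every file yields vectors of the same length regardless of its duration.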

def get_feature(file_name):
    # load the audio as a float array along with its sample rate
    data, sample_rate = librosa.load(file_name)
    # magnitude spectrogram, reused by the chroma and contrast features
    stft = np.abs(librosa.stft(data))
    # average each feature over time to get one fixed-length vector per file
    mfccs = np.mean(librosa.feature.mfcc(y=data, sr=sample_rate, n_mfcc=40).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=data, sr=sample_rate).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T, axis=0)
    return mfccs, mel, chroma, contrast
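
To see what the np.mean(...T, axis=0) step does, here is a small sketch that replaces the librosa outputs with random matrices of the same layout (features in rows, time frames in columns). The row counts are assumptions based on n_mfcc=40 from get_feature and on librosa’s default settings (128 mel bands, 12 chroma bins, 7 spectral contrast bands); the frame count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 50  # number of time frames; in practice this depends on clip length

# Stand-ins for librosa outputs, each shaped (n_features, n_frames).
fake_features = {
    "mfccs": rng.normal(size=(40, n_frames)),
    "mel": rng.normal(size=(128, n_frames)),
    "chroma": rng.normal(size=(12, n_frames)),
    "contrast": rng.normal(size=(7, n_frames)),
}

# Transpose to (n_frames, n_features) and average over axis 0 (time),
# which leaves one value per feature regardless of clip length.
pooled = {name: np.mean(m.T, axis=0) for name, m in fake_features.items()}

# Concatenate the pooled vectors into a single fixed-length feature vector.
vector = np.hstack(list(pooled.values()))

for name, v in pooled.items():
    print(name, v.shape)
print("combined:", vector.shape)
```

Averaging over time throws away how the sound evolves within a clip, but it gives the classifier an input of constant size, which a plain MLP requires.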


Step 3- Assigning labels to emotion

Now we need to define a dictionary that maps the numeric emotion codes used in the dataset’s file names to emotion names, and a second collection holding the emotions that we want to observe.

# emotions in dataset
list_emotion = {
    "01": "neutral",
    "02": "calm",
    "03": "happy",
    "04": "sad",
    "05": "angry",
    "06": "fearful",
    "07": "disgust",
    "08": "surprised"
}

# I am using only 3 emotions to observe; feel free to add more.
classify_emotions = {


Step 4- Training and testing data

Now define a function to load the sound files from our dataset. We use the glob module to get the pathnames of all the sound files; put the full path of your dataset in the glob pattern. For each file we read the emotion label from its name, skip emotions we are not observing, and extract the features. Finally, we call train_test_split with the features, the labels, the test size, and a random state value, and return the result.

def load_data(test_size=0.2):
    feature, y = [], []
    for file in glob.glob("C:\\Users\\Documents\\ravdess data\\Actor_*\\*.wav"):
        basename = os.path.basename(file)  # get the base name of the audio file
        emotion = list_emotion[basename.split("-")[2]]   # get the emotion label
        if emotion not in classify_emotions:    # we allow only classify_emotions we set
            continue
        try:
            mfccs, mel, chroma, contrast = get_feature(file)
        except Exception as e:
            print("Error encountered while parsing file:", file, e)
            continue
        # join the four feature vectors into one row for this file
        ext_features = np.hstack([mfccs, mel, chroma, contrast])
        feature.append(ext_features)
        y.append(emotion)
    # split the data to training and testing and return it
    return train_test_split(np.array(feature), y, test_size=test_size, random_state=9)
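
The label lookup in load_data relies on the RAVDESS file-naming convention: each file name consists of seven two-digit fields separated by hyphens, and the third field is the emotion code. The sketch below spells that out with the standard library only. The function name parse_ravdess_name, the field names, and the example path are chosen here for illustration and are not part of the dataset or any library.

```python
import os

# Same code-to-label mapping as list_emotion in Step 3.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

# Order of the seven fields in a RAVDESS file name.
FIELDS = ("modality", "vocal_channel", "emotion", "intensity",
          "statement", "repetition", "actor")


def parse_ravdess_name(path):
    """Split a RAVDESS file name into its seven numeric fields."""
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"unexpected RAVDESS file name: {path!r}")
    info = dict(zip(FIELDS, parts))
    info["emotion_label"] = EMOTIONS[info["emotion"]]
    return info


# Hypothetical example path, used only for illustration.
example = parse_ravdess_name("Actor_12/03-01-06-01-02-01-12.wav")
print(example["emotion_label"], "spoken by actor", example["actor"])
```

Checking the number of fields before indexing makes a stray file in the dataset folder fail with a clear message instead of an IndexError or KeyError.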

Let’s use 25% of the data for testing and 75% for training by calling the load_data function:

feature_train, feature_test, y_train, y_test = load_data(test_size=0.25)

Now let’s check the number of samples in each split:

# number of samples in each split (rows of the feature matrices)
print("Number of samples in training data:", feature_train.shape[0])

print("Number of samples in testing data:", feature_test.shape[0])


Number of samples in training data:462
Number of samples in testing data:169
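
To get a feel for how train_test_split behaves without loading any audio, here is a small sketch on synthetic data. The array contents and the four labels are made up for illustration; only the call itself mirrors what load_data does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples with 5 features each and 4 labels.
X = np.arange(500, dtype=float).reshape(100, 5)
y = ["happy", "sad", "angry", "calm"] * 25

# Same call shape as in load_data: 25% held out, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=9
)
print("training samples:", X_train.shape[0])
print("testing samples:", X_test.shape[0])

# Reusing the same random_state reproduces exactly the same split.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.25, random_state=9)
print("same split again:", np.array_equal(X_test, X_test2))
```

Fixing random_state is what makes the reported accuracy comparable between runs; with a different seed, different files end up in the test set.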

Step 5- Initialize ML model

It’s time to initialize a multi-layer perceptron (MLP) classifier with its hyperparameters. You can also use an LSTM-based classifier instead (it’s all up to you).

print("Training the model.....")
clf=MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(300,), learning_rate='adaptive', max_iter=500).fit(feature_train, y_train)


Training the model.....
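
If you want to try the classifier before the audio features are ready, the following sketch trains the same kind of model on two synthetic clusters. The data, the labels, the smaller batch_size, and random_state=0 are assumptions made for this toy example; the remaining hyperparameters are copied from the call above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Two well-separated synthetic clusters stand in for real audio features.
X = np.vstack([rng.normal(-2.0, 1.0, size=(60, 4)),
               rng.normal(2.0, 1.0, size=(60, 4))])
y = np.array(["calm"] * 60 + ["angry"] * 60)

# Hyperparameters follow the tutorial, with a smaller batch for the tiny
# dataset and random_state fixed so repeated runs give the same weights.
clf = MLPClassifier(alpha=0.01, batch_size=32, epsilon=1e-08,
                    hidden_layer_sizes=(300,), learning_rate="adaptive",
                    max_iter=500, random_state=0)
clf.fit(X, y)

print("classes:", clf.classes_)
print("prediction:", clf.predict([[2.0, 2.0, 2.0, 2.0]]))
```

Note that in scikit-learn, learning_rate only takes effect with solver="sgd" and epsilon only with solver="adam"; since the default solver is adam, the learning_rate="adaptive" setting is accepted but has no influence here.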

Step 6- Calculate Accuracy

Finally, let’s calculate our accuracy

# predict labels for the 25% of data held out for testing
y_pred = clf.predict(feature_test)

# calculate the accuracy
accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)

print("Accuracy is: {:.2f}%".format(accuracy*100))


Accuracy is:76.56%
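
With its default settings, accuracy_score returns the fraction of test samples whose predicted label matches the true label. The sketch below computes the same quantity by hand on made-up labels, and also counts which emotions get confused with which. The function name accuracy and the label lists are illustrative only.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels, used only to illustrate the calculation.
y_true = ["happy", "sad", "angry", "happy", "sad", "angry", "happy", "sad"]
y_pred = ["happy", "sad", "happy", "happy", "angry", "angry", "happy", "sad"]

print("Accuracy is: {:.2f}%".format(accuracy(y_true, y_pred) * 100))

# Counting (true, predicted) pairs shows which emotions get confused.
mistakes = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
print(mistakes)
```

A single accuracy number hides which classes are hard; looking at the mismatched pairs is a quick way to see whether, say, angry clips are being mistaken for happy ones.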

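And the number of features extracted per file. (With librosa’s default settings, 40 MFCCs, 128 mel bands, 12 chroma bins, and 7 spectral contrast bands add up to 187; the 180 reported below equals the total for MFCC, mel, and chroma alone.)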

print("Number of features:", feature_train.shape[1])


Number of features:180


In this project, we learned to predict emotions using an MLP classifier, used the librosa library to extract features from sound files, and obtained an accuracy of 76.56%.
