Speech Emotion Recognition in Python Using Machine Learning

In this tutorial, we will learn about speech emotion recognition (SER) by building a machine learning model for it.

Speech emotion recognition, often abbreviated as SER, is the act of recognizing human emotions and states from speech. It is an algorithm that recognizes hidden feelings through tone and pitch. Using this system, we will be able to predict emotions such as sad, angry, surprised, calm, fearful, neutral, and many more from audio files.

Speech recognition is the technology used to recognize speech from audio signals with the help of various techniques and methodologies; recognizing emotion from speech signals is called speech emotion recognition. The emotion of speech can be recognized by extracting features from it. By extracting features from a speech dataset and training a machine learning model on them, we can build a speech emotion recognizer (SER). There are many applications of SER, such as surveys, recommendation systems, and customer care services.

We will do this same task in two different ways. In the first one, we will use pyaudio; in the second one, we will not use this module. So check out both of these methods.

 

First, we will load the dataset and extract audio features from it, then split it into training and testing sets. Next, we will initialize an ML model as a classifier and train it. Finally, we will calculate the accuracy.

In this project, I have used a Jupyter notebook for the implementation (install Anaconda or Miniconda to get it).

We are going to need some packages and libraries:

1) Numpy - for linear algebraic operations.

2) Scikit-learn - includes many statistical models.

3) Librosa - to extract audio features.

4) Soundfile - to read and write sound files, as well as to represent audio data as NumPy arrays.

5) pyAudio - to play or record audio.

So, let’s start with the step-by-step implementation.

Step 1- Installing and Importing packages

Open the Anaconda prompt and type the following commands:

conda install -c numba numba
conda install -c conda-forge librosa
conda install numpy pyaudio scikit-learn==0.19
conda install -c conda-forge pysoundfile
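
If you are not using Anaconda, the equivalent pip installs should work too (a hedged alternative; the PyPI package names below are the usual ones, but PyAudio may need extra system packages on some platforms):

pip install numpy pyaudio scikit-learn==0.19 librosa soundfile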

Let us import them:

(Try to install scikit-learn version 0.19, or else you will face issues at a later stage.)

import soundfile
import numpy as np 
import librosa  
import glob 
import os # to use operating system dependent functionality
from sklearn.model_selection import train_test_split # for splitting training and testing 
from sklearn.neural_network import MLPClassifier # multi-layer perceptron model 
from sklearn.metrics import accuracy_score # to measure how good we are

Now we need a dataset to train on. There are many datasets available, but the most commonly used one is the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Let’s download it.


After downloading it, we need to extract features from the sound files.

Step 2- Extract features from the sound file

Define a function get_feature to extract features such as MFCC, mel, chroma, and contrast from sound files.

def get_feature(file_name):
    # load the audio file and compute its short-time Fourier transform
    data, sample_rate = librosa.load(file_name)
    stft = np.abs(librosa.stft(data))
    # take the mean of each feature over time
    mfccs = np.mean(librosa.feature.mfcc(y=data, sr=sample_rate, n_mfcc=40).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=data, sr=sample_rate).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T, axis=0)
    return mfccs, mel, chroma, contrast
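
As a quick sanity check (optional, not part of the main flow), you can run get_feature on any one file from the dataset and inspect the shapes of the returned arrays; the path below is a placeholder for a real RAVDESS file:

# placeholder path; point this at any .wav file from the dataset
mfccs, mel, chroma, contrast = get_feature("C:\\Users\\Documents\\ravdess data\\Actor_01\\03-01-04-01-01-01-01.wav")
print(mfccs.shape, mel.shape, chroma.shape, contrast.shape)
# with librosa's defaults this should print (40,) (128,) (12,) (7,)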

 

Step 3- Assigning labels to emotions

Now, we need to define a dictionary to map the emotion codes contained in the dataset’s file names to emotion labels, and a set to hold the emotions that we want to observe.

# emotions in dataset
list_emotion = {
    "01": "neutral",
    "02": "calm",
    "03": "happy",
    "04": "sad",
    "05": "angry",
    "06": "fearful",
    "07": "disgust",
    "08": "surprised"
}

# I am using only 3 emotions to observe; feel free to add more.
classify_emotions = {
    "sad",
    "happy",
    "surprised"
}
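
To see how these two structures work together, note that RAVDESS encodes the emotion as the third hyphen-separated field of the file name. A minimal sketch, using a made-up but correctly formatted file name:

# hypothetical RAVDESS-style file name: the third field ("04") is the emotion code
basename = "03-01-04-01-02-01-12.wav"
emotion_code = basename.split("-")[2]
print(list_emotion[emotion_code])                       # sad
print(list_emotion[emotion_code] in classify_emotions)  # True, so we would keep this file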

 

Step 4- Training and testing data

Now define a function to load the sound files from our dataset. We use the glob module to get the pathnames of all the sound files (put the full path of the dataset in the glob parameter). Then we call train_test_split with the features, the labels, the test size, and a random state value, and return the result.

def load_data(test_size=0.2):
    feature, y = [], []
    for file in glob.glob("C:\\Users\\Documents\\ravdess data\\Actor_*\\*.wav"):
        basename = os.path.basename(file)  # get the base name of the audio file
        emotion = list_emotion[basename.split("-")[2]]  # get the emotion label
        if emotion not in classify_emotions:  # we allow only the classify_emotions we set
            continue
        try:
            mfccs, mel, chroma, contrast = get_feature(file)
        except Exception as e:
            print("Error encountered while parsing file:", file)
            continue
        ext_features = np.hstack([mfccs, mel, chroma, contrast])
        feature.append(ext_features)
        y.append(emotion)
    # split the data into training and testing sets and return it
    return train_test_split(np.array(feature), y, test_size=test_size, random_state=9)

Let’s split the data into 75% training data and 25% testing data using the function load_data:

feature_train, feature_test, y_train, y_test = load_data(test_size=0.25)

Now let’s count the samples:

# counting the samples returned by load_data()
print("Number of samples in training data:", feature_train.shape[0])

print("Number of samples in testing data:", feature_test.shape[0])

Output:

Number of samples in training data: 462
Number of samples in testing data: 169

Step 5- Initialize ML model

It’s time to initialize a multi-layer perceptron classifier (MLP) with its hyperparameters. You could also use an LSTM classifier instead (it’s all up to you).

print("Training the model.....")
clf=MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(300,), learning_rate='adaptive', max_iter=500).fit(feature_train, y_train)

Output:

Training the model.....

Step 6- Calculate Accuracy

Finally, let’s calculate our accuracy:

# predict on the 25% test split
y_pred = clf.predict(feature_test)

# calculate the accuracy
accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)

print("Accuracy is: {:.2f}%".format(accuracy*100))

Output:

Accuracy is: 76.56%

And here is the number of features extracted:

print("Number of features:", feature_train.shape[1])

Output:

Number of features: 180
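
Accuracy alone hides which emotions get mistaken for each other. As an optional extra (not part of the original walk-through), scikit-learn's confusion_matrix gives a per-emotion breakdown:

from sklearn.metrics import confusion_matrix

# rows are true labels, columns are predicted labels, in the order given
labels = ["sad", "happy", "surprised"]
print(confusion_matrix(y_test, y_pred, labels=labels))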

The second way of making a machine learning model for SER

Libraries of Python used in SER

Here, we are using the Python language for programming, with the following libraries.

  • Soundfile: Soundfile is a Python package to read audio files of different formats, for example WAV, FLAC, OGG, and MAT files.
  • Librosa: Librosa is a Python package for audio and music analysis, for example feature extraction and manipulation, segmentation, visualization, and display.
  • Os: os is a Python package for using operating system functionality, for example obtaining the base name of a file and opening it in different modes like read, write, and append.
  • Glob: glob is a Python package for finding paths or pathnames of files that match some specific pattern, for example all files with the .wav extension.
  • Pickle: pickle is a Python package that implements a binary protocol for serializing and de-serializing Python object structures.
  • Numpy: NumPy is a Python package for scientific computation, for example performing different operations on matrices.
  • Sklearn: scikit-learn is a Python package for performing different machine learning operations, for example predicting unknown future values.

Implementation of speech emotion recognition

Importing libraries

We need some dependencies for SER; therefore, import the libraries used for making the SER.

# importing libraries
import soundfile as sf    # to read audio files
import librosa            # for feature extraction
import os                 # to obtain the files
import glob               # to obtain file names having the same pattern
import pickle             # to save the model
import numpy as np
from sklearn.model_selection import train_test_split  # to split train and test data
from sklearn.neural_network import MLPClassifier  # multi-layer perceptron classifier model
from sklearn.metrics import accuracy_score  # to measure the accuracy

Feature extraction

To analyze the emotion, we need to extract features from the audio, so we use the library librosa. We are extracting the MFCC, chroma, and mel features from the sound file.

MFCC: Mel-frequency cepstral coefficients capture the identifying content of the audio and discard other stuff like noise.

Chroma: used for the harmonic and melodic characteristics of music; it meaningfully characterizes pitches in 12 different categories.

Mel: the mel spectrogram of the audio.

We open the file with soundfile.SoundFile and read the sound from it, and use samplerate to obtain the sample rate. If chroma is true, we compute the short-time Fourier transform of the sound. After that, we extract each feature with librosa.feature and take its mean value. Each feature is appended to the feature array by calling np.hstack(), and the combined features are returned at the end of the function.

# extracting the mfcc, chroma, and mel features from a sound file
def feature_extraction(fileName, mfcc, chroma, mel):
    with sf.SoundFile(fileName) as file:
        sound = file.read(dtype='float32')  # reading the sound file
        sample_rate = file.samplerate       # finding the sample rate of the sound
        if chroma:                          # if chroma is true then find the stft
            stft = np.abs(librosa.stft(sound))
        feature = np.array([])              # initializing the feature array
        if mfcc:
            mfcc = np.mean(librosa.feature.mfcc(y=sound, sr=sample_rate, n_mfcc=40).T, axis=0)
            feature = np.hstack((feature, mfcc))
        if chroma:
            chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
            feature = np.hstack((feature, chroma))
        if mel:
            mel = np.mean(librosa.feature.melspectrogram(y=sound, sr=sample_rate).T, axis=0)
            feature = np.hstack((feature, mel))
        return feature  # return the features extracted from the audio

Dataset

Here, we are using the RAVDESS dataset. In this dataset, there are recordings from 24 actors with different emotions. You can use any dataset from the internet; just search for “SER dataset”. The emotions we want are happy, sad, angry, and neutral.

#All available emotion in dataset
int_emotion = {
    "01": "neutral",
    "02": "calm",
    "03": "happy",
    "04": "sad",
    "05": "angry",
    "06": "fearful",
    "07": "disgust",
    "08": "surprised"
}
#Emotions we want to observe
EMOTIONS = {"happy","sad","neutral","angry"}

Now, we get the train and test data from the function train_test_data(), which builds the training and testing data as required. We make two arrays to hold the features and their emotions. We use glob to find all sound files with the pattern “data/Actor_*/*.wav”. The third field in a sound file’s name is the emotion code, which we map to a label via int_emotion. If the emotion is not one of the emotions we want, we continue to the next file. Otherwise, we extract the features with the feature_extraction() function, store them in the features array, and store the emotion in the emotions array. In the end, the function returns the data split into train and test sets.

# making and splitting train and test data
def train_test_data(test_size=0.3):
    features, emotions = [], []  # initializing features and emotions
    for file in glob.glob("data/Actor_*/*.wav"):
        fileName = os.path.basename(file)  # obtaining the file name
        emotion = int_emotion[fileName.split("-")[2]]  # getting the emotion
        if emotion not in EMOTIONS:
            continue
        feature = feature_extraction(file, mfcc=True, chroma=True, mel=True)  # extracting features from audio
        features.append(feature)
        emotions.append(emotion)
    return train_test_split(np.array(features), emotions, test_size=test_size, random_state=7)  # returning the data split into train and test sets

We obtain the train and test data from train_test_data(). Here, the test size is 30%.
#dataset
X_train,X_test,y_train,y_test=train_test_data(test_size=0.3)
print("Total number of training sample: ",X_train.shape[0])
print("Total number of testing example: ",X_test.shape[0])
print("Feature extracted",X_train.shape[1])

Preparing Model

Initialize the multilayer perceptron classifier model.

# initializing the multi layer perceptron model
model = MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08,
                      hidden_layer_sizes=(400,), learning_rate='adaptive',
                      max_iter=1000)

Fitting the training data to the model.

#fitting the training data into model
print("__________Training the model__________")
model.fit(X_train,y_train)

Obtaining the predicted value for the test set.

#Predicting the output value for testing data
y_pred = model.predict(X_test)

Now we check the accuracy of the model with the accuracy score, to evaluate it.

#calculating accuracy
accuracy = accuracy_score(y_true=y_test,y_pred=y_pred)
accuracy*=100
print("accuracy: {:.4f}%".format(accuracy))

Saving our model for future use.

#saving the model 
if not os.path.isdir("model"): 
   os.mkdir("model") 
pickle.dump(model, open("model/mlp_classifier.model", "wb"))
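
Later, you can load the saved model back with pickle and reuse it on new audio. A minimal sketch, where new_audio.wav is a hypothetical file you want to classify:

# loading the saved model and predicting the emotion of a new file
loaded_model = pickle.load(open("model/mlp_classifier.model", "rb"))
new_feature = feature_extraction("new_audio.wav", mfcc=True, chroma=True, mel=True)  # hypothetical file
print(loaded_model.predict([new_feature]))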


Conclusion:

In this tutorial, we learned the following topics:

  • What is speech emotion recognition?
  • Introduction of some Python libraries.
  • Implementation of speech emotion recognition.

In this project, we learned to predict emotions using an MLP classifier, used the librosa library to extract features from sound files, and obtained an accuracy of 76.56%.
