Speech Emotion Recognition in Python Using Machine Learning
In this tutorial, we learn about speech emotion recognition (SER) and build a machine learning model for it.
Speech emotion recognition, often abbreviated as SER, is the act of recognizing human emotions and states from speech. It is an algorithm that recognizes hidden feelings through tone and pitch. Using this system, we will be able to predict emotions such as sad, angry, surprised, calm, fearful, neutral, and many more from audio files.
Speech recognition is the technology used to recognize speech from audio signals with the help of various techniques and methodologies; recognizing emotion from speech signals is called speech emotion recognition. The emotion of the speech can be recognized by extracting features from it. By extracting features from a speech dataset and training a machine learning model on them, we can build a speech emotion recognizer (SER). SER has many applications, such as surveys, recommendation systems, and customer care services.
We will do this same task in two different ways. In the first one, we will use pyaudio; in the second one, we will not use this module. So check out both of these methods.
First, we will load the dataset and extract audio features from it, then split the data into training and testing sets. Next, we will initialize an ML model as a classifier and train it. Finally, we will calculate the accuracy.
In this project, I have used a Jupyter notebook for the implementation (install Anaconda or Miniconda for this).
We are going to need some packages and libraries:
- Numpy: for linear algebraic operations.
- Scikit-learn: includes many statistical models.
- Librosa: to extract audio features.
- Soundfile: to read and write sound files, as well as to represent audio data as a NumPy array.
- PyAudio: to play or record audio.
So, let’s start with the step-by-step implementation.
Step 1- Installing and Importing packages
Open the Anaconda prompt and type the following commands:
conda install -c numba numba
conda install -c conda-forge librosa
conda install numpy pyaudio scikit-learn==0.19
conda install -c conda-forge pysoundfile
Let us import them
(Try to install scikit-learn version 0.19, or else you will face issues at a later stage.)
import soundfile
import numpy as np
import librosa
import glob
import os  # to use operating system dependent functionality
from sklearn.model_selection import train_test_split  # for splitting training and testing
from sklearn.neural_network import MLPClassifier  # multi-layer perceptron model
from sklearn.metrics import accuracy_score  # to measure how good we are
Now we need a dataset to train on. There are many datasets, but the most commonly used is the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Let’s download it.
You may also read:
Voice Command Calculator in Python using speech recognition and PyAudio
Text-To-Speech conversion in Python
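Each RAVDESS file name encodes metadata in seven two-digit, hyphen-separated fields (modality, vocal channel, emotion, intensity, statement, repetition, actor), and the third field is the emotion code our code will rely on later. A minimal sketch of reading it (the sample file name below is just an illustration):

# sample RAVDESS-style file name; the third field ("06") is the emotion code
filename = "03-01-06-01-02-01-12.wav"
emotion_code = filename.split("-")[2]
print("Emotion code:", emotion_code)  # prints "06", which maps to "fearful" in the dictionary we define later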
After downloading, we need to extract features from the sound file.
Step 2- Extract features from the sound file
Define a function get_feature that takes a file name and extracts features such as MFCC, mel, chroma, and spectral contrast from the sound file.
def get_feature(file_name):
    data, sample_rate = librosa.load(file_name)
    stft = np.abs(librosa.stft(data))
    mfccs = np.mean(librosa.feature.mfcc(y=data, sr=sample_rate, n_mfcc=40).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=data, sr=sample_rate).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T, axis=0)
    return mfccs, mel, chroma, contrast
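As a quick sanity check, you can call get_feature on a single clip (the path below is hypothetical) and inspect the shapes of the returned vectors; with librosa’s default settings they should hold 40 MFCCs, 128 mel bands, 12 chroma bins, and 7 contrast values:

# hypothetical path to one clip from the dataset
mfccs, mel, chroma, contrast = get_feature("C:\\Users\\Documents\\ravdess data\\Actor_01\\03-01-06-01-02-01-01.wav")
print(mfccs.shape, mel.shape, chroma.shape, contrast.shape)  # e.g. (40,) (128,) (12,) (7,)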
Step 3- Assigning labels to emotion
Now, we need to define a dictionary that maps the numbers contained in the dataset’s file names to emotions, and a set holding the emotions that we want to observe.
# emotions in dataset
list_emotion = {
    "01": "neutral",
    "02": "calm",
    "03": "happy",
    "04": "sad",
    "05": "angry",
    "06": "fearful",
    "07": "disgust",
    "08": "surprised"
}

# I am using only 3 emotions to observe, feel free to add more.
classify_emotions = {
    "sad",
    "happy",
    "surprised"
}
Step 4- Training and testing data
Now define a function to load sound files from our dataset. We use the glob module to get all the pathnames of the sound files; put the full path of your dataset in the glob parameter. Finally, we call train_test_split with the features, the labels, the test size, and a random state value, and return the result.
def load_data(test_size=0.2):
    feature, y = [], []
    for file in glob.glob("C:\\Users\\Documents\\ravdess data\\Actor_*\\*.wav"):
        basename = os.path.basename(file)  # get the base name of the audio file
        emotion = list_emotion[basename.split("-")[2]]  # get the emotion label
        if emotion not in classify_emotions:  # we allow only the classify_emotions we set
            continue
        try:
            mfccs, mel, chroma, contrast = get_feature(file)
        except Exception as e:
            print("Error encountered while parsing file: ", file)
            continue
        ext_features = np.hstack([mfccs, mel, chroma, contrast])
        feature.append(ext_features)
        y.append(emotion)
    # split the data into training and testing sets and return it
    return train_test_split(np.array(feature), y, test_size=test_size, random_state=9)
Let’s load the data, keeping 75% for training and 25% for testing, using the load_data function:
feature_train, feature_test, y_train, y_test = load_data(test_size=0.25)
Now let’s check the number of samples:
# number of samples in the split returned by load_data()
print("Number of samples in training data:", feature_train.shape[0])
print("Number of samples in testing data:", feature_test.shape[0])
Output:
Number of samples in training data: 462
Number of samples in testing data: 169
Step 5- Initialize ML model
It’s time to initialize a multi-layer perceptron (MLP) classifier with its hyperparameters. You could also use an LSTM classifier; it’s all up to you.
print("Training the model.....") clf=MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(300,), learning_rate='adaptive', max_iter=500).fit(feature_train, y_train)
Output:
Training the model.....
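These hyperparameters are a reasonable starting point rather than a tuned optimum. If you want to search for better values, here is a minimal sketch using scikit-learn’s GridSearchCV (the parameter grid is an assumption for illustration, not taken from this tutorial):

from sklearn.model_selection import GridSearchCV

# hypothetical search grid; widen or narrow it as you see fit
param_grid = {
    "hidden_layer_sizes": [(150,), (300,), (300, 100)],
    "alpha": [0.0001, 0.01],
}
search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=3)
search.fit(feature_train, y_train)
print("Best parameters:", search.best_params_)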
Step 6- Calculate Accuracy
Finally, let’s calculate our accuracy
# predict on the 25% of data held out for testing
y_pred = clf.predict(feature_test)

# calculate the accuracy
accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)

print("Accuracy is: {:.2f}%".format(accuracy*100))
Output:
Accuracy is: 76.56%
And the number of features extracted
print("Number of features:", feature_train.shape[1])
Output:
Number of features: 180
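Accuracy alone does not show which emotions get mistaken for which. Before moving on, an optional check with scikit-learn’s confusion_matrix, reusing the y_test and y_pred from above:

from sklearn.metrics import confusion_matrix

# rows are true labels, columns are predicted labels
labels = ["sad", "happy", "surprised"]
print(confusion_matrix(y_test, y_pred, labels=labels))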
The second way of making a machine learning model for SER
Libraries of Python used in SER
Here, we are using the Python language for programming, with the following libraries.
- Soundfile: a Python package to read audio files of different formats, for example WAV, FLAC, OGG, and MAT files.
- Librosa: a Python package for audio and music analysis, for example feature extraction and manipulation, segmentation, visualization, and display.
- Os: a Python module for using operating system functionality, for example obtaining the base name of a file and opening files in different modes like read, write, and append.
- Glob: a Python module for finding paths or pathnames of files matching a specific pattern, for example all files with the .wav extension.
- Pickle: a Python module that implements a binary protocol for serializing and de-serializing Python object structures.
- Numpy: a Python package for scientific computation, for example performing different operations on matrices.
- Sklearn: a Python package for performing different machine learning operations, for example predicting unknown future values.
Implementation of speech emotion recognition
Importing libraries
We need some dependencies for SER, so let’s import the libraries we will use.
# importing libraries
import soundfile as sf  # to read audio files
import librosa  # for feature extraction
import os  # to obtain the file
import glob  # to obtain file names having the same pattern
import pickle  # to save the model
import numpy as np
from sklearn.model_selection import train_test_split  # to split train and test data
from sklearn.neural_network import MLPClassifier  # multi-layer perceptron classifier model
from sklearn.metrics import accuracy_score  # to measure the accuracy
Feature extraction
To analyze the emotion, we need to extract features from the audio, so we use the Librosa library. We extract the mfcc, chroma, and mel features from the sound file.
Mfcc: Mel-frequency cepstral coefficients; they capture the characteristics of the audio and discard other stuff like noise.
Chroma: captures the harmonic and melodic characteristics of music, meaningfully characterizing pitches in 12 different categories.
Mel: the mel spectrogram of the audio.
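To get a feel for these features, you can compute them for one clip and look at their shapes. With librosa’s default settings, this gives 40 MFCCs, 12 chroma bins, and 128 mel bands per frame, so the mean-pooled feature vector built below ends up with 40 + 12 + 128 = 180 values (the file path here is hypothetical):

import librosa

# hypothetical path to one clip from the dataset
sound, sample_rate = librosa.load("data/Actor_01/03-01-03-01-01-01-01.wav")

mfcc = librosa.feature.mfcc(y=sound, sr=sample_rate, n_mfcc=40)
chroma = librosa.feature.chroma_stft(y=sound, sr=sample_rate)
mel = librosa.feature.melspectrogram(y=sound, sr=sample_rate)

# each is (n_coefficients, n_frames); averaging over frames gives one vector per feature
print(mfcc.shape, chroma.shape, mel.shape)  # e.g. (40, T) (12, T) (128, T)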
We open the file with soundfile.SoundFile and read the sound from it, using samplerate to obtain the sample rate. If chroma is true, we compute the short-time Fourier transform of the sound. After that, we extract each feature with librosa.feature and take its mean value over time. We then store the features by calling np.hstack(), which stacks them into the single array the function returns at the end.
# extracting features mfcc, chroma, mel from sound file
def feature_extraction(fileName, mfcc, chroma, mel):
    with sf.SoundFile(fileName) as file:
        sound = file.read(dtype='float32')  # reading the sound file
        sample_rate = file.samplerate  # finding the sample rate of the sound
        if chroma:  # if chroma is true then finding stft
            stft = np.abs(librosa.stft(sound))
        feature = np.array([])  # initializing feature array
        if mfcc:
            mfcc = np.mean(librosa.feature.mfcc(y=sound, sr=sample_rate, n_mfcc=40).T, axis=0)
            feature = np.hstack((feature, mfcc))
        if chroma:
            chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
            feature = np.hstack((feature, chroma))
        if mel:
            mel = np.mean(librosa.feature.melspectrogram(y=sound, sr=sample_rate).T, axis=0)
            feature = np.hstack((feature, mel))
    return feature  # return feature extracted from audio
Dataset
Here, we are using the RAVDESS dataset. In this dataset, there are the voices of 24 actors with different emotions. You can use any dataset from the internet; just search for “SER dataset”. The emotions we want are happy, sad, angry, and neutral.
# all available emotions in the dataset
int_emotion = {
    "01": "neutral",
    "02": "calm",
    "03": "happy",
    "04": "sad",
    "05": "angry",
    "06": "fearful",
    "07": "disgust",
    "08": "surprised"
}

# emotions we want to observe
EMOTIONS = {"happy", "sad", "neutral", "angry"}
Now, we get the train and test data from the function train_test_data(), which builds the training and testing data as required. We make two arrays to hold the features and their emotions. We use glob to find all sound files with the pattern “data/Actor_*/*.wav”. The third number in the file name is the emotion number, which we map through int_emotion. If the emotion is not one of the emotions we want, we continue to the next file. We extract the features with the feature_extraction() function, storing them in the features array and the emotion in the emotions array. In the end, the function returns the data split into train and test sets.
# making and splitting train and test data
def train_test_data(test_size=0.3):
    features, emotions = [], []  # initializing features and emotions
    for file in glob.glob("data/Actor_*/*.wav"):
        fileName = os.path.basename(file)  # obtaining the file name
        emotion = int_emotion[fileName.split("-")[2]]  # getting the emotion
        if emotion not in EMOTIONS:
            continue
        feature = feature_extraction(file, mfcc=True, chroma=True, mel=True)  # extracting features from audio
        features.append(feature)
        emotions.append(emotion)
    # returning the data split into train and test sets
    return train_test_split(np.array(features), emotions, test_size=test_size, random_state=7)

We obtain the train and test data from train_test_data(). Here, the test size is 30%.
# dataset
X_train, X_test, y_train, y_test = train_test_data(test_size=0.3)
print("Total number of training samples: ", X_train.shape[0])
print("Total number of testing samples: ", X_test.shape[0])
print("Features extracted: ", X_train.shape[1])
Preparing Model
Initialize the multi-layer perceptron classifier model.
# initializing the multi-layer perceptron model
model = MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08,
                      hidden_layer_sizes=(400,), learning_rate='adaptive',
                      max_iter=1000)
Fitting data into the model.
# fitting the training data into the model
print("__________Training the model__________")
model.fit(X_train, y_train)
Obtaining the predicted values for the test set.
# predicting the output values for the testing data
y_pred = model.predict(X_test)
Now we check the accuracy of the model with the accuracy score, to evaluate it.
# calculating accuracy
accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
accuracy *= 100
print("accuracy: {:.4f}%".format(accuracy))
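Beyond a single accuracy number, scikit-learn’s classification_report shows per-emotion precision and recall, which is often more informative for a multi-class problem (an optional addition, not part of the original pipeline):

from sklearn.metrics import classification_report

# per-class precision, recall, and F1 for the four emotions we kept
print(classification_report(y_test, y_pred))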
Saving our model for future use.
# saving the model
if not os.path.isdir("model"):
    os.mkdir("model")
with open("model/mlp_classifier.model", "wb") as f:
    pickle.dump(model, f)
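To reuse the saved model later, for example in another script, here is a minimal loading sketch (it assumes the features are extracted exactly as above):

# loading the saved model and predicting with it
with open("model/mlp_classifier.model", "rb") as f:
    loaded_model = pickle.load(f)
y_pred = loaded_model.predict(X_test)  # X_test must come from the same feature_extraction pipeline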
Conclusion:
In this tutorial, we learned the following topics:
- What is speech emotion recognition?
- Introduction of some Python libraries.
- Implementation of speech emotion recognition.
In this project, we learned to predict emotions using the MLP classifier, used the librosa library to extract features from sound files, and obtained an accuracy of 76.56%.