Speech Emotion Recognition in Python

Hey ML enthusiasts, how can a machine judge your mood from your speech the way humans do?

In this article, we are going to build a Speech Emotion Recognition system. You should download the dataset and notebook so that you can follow along with the article for better understanding.


Libraries used:

  • Keras
  • Librosa (for audio visualisation and feature extraction)


Audio can be visualized as waves passing over time, and by using their values we can build a classification system. Below is the waveform of one of the audio files in the dataset.

[Image: waveform of one of the audio files in the dataset]

We are going to represent our audio in the form of 3 features:

  • MFCC: Mel Frequency Cepstral Coefficients, which represent the short-term power spectrum of a sound.
  • Chroma: Represents the 12 different pitch classes.
  • Mel: Mel-scaled spectrogram frequencies.

Python Program: Speech Emotion Recognition

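import librosa as ls
import numpy as np

def extract_feature(file_name, mfcc, chroma, mel):
        # Load the audio as a float waveform X together with its sample rate
        X,sample_rate = ls.load(file_name)
        # Chroma is computed from the magnitude of the short-time Fourier transform
        if chroma:
            stft=np.abs(ls.stft(X))
        result=np.array([])
        if mfcc:
            # Average each of the 40 MFCCs over time
            mfccs=np.mean(ls.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
            result=np.hstack((result, mfccs))
        if chroma:
            chroma=np.mean(ls.feature.chroma_stft(S=stft, sr=sample_rate).T,axis=0)
            result=np.hstack((result, chroma))
        if mel:
            mel=np.mean(ls.feature.melspectrogram(y=X, sr=sample_rate).T,axis=0)
            result=np.hstack((result, mel))
        return result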

In the above code, we have defined a function that extracts the audio features discussed earlier and stacks them into a single feature vector.

Now, we are going to create our feature and label datasets.
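To see why the model further down expects 180 input values, here is a minimal sketch of the stacking step on its own. It uses random placeholder matrices instead of real librosa output, and it assumes librosa's default of 128 mel bands; the names summarize, mfcc_frames, chroma_frames and mel_frames are made up for this illustration.

```python
import numpy as np

# Hypothetical stand-ins for the per-frame feature matrices, each shaped
# (n_features, n_frames). Real values would come from librosa.
rng = np.random.default_rng(0)
n_frames = 100
mfcc_frames = rng.normal(size=(40, n_frames))    # n_mfcc=40, as in the article
chroma_frames = rng.random(size=(12, n_frames))  # 12 pitch classes
mel_frames = rng.random(size=(128, n_frames))    # assumed default of 128 mel bands


def summarize(frames):
    """Average a (n_features, n_frames) matrix over time."""
    return np.mean(frames.T, axis=0)


# Stack the three time-averaged vectors into one fixed-length feature vector
feature_vector = np.hstack(
    [summarize(m) for m in (mfcc_frames, chroma_frames, mel_frames)]
)
print(feature_vector.shape)  # (180,)
```

Averaging over time is what makes clips of different lengths comparable: every clip ends up as 40 + 12 + 128 = 180 numbers.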

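x, y = [], []
for file in audio_files:
       file_name = file.split('/')[-1]
       emotion = file_name.split('-')[2]  # assumption: emotion code is the 3rd '-' field
       if emotion not in our_emotion:
              continue  # skip emotions we are not classifying
       x.append(extract_feature(file, mfcc=True, chroma=True, mel=True))
       y.append(emotion)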

When you download the dataset, you will see that the names of the audio files encode a description of each recording. Therefore, we split the file name, as done above, to obtain the emotion label for each extracted feature vector.

Now, we will normalize our dataset using the MinMaxScaler class of the sklearn.preprocessing module.
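The article does not spell out the naming scheme, so the following is only a sketch of the file-name splitting, assuming RAVDESS-style names in which the third hyphen-separated field is an emotion code. The EMOTION_CODES table, the emotion_from_path helper and the example path are hypothetical; check them against the description that comes with your dataset.

```python
import os

# Hypothetical code-to-label table, assuming RAVDESS-style file names
EMOTION_CODES = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}


def emotion_from_path(path):
    """Return the emotion label encoded in the file name, or None if unknown."""
    file_name = os.path.basename(path)
    stem = os.path.splitext(file_name)[0]
    fields = stem.split("-")
    if len(fields) < 3:
        return None  # name does not follow the assumed pattern
    return EMOTION_CODES.get(fields[2])


print(emotion_from_path("data/Actor_12/03-01-06-01-02-01-12.wav"))  # fearful
```

Returning None for unexpected names lets the calling loop skip such files instead of crashing.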

from sklearn.preprocessing import MinMaxScaler
scaler  =  MinMaxScaler()
x = scaler.fit_transform(x)
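To make clear what this scaling does, here is a small sketch of the same arithmetic in plain NumPy: each column is shifted by its minimum and divided by its range, which is what MinMaxScaler does with its default range of 0 to 1. The min_max_scale helper and the toy matrix are for illustration only.

```python
import numpy as np


def min_max_scale(x):
    """Rescale each column of x to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # Constant columns have zero range; divide by 1 so they map to 0
    col_range[col_range == 0] = 1.0
    return (x - col_min) / col_range


# Toy feature matrix: 3 samples, 2 features on very different scales
features = np.array([[1.0, 200.0], [2.0, 400.0], [4.0, 300.0]])
scaled = min_max_scale(features)
print(scaled)
```

After scaling, both columns lie between 0 and 1, so no single feature dominates just because of its units.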

We are going to define the Architecture of the Model:

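from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
# Sizes and input shape (180, 1) come from the summary; activations and dropout rate are assumptions.
model = Sequential()
model.add(Dense(256, activation='relu', input_shape=(180, 1)))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(512, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Flatten())
model.add(Dense(4, activation='softmax'))
model.summary()

Model: "sequential_10"
Layer (type)                 Output Shape              Param #   
dense_35 (Dense)             (None, 180, 256)          512       
dense_36 (Dense)             (None, 180, 512)          131584    
dropout_10 (Dropout)         (None, 180, 512)          0         
dense_37 (Dense)             (None, 180, 512)          262656    
dense_38 (Dense)             (None, 180, 256)          131328    
flatten_8 (Flatten)          (None, 46080)             0         
dense_39 (Dense)             (None, 4)                 184324    
Total params: 710,404
Trainable params: 710,404
Non-trainable params: 0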
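The parameter counts in the summary can be checked by hand, which also reveals the layer sizes. A dense layer has one weight per input-unit pair plus one bias per unit. The sketch below reads the sizes off the summary; the input size of 1 for the first layer is inferred from its 512 parameters (1 × 256 + 256), and dense_params is a helper made up for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


# (layer name, inputs per position, units), read off the model summary
layers = [
    ("dense_35", 1, 256),
    ("dense_36", 256, 512),
    ("dense_37", 512, 512),
    ("dense_38", 512, 256),
    ("dense_39", 180 * 256, 4),  # after Flatten: 180 * 256 = 46080 inputs
]

counts = {name: dense_params(n_in, n_out) for name, n_in, n_out in layers}
total = sum(counts.values())
print(counts)
print(total)  # 710404
```

Dropout and Flatten add no parameters, so the five dense layers account for the full total of 710,404.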

We trained the model and got an accuracy of 70% on both the training and testing datasets.

You can increase the accuracy of the model by hyperparameter tuning. Congrats, we have created a Speech Emotion Recognition model.
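The training code is not shown in the article, so the following is only a sketch of two steps around it, written in NumPy so it runs without Keras. It assumes the (None, 180, 1) input shape implied by the model summary and a 4-class probability output; prepare_inputs, accuracy and the toy arrays are hypothetical.

```python
import numpy as np


def prepare_inputs(x):
    """Add a trailing axis: (n_samples, 180) -> (n_samples, 180, 1)."""
    x = np.asarray(x, dtype=np.float32)
    return x.reshape(x.shape[0], x.shape[1], 1)


def accuracy(probabilities, labels):
    """Fraction of samples whose most probable class equals the true label."""
    predicted = np.argmax(probabilities, axis=1)
    return float(np.mean(predicted == np.asarray(labels)))


# Toy data: 5 samples, 180 features, 4 emotion classes
x_toy = np.zeros((5, 180))
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
])
labels = [0, 1, 3, 2, 2]

print(prepare_inputs(x_toy).shape)  # (5, 180, 1)
print(accuracy(probs, labels))      # 0.8
```

Computing the same accuracy on the training and the testing split, as the article does, is a quick way to spot overfitting: a large gap between the two would be a warning sign.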

If you have any doubts or suggestions, you are most welcome to share them in the comments box.
