Speech Emotion Recognition in Python
Hey ML enthusiasts, how can a machine judge your mood from your speech the way humans do?
In this article, we are going to build a Speech Emotion Recognition system. To follow along more easily, download the dataset and the notebook and work through them alongside the article.
REQUIREMENTS:
- Keras
- Librosa (for audio loading, feature extraction, and visualisation)
- NumPy and scikit-learn (used in the feature extraction and preprocessing code below)
AUDIO AS FEATURE, HOW?
Audio can be visualized as a wave varying over time, and the sampled values of that wave give us numbers we can use to build a classification system. Below you can see the waveform of one of the audio files in the dataset.
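As a minimal sketch of how such a plot can be produced (the file path is a placeholder; `waveshow` requires librosa 0.9+, older versions use `waveplot` instead):

import librosa as ls
import librosa.display
import matplotlib.pyplot as plt

# Load one audio file as a floating-point time series (path is a placeholder)
X, sample_rate = ls.load("speech_sample.wav")

# Plot the raw waveform: amplitude over time
plt.figure(figsize=(10, 3))
librosa.display.waveshow(X, sr=sample_rate)
plt.title("Waveform of a sample audio file")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()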
We are going to represent each audio file with 3 features:
- MFCC (Mel Frequency Cepstral Coefficients): represents the short-term power spectrum of a sound.
- Chroma: represents the energy in the 12 different pitch classes.
- Mel: the mel-scaled spectrogram of the signal.
Python Program: Speech Emotion Recognition
import librosa as ls
import numpy as np

def extract_feature(file_name, mfcc, chroma, mel):
    # Load the audio file as a floating-point time series
    X, sample_rate = ls.load(file_name)
    if chroma:
        # Chroma features are computed from the short-time Fourier transform
        stft = np.abs(ls.stft(X))
    result = np.array([])
    if mfcc:
        # 40 MFCCs, averaged over time
        mfccs = np.mean(ls.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
        result = np.hstack((result, mfccs))
    if chroma:
        # 12 pitch-class energies, averaged over time
        chroma = np.mean(ls.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
        result = np.hstack((result, chroma))
    if mel:
        # 128 mel bands, averaged over time (note the keyword argument y=X,
        # which recent librosa versions require)
        mel = np.mean(ls.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
        result = np.hstack((result, mel))
    return result
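As a quick sanity check, you can call the function on one clip (the file name here is a hypothetical RAVDESS-style example). With the settings above, MFCC contributes 40 values, chroma 12, and the mel spectrogram 128, so each clip becomes a 180-dimensional vector, which is the 180 you will see later in the model summary:

feature = extract_feature("Actor_01/03-01-05-01-01-01-01.wav", mfcc=True, chroma=True, mel=True)
print(feature.shape)   # (180,): 40 MFCC + 12 chroma + 128 mel bands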
In the code above, we define a function that extracts the audio feature representation discussed earlier. Now we are going to build our feature matrix and label list.
x, y = [], []
for file in audio_files:               # audio_files: list of paths to the dataset's .wav files
    file_name = file.split('/')[-1]
    # The third hyphen-separated field of the file name encodes the emotion
    emotion = emotion_dic[file_name.split("-")[2]]
    if emotion not in our_emotion:     # keep only the emotions we want to classify
        continue
    feature = extract_feature(file, mfcc=True, chroma=True, mel=True)
    x.append(feature)
    y.append(emotion)
When you download the dataset, you will see that each audio file name encodes a description of the clip. That is why we split the file name above to extract the emotion label.
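As a concrete sketch, assuming the RAVDESS naming scheme (where the third hyphen-separated field encodes the emotion), `emotion_dic` and the subset `our_emotion` might look like this; the four emotions chosen here are only an example, not the article's fixed choice:

emotion_dic = {
    '01': 'neutral', '02': 'calm',    '03': 'happy',   '04': 'sad',
    '05': 'angry',   '06': 'fearful', '07': 'disgust', '08': 'surprised'
}

# The model below has 4 output classes, so keep only 4 emotions (example subset)
our_emotion = ['neutral', 'happy', 'sad', 'angry']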
Now we will normalize our feature matrix using the MinMaxScaler class from the sklearn.preprocessing library.
from sklearn.preprocessing import MinMaxScaler

# Scale every feature to the [0, 1] range
scaler = MinMaxScaler()
x = scaler.fit_transform(x)
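Before training we also need to encode the string labels, split off a test set, and reshape the features, since the model below uses input_shape=(x.shape[1], 1) and so expects each 180-dimensional vector as a (180, 1) array. The article does not show this step; here is a minimal sketch (the 25% test fraction is an assumption):

import numpy as np
from sklearn.model_selection import train_test_split

# Encode string labels as integers 0..3 for the softmax output
label_to_idx = {e: i for i, e in enumerate(sorted(set(y)))}
y_encoded = np.array([label_to_idx[label] for label in y])

# Hold out a test set (the 0.25 split is an assumption, not from the article)
x_train, x_test, y_train, y_test = train_test_split(
    x, y_encoded, test_size=0.25, random_state=42)

# Add a trailing axis so each sample has shape (180, 1) for the Dense layers
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)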
Next, we define the architecture of the model:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten

model = Sequential()
model.add(Dense(256, input_shape=(x.shape[1], 1)))
model.add(Dense(512))
model.add(Dropout(0.25))                    # drop 25% of units to reduce overfitting
model.add(Dense(512))
model.add(Dense(256))
model.add(Flatten())
model.add(Dense(4, activation='softmax'))   # one output per emotion class
model.summary()
Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_35 (Dense)             (None, 180, 256)          512
_________________________________________________________________
dense_36 (Dense)             (None, 180, 512)          131584
_________________________________________________________________
dropout_10 (Dropout)         (None, 180, 512)          0
_________________________________________________________________
dense_37 (Dense)             (None, 180, 512)          262656
_________________________________________________________________
dense_38 (Dense)             (None, 180, 256)          131328
_________________________________________________________________
flatten_8 (Flatten)          (None, 46080)             0
_________________________________________________________________
dense_39 (Dense)             (None, 4)                 184324
=================================================================
Total params: 710,404
Trainable params: 710,404
Non-trainable params: 0
_________________________________________________________________
After training, the model reaches an accuracy of about 70% on both the training and testing datasets.
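The article does not show the training call itself; here is a minimal sketch of how the model might be compiled and trained (the optimizer, loss, epoch count, and batch size are assumptions, and x_train/y_train come from the split sketched earlier):

# Integer labels with a softmax output pair naturally with this loss
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=50, batch_size=32)

# Evaluate on the held-out test set
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc:.2f}")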
You can increase the accuracy of the model with hyperparameter tuning. Congrats, we have created a Speech Emotion Recognition model. For further projects, visit here.
If you have any doubts or suggestions, you are most welcome to drop your views in the comments box.