There is a 1-second delay in classification
This software uses the tonal qualities of speech to determine a person's emotions. It is intended for voice-only scenarios, since audio-visual input, where available, can drastically improve accuracy and help a model generalize to a wider range of inputs.
Python libraries such as librosa, scikit-learn, pyaudio, wave, and numpy are used in the production application.
Additionally, Python libraries such as tensorflow and pydub were used in the research stage.
Windows users can download a zipped file from zippyshare (External)
Click Here To Download Zip
Though major updates will be maintained on the zippyshare link, smaller changes such as new models will not be. Thus, I encourage you to use method 2.
```bash
git clone https://github.com/freakingrocky/EmoCh.git
cd EmoCh
python emoch.py
```
You can use the command-line interface to test out any changes you made:

```bash
python live_classifier.py
```
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
Only the audio is used from this dataset; the recordings are spoken with a North American accent. (A sketch of how each clip's emotion label is read from its filename follows the dataset list.)
Toronto emotional speech set (TESS)
This dataset consists of recordings from two speakers, aged 26 and 64, from the Toronto area (Canada).
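For context, each RAVDESS filename encodes its emotion label in the third dash-separated field (e.g. 03-01-05-01-02-01-12.wav is an angry utterance). Below is a minimal sketch of recovering that label; the emotion codes follow the RAVDESS documentation, while the helper function itself is illustrative and not part of this project's code.

```python
import os

# Emotion codes from the RAVDESS documentation (third dash-separated field).
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_ravdess_filename(path):
    """Return the emotion label encoded in a RAVDESS audio filename."""
    code = os.path.basename(path).split("-")[2]
    return RAVDESS_EMOTIONS[code]

print(emotion_from_ravdess_filename("Actor_12/03-01-05-01-02-01-12.wav"))  # angry
```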
Initially, tonnetz was also used as a feature, but it was removed because it yielded only a negligible accuracy boost.
The features are extracted and provided as input to the various models for training.
The extracted features are as follows:
- mfcc: Mel-Frequency Cepstral Coefficients
  Significance: These can be used as a heuristic metric for short-term energy levels in the voice.
- mel: Mel Frequency Scale
  Significance: The mel scale is a heuristic scale of pitches as perceived by human listeners.
- chroma: Chroma
  Significance: This metric represents how the signal's energy is distributed across the twelve pitch classes.
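As a rough sketch of how these three features might be extracted with librosa and reduced to a fixed-length vector (the n_mfcc value and the time-averaging are assumptions, not the project's exact settings):

```python
import librosa
import numpy as np

def extract_features(path):
    """Return a fixed-length vector of MFCC, mel and chroma features for one clip."""
    y, sr = librosa.load(path, sr=None)
    stft = np.abs(librosa.stft(y))

    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)

    # Concatenate into a single feature vector (40 + 128 + 12 values).
    return np.hstack([mfcc, mel, chroma])
```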
Once the features were extracted, experiments were run to determine which AI or ML model is best suited for the task. Several machine learning and artificial intelligence models were trained and tested; different approaches were tried, and data augmentation was applied as appropriate for each model (a sketch of typical augmentations follows).
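The exact augmentations used are not listed here, but as an illustration, typical waveform-level augmentations for speech include mixing in low-level noise and shifting pitch with librosa; the parameter values below are arbitrary examples:

```python
import librosa
import numpy as np

def add_noise(y, noise_level=0.005):
    """Mix low-amplitude Gaussian noise into the waveform."""
    return y + noise_level * np.random.randn(len(y))

def shift_pitch(y, sr, n_steps=2):
    """Shift the waveform's pitch by n_steps semitones."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
```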
Here are their accuracy results on the randomly selected testing set:
- Support Vector Machine
  Best Accuracy Achieved: 77.26%
- Nearest Neighbor Classifier (One Node)
  Best Accuracy Achieved: 88.77%
- K-Nearest Neighbor Classifier (4 Nodes)
  Best Accuracy Achieved: 86.70%
- Naive Bayes Classifier
  Best Accuracy Achieved: 58.02%
- Deep Neural Network (with no user-defined features)
  Best Accuracy Achieved: 91.81%
  Extremely high model load times; requires high processing power (not suitable for the intended use-case). This result is also probably due to overfitting.
- Multi-Layer Perceptron (Convolutional Neural Network for internal classification)
  Best Accuracy Achieved: 89.62%
From the results above, it is clear that the Multi-Layer Perceptron is best suited for the task at hand.
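A minimal sketch of training and evaluating such a model with scikit-learn's MLPClassifier follows; the directory layout, hyperparameters, and train/test split are illustrative assumptions, and extract_features is the librosa sketch from the feature section above, not the project's tuned pipeline.

```python
import glob
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Hypothetical layout: data/<emotion>/<clip>.wav;
# extract_features() is the librosa sketch shown earlier.
X, y = [], []
for path in glob.glob("data/*/*.wav"):
    X.append(extract_features(path))
    y.append(os.path.basename(os.path.dirname(path)))  # label from the folder name

X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), np.array(y), test_size=0.25, random_state=42)

model = MLPClassifier(hidden_layer_sizes=(300,), alpha=0.01, batch_size=256,
                      learning_rate="adaptive", max_iter=500)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```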
The GUI is made using PySide2, which is licensed under the LGPL (GNU Lesser General Public License).
The splash screen checks for a microphone on the system and then starts the main application, which immediately begins classifying microphone audio with a 1-second delay (see the sketch below).
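As an illustrative sketch of that flow (not the application's actual code), checking for a microphone and capturing roughly one second of audio per classification pass with pyaudio could look like this; the sample rate and the commented classification step are assumptions:

```python
import numpy as np
import pyaudio

RATE = 16000   # assumed sample rate
CHUNK = 1024   # frames read per call

def has_input_device(p):
    """Return True if the system exposes at least one input-capable audio device."""
    return any(p.get_device_info_by_index(i)["maxInputChannels"] > 0
               for i in range(p.get_device_count()))

p = pyaudio.PyAudio()
if not has_input_device(p):
    raise RuntimeError("No microphone found")

stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)
try:
    while True:
        # Gather roughly one second of audio, then hand it to the classifier.
        frames = [stream.read(CHUNK) for _ in range(RATE // CHUNK)]
        audio = np.frombuffer(b"".join(frames), dtype=np.int16)
        # ...feature extraction + trained model prediction would run here...
except KeyboardInterrupt:
    stream.stop_stream()
    stream.close()
    p.terminate()
```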
- Train on More Data
- Train on data with noise AND noise removal AI integration
- Real-time audio stream classification (the current problem is that not enough samples are available in real time)
- An AI/ML classifier is only as good as the data it has been trained on. The data used in this project is open-source data designed for this use-case, which consists of speakers from Western countries. The accuracy results are based entirely on this data.
- People from different geographical regions may express emotions with different tonal qualities in their voices; this means the classifier will be biased by the geographical regions represented in the data it was trained on.
- 1-second delay in classification