This project aims to demonstrate the feasibility of translating American Sign Language in real time.
The results of the study are showcased in a real-time demo application. The GIFs below are extracted from the webcam video stream.
A client-server web app has also been implemented.
The following GIF shows how it works.
Follow the instructions below to get a clean installation.
Download the WLASL dataset.
git clone https://github.com/dxli94/WLASL
Create and activate a new virtual environment in the project folder.
~/project_folder$ virtualenv .env
~/project_folder$ source .env/bin/activate
- Clone the repo.
(.env) git clone https://github.com/simonefinelli/ASL-Recognition-backup
- Install requirements.
(.env) python -m pip install -r requirements.txt
- Split the WLASL dataset into the required format using the script in
'tools/dataset splitting/' (a rough sketch of the idea is shown after these steps).
(.env) python k_gloss_splitting.py ./WLASL_full/ 2000
- Copy the pre-processed dataset into the 'data' folder.
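For reference, the snippet below sketches the rough idea behind the gloss-based split: keep the first K glosses of the WLASL metadata and sort their clips into train/val/test folders. It is only an illustration, not the actual k_gloss_splitting.py; the metadata file name (WLASL_v0.3.json), its gloss/instances/split layout and the videos/ folder are assumptions about how the raw WLASL download is organised.

```python
# Illustrative sketch of a K-gloss split (the repo's k_gloss_splitting.py is
# the reference implementation).
# Assumed layout: <dataset_dir>/WLASL_v0.3.json is a list of
# {"gloss": ..., "instances": [{"video_id": ..., "split": ...}, ...]} entries,
# and the raw clips live in <dataset_dir>/videos/<video_id>.mp4.
import json
import shutil
import sys
from pathlib import Path

def split_k_glosses(dataset_dir: str, k: int) -> None:
    dataset = Path(dataset_dir)
    meta = json.loads((dataset / "WLASL_v0.3.json").read_text())
    for entry in meta[:k]:                       # keep only the first k glosses
        gloss = entry["gloss"]
        for inst in entry["instances"]:
            src = dataset / "videos" / f"{inst['video_id']}.mp4"
            if not src.exists():                 # some WLASL clips are missing
                continue
            dst_dir = dataset / f"WLASL{k}" / inst["split"] / gloss
            dst_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst_dir / src.name)

if __name__ == "__main__":
    split_k_glosses(sys.argv[1], int(sys.argv[2]))   # e.g. ./WLASL_full/ 2000
```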
Now let's see how to use the neural network, the demo and the web app.
- To start the training run:
(.env) python train_model.py
- After training, to evaluate the best model on the test set, run:
(.env) python evaluate_model.py
- Now, the model can be used in the demo or in the web app.
- The WLASL dataset can be divided into 4 sub-datasets: WLASL100, WLASL300, WLASL1000 and WLASL2000. The models used for each sub-dataset can be found in the models.py file.
- The custom frame generator used by the model needs at least 12 frames to work. However, videos 59958, 18223, 15144, 02914 and 55325, in the WLASL1000 and WLASL2000 datasets, are shorter. To solve this problem, use the video_extender.py script (sketched below).
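Purely as an illustration, a clip that is too short can be brought up to the 12-frame minimum by duplicating its last frame. The OpenCV sketch below shows the idea; it is not the actual video_extender.py, and the output codec and FPS handling are arbitrary choices.

```python
# Illustrative sketch: pad a too-short clip to the 12-frame minimum by
# repeating its last frame (video_extender.py is the reference implementation).
import cv2

MIN_FRAMES = 12

def extend_video(src_path: str, dst_path: str) -> None:
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    if not frames:
        raise ValueError(f"could not read any frame from {src_path}")

    while len(frames) < MIN_FRAMES:              # duplicate the last frame
        frames.append(frames[-1].copy())

    h, w = frames[0].shape[:2]
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        out.write(frame)
    out.release()
```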
- To start the demo run:
(.env) python demo.py
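Roughly speaking, the demo keeps a rolling buffer of the most recent webcam frames and runs the trained network on it. The sketch below only illustrates the idea: the model file name, the 224x224 input size, the preprocessing and the use of Keras are assumptions, not the actual demo.py.

```python
# Illustrative sketch of real-time webcam inference with a rolling frame
# buffer (demo.py is the reference implementation).
from collections import deque
import cv2
import numpy as np
from tensorflow.keras.models import load_model

WINDOW = 12                                      # frames fed to the network
model = load_model("best_model.h5")              # hypothetical model path
GLOSSES = ["book", "chair", "clothes", "computer", "drink", "drum", "family",
           "football", "go", "hat", "hello", "kiss", "like", "play", "school",
           "street", "table", "university", "violin", "wall"]  # order must match training

buffer = deque(maxlen=WINDOW)
cap = cv2.VideoCapture(0)                        # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(cv2.resize(frame, (224, 224)), cv2.COLOR_BGR2RGB)
    buffer.append(rgb / 255.0)
    if len(buffer) == WINDOW:                    # enough frames for a prediction
        clip = np.expand_dims(np.array(buffer), axis=0)  # (1, 12, 224, 224, 3)
        probs = model.predict(clip, verbose=0)[0]
        label = GLOSSES[int(np.argmax(probs))]
        cv2.putText(frame, label, (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("ASL demo", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):        # press 'q' to quit
        break
cap.release()
cv2.destroyAllWindows()
```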
- To start the web app run:
(.env) python serve.py
- Go to the following URL: http://127.0.0.1:5000/
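serve.py is the reference implementation; since the app listens on Flask's default port, a minimal prediction endpoint could look roughly like the sketch below. The /predict route, the 'video' form field, the model path and the preprocessing are hypothetical names chosen for illustration, not the repo's actual code.

```python
# Illustrative sketch of a Flask prediction endpoint (serve.py is the
# reference implementation); route, field and file names are hypothetical.
import cv2
import numpy as np
from flask import Flask, jsonify, request
from tensorflow.keras.models import load_model

app = Flask(__name__)
model = load_model("best_model.h5")              # hypothetical model path
GLOSSES = ["book", "chair", "clothes", "computer", "drink", "drum", "family",
           "football", "go", "hat", "hello", "kiss", "like", "play", "school",
           "street", "table", "university", "violin", "wall"]  # order must match training

def preprocess_video(path: str, size: int = 224, n_frames: int = 12) -> np.ndarray:
    """Decode a clip into a (1, n_frames, size, size, 3) array, padding short clips."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(cv2.resize(frame, (size, size)), cv2.COLOR_BGR2RGB) / 255.0)
    cap.release()
    while len(frames) < n_frames:                # pad short clips with the last frame
        frames.append(frames[-1])
    return np.expand_dims(np.array(frames[:n_frames]), axis=0)

@app.route("/predict", methods=["POST"])
def predict():
    upload = request.files["video"]              # clip uploaded by the web client
    upload.save("/tmp/upload.mp4")
    probs = model.predict(preprocess_video("/tmp/upload.mp4"), verbose=0)[0]
    return jsonify({"gloss": GLOSSES[int(np.argmax(probs))],
                    "confidence": float(np.max(probs))})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```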
The model used in the demo and the web app was obtained by training the neural net on a custom dataset, called WLASL20custom. This dataset consists of only 20 words: book, chair, clothes, computer, drink, drum, family, football, go, hat, hello, kiss, like, play, school, street, table, university, violin and wall.
I achieved the following accuracy with the proposed models:
- WLASL20c: 63% accuracy.
- WLASL100: 34% accuracy.
- WLASL300: 28% accuracy.
- WLASL1000: 19% accuracy.
- WLASL2000: 10% accuracy.
Distributed under the MIT License. See LICENSE
for more information.