The corresponding paper for this repository can be found here.
Motivated by the scarcity of existing emoji embedding models and their limited grasp of the evolving emotional content of emoji, we have created novel emoji embeddings from the emotional content of their dictionary meanings. The resulting embeddings are generally more accurate than state-of-the-art embeddings when tested on the task of sentiment analysis.
Because these embeddings were also trained on keywords, they are robust and can be applied successfully to other natural language tasks such as emotion, cyberbullying and sarcasm detection. The current embedding file covers all 1,816 emojis in Unicode v13.1 (Unicode.org) and will be updated as new emojis are added.
We scraped key emotive words from the online emoji dictionaries Emojipedia and Emojis.Wiki to create a new dataset. This is the script we used to scrape each emoji description from these websites. Using a list of uniquely emotive, sensory and other keywords, we used the Python library Beautiful Soup to extract any matching words from each emoji's description.
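A minimal sketch of this matching step (the URL and keyword list are illustrative stand-ins, not the actual scraping script):

```python
import requests
from bs4 import BeautifulSoup

# Illustrative keyword list -- the real list covers emotive, sensory and other terms
KEYWORDS = {"future", "magic", "mysterious", "fortune"}

def scrape_keywords(url):
    """Fetch an emoji description page and keep only the matched keywords."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    tokens = soup.get_text(separator=" ").lower().split()  # collapse page text into tokens
    return sorted(KEYWORDS.intersection(tokens))

# Hypothetical description page for the crystal ball emoji
print(scrape_keywords("https://emojipedia.org/crystal-ball/"))
```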
The final dataset structure looks like this: 🔮 crystal ball future magic mysterious
The full dataset can be found here.
For the model to train on the data, it needs to be in a tab-delimited, newline-delimited format, one sample per line:
- crystal ball 🔮 True
- magic 🔮 True
- mysterious 🔮 True
To achieve this we created a script that converts the dataset format and shuffles the data.
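A rough sketch of what that conversion does (the file names are assumptions; see the linked script for the real code):

```python
import random

samples = []
# Each input line is "<emoji> <keyword> <keyword> ...", as in the example above
with open("emojional_dataset.txt", encoding="utf-8") as f:   # assumed input file name
    for line in f:
        if not line.strip():
            continue
        emoji, *keywords = line.split()
        # Keywords can be multi-word phrases in the real data; for simplicity
        # this sketch treats each token as its own phrase
        for kw in keywords:
            samples.append(f"{kw}\t{emoji}\tTrue")

random.shuffle(samples)
with open("train_raw.txt", "w", encoding="utf-8") as f:      # assumed output file name
    f.write("\n".join(samples))
```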
To make quality embeddings, we also created negative samples (one way to generate them is sketched after the examples below):
- ripe fruits 🔮 False
- dirt 🔮 False
- approval 🔮 False
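One way to generate such negatives is to pair an emoji with phrases drawn from other emojis' descriptions; a sketch, assuming the true (phrase, emoji) pairs are already loaded:

```python
import random

def make_negatives(positives, n):
    """positives: list of (phrase, emoji) true pairs."""
    true_pairs = set(positives)
    negatives = set()
    while len(negatives) < n:
        phrase, _ = random.choice(positives)
        _, emoji = random.choice(positives)
        if (phrase, emoji) not in true_pairs:   # keep only genuine mismatches
            negatives.add((phrase, emoji))
    return [f"{p}\t{e}\tFalse" for p, e in negatives]
```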
Our full dataset consists of 10854 true samples and 890 false samples. We use a 91.8% train / 4.1% test / 4.1% development split (a sketch of the split follows the file list below).
The data used to train the model can be found here.
- train.txt consists of 9964 true samples.
- test.txt consists of 445 true samples and 445 false samples.
- dev.txt consists of 445 true samples and 445 false samples, which are distinct from those in test.txt.
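A sketch of how such a split could be produced from a combined file (the file name is an assumption; the counts follow the figures above):

```python
import random

with open("emojional_all.txt", encoding="utf-8") as f:       # assumed combined file
    lines = f.read().splitlines()
true_samples  = [l for l in lines if l.endswith("True")]     # 10854 lines
false_samples = [l for l in lines if l.endswith("False")]    # 890 lines
random.shuffle(true_samples)
random.shuffle(false_samples)

# 445 true + 445 false each for test.txt and dev.txt; the remaining 9964 true go to train.txt
test  = true_samples[:445]    + false_samples[:445]
dev   = true_samples[445:890] + false_samples[445:890]
train = true_samples[890:]
for name, split in [("train.txt", train), ("test.txt", test), ("dev.txt", dev)]:
    with open(name, "w", encoding="utf-8") as f:
        f.write("\n".join(split))
```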
The testing folder contains the same 20 true samples in each of its files:
- train.txt uses 20 true samples
- test.txt uses 20 true samples
- dev.txt uses 20 true samples
We used a PyTorch implementation of emoji2vec [1]; the original implementation of emoji2vec can be found here [2]. The model generates 300-dimensional emoji vectors, training in batches of 8 (4 positive and 4 negative examples) at a learning rate of 0.001. It trains for up to 60 epochs with early stopping on a held-out development set, and outputs various metrics including accuracy and F1 score.
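At its core, emoji2vec scores an emoji vector against the summed word2vec vectors of a description phrase through a sigmoid and trains with binary cross-entropy. A minimal PyTorch sketch of one training step with the hyperparameters above (the names and random stand-in data are illustrative, not the implementation in [1]):

```python
import torch
import torch.nn as nn

class Emoji2Vec(nn.Module):
    def __init__(self, num_emojis, dim=300):
        super().__init__()
        # One 300-dimensional vector per emoji, learned from scratch
        self.emoji_vecs = nn.Embedding(num_emojis, dim)

    def forward(self, emoji_ids, phrase_vecs):
        # phrase_vecs: sums of pretrained word2vec vectors per phrase, shape (batch, 300)
        return (self.emoji_vecs(emoji_ids) * phrase_vecs).sum(dim=1)  # raw logits

model = Emoji2Vec(num_emojis=1816)
loss_fn = nn.BCEWithLogitsLoss()                 # applies the sigmoid internally
opt = torch.optim.Adam(model.parameters(), lr=0.001)

# One batch of 8: 4 positive and 4 negative (phrase, emoji) pairs
emoji_ids   = torch.randint(0, 1816, (8,))
phrase_vecs = torch.randn(8, 300)                # stand-in for real word2vec sums
labels      = torch.tensor([1., 1., 1., 1., 0., 0., 0., 0.])

opt.zero_grad()
loss = loss_fn(model(emoji_ids, phrase_vecs), labels)
loss.backward()
opt.step()
```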
We downloaded the repository of the PyTorch implementation of emoji2vec [1] and updated the file 'presentation.ipynb'. We replaced the data folder with our new data and downloaded the pretrained Google News word2vec vectors to run this implementation.
If the file 'phrase_embeddings.pkl' exists in the 'pre-trained' folder, it needs to be deleted so that a new dictionary is created from the new dataset. Running 'presentation.ipynb' then trains the emoji embeddings; this implementation of the model produces our emojional embeddings.
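For example, a quick way to do this before re-running the notebook (the path is taken from the folder name above):

```python
import os

pkl = os.path.join("pre-trained", "phrase_embeddings.pkl")
if os.path.exists(pkl):
    os.remove(pkl)  # forces the notebook to rebuild the phrase dictionary from the new data
```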
We downloaded the repository of emoji2vec [2] and updated several files to current Python standards. We tested different versions of our emoji embedding output files by adding them to the folder 'data/word2vec', alongside a copy of the Google News word2vec embeddings. The file 'TwitterClassification.ipynb' executes the testing.
We compared our emoji embeddings against the state-of-the-art emoji embeddings on a Twitter sentiment analysis task using a 2015 dataset. Our emojional embeddings generally beat the other embeddings with Random Forests and scored second highest with Linear SVM.
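The general recipe for this evaluation is to represent each tweet as the average of its word and emoji vectors and fit a classifier on top. A sketch with toy stand-in vectors (the real experiment uses Google News word2vec, the emojional embeddings, and the 2015 Twitter dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def tweet_vector(tokens, word_vecs, emoji_vecs, dim=300):
    """Average the vectors of every known word or emoji in a tweet."""
    vecs  = [word_vecs[t] for t in tokens if t in word_vecs]
    vecs += [emoji_vecs[t] for t in tokens if t in emoji_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy stand-ins for the real pretrained vectors
word_vecs  = {"love": np.random.rand(300), "hate": np.random.rand(300)}
emoji_vecs = {"🔮": np.random.rand(300)}

X = np.array([tweet_vector(t.split(), word_vecs, emoji_vecs)
              for t in ["love 🔮", "hate"]])
y = np.array([1, 0])                               # toy sentiment labels
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```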
We have evaluated the emoji embeddings on a list of emotions, sensations, feelings and keywords. Each emoji embedding can be seen to successfully capture multiple senses.
We also present our results as t-SNE visualisations, where clusters of emotions can be seen in 2D space. We used the Microsoft repository Emoji2recipe [3] and updated the 'Visualisation.ipynb' script to work with current package standards.
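A minimal sketch of such a visualisation with scikit-learn and matplotlib (assuming gensim 4.x; the plotting details are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import gensim

e2v = gensim.models.KeyedVectors.load_word2vec_format("emojional.bin", binary=True)
emojis  = list(e2v.index_to_key)
vectors = np.array([e2v[e] for e in emojis])

# Project the 300-d emoji vectors down to 2-d
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vectors)

plt.figure(figsize=(12, 12))
for (x, y), e in zip(coords, emojis):
    plt.annotate(e, (x, y))
plt.show()
```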
To use the embeddings, download the emojional.bin file and include the following code in your model:
```python
import gensim

# Load the emojional embeddings as gensim KeyedVectors
e2v = gensim.models.KeyedVectors.load_word2vec_format("emojional.bin", binary=True)
```
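The loaded vectors can then be queried like any other gensim KeyedVectors, for example:

```python
vec = e2v["🔮"]                          # 300-dimensional vector for a single emoji
print(e2v.most_similar("🔮", topn=5))    # nearest emojis in the embedding space
```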
[1] "pwiercinski/emoji2vec_pytorch", GitHub. [Online]. Available: https://github.com/pwiercinski/emoji2vec_pytorch. [Accessed: 30-Mar-2021].
[2] "uclnlp/emoji2vec", GitHub. [Online]. Available: https://github.com/uclnlp/emoji2vec. [Accessed: 30-Mar-2021].
[3] "microsoft/Emoji2recipe", GitHub. [Online]. Available: https://github.com/microsoft/Emoji2recipe. [Accessed: 30-Mar-2021].