Warning
Sadly, Gensim and spaCy currently depend on different versions of NumPy, so the requirements pin older versions of spaCy and scikit-learn. Hopefully, this will be resolved soon.
If you want to train a transformers model on GPU with spaCy, you need to download extra libraries. See here for more information.
You also need to choose and download one spaCy model, which will be used to preprocess the corpus for topic modeling and for training a classification model.
If you want to use the prodi.gy library, see here for the installation steps.
Last step: create your (two) Python environment(s).
# venv
python -m venv my_env
source my_env/bin/activate
pip install -r requirements.txt
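Once the environment is active, a quick check can confirm that the main dependencies and the spaCy model you chose are importable. The snippet below is only a sketch: it relies on the fact that spaCy models are installed as regular Python packages, and the model name `en_core_web_sm` is an assumption used for illustration, so replace it with the model you actually downloaded.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# "en_core_web_sm" is only an example; replace it with the model you chose.
for name in ("spacy", "gensim", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Anything reported as missing has to be installed in the active environment before the scripts are run.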
Before running the scripts, it helps to have an idea of how your corpus will be structured. At the end, you will have three files:
- one with the videos' metadata, captions and Gensim annotation;
- one with the comments' metadata, Perspective API annotation and agree-disagree annotation;
- one with the commentators' metadata.
No need to worry about directories: they will be created when files or models are saved.
You need one key to access the YouTube Data API v3 and another to access the Perspective API. You can also request a quota increase for YouTube or Perspective if you are particularly impatient or are scraping a large YouTube channel. I cannot guarantee that your requests will be granted.
Finally, you need to set up the .minetrc file in the directory where you will run the scripts (preferably just outside of ./echosis/*.py). Minet is needed to scrape YouTube and build the corpus.
Tip
The easiest way is to create a JSON file containing this single line: {"youtube": {"key": ["your_api_key"]}}
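If you prefer to generate the file from Python, the sketch below writes the same content as the tip above. The helper names `build_minetrc` and `write_minetrc` are hypothetical and not part of Minet or of this framework; the demo writes to a temporary directory so that no existing configuration is overwritten.

```python
import json
import tempfile
from pathlib import Path


def build_minetrc(api_key: str) -> dict:
    """Return the configuration from the tip above for a single YouTube key."""
    return {"youtube": {"key": [api_key]}}


def write_minetrc(api_key: str, directory: Path) -> Path:
    """Write the configuration as JSON to <directory>/.minetrc and return its path."""
    path = directory / ".minetrc"
    path.write_text(json.dumps(build_minetrc(api_key)), encoding="utf-8")
    return path


# Demo in a temporary directory; replace "your_api_key" with your real key
# and pass the directory where you will run the scripts.
with tempfile.TemporaryDirectory() as tmp:
    written = write_minetrc("your_api_key", Path(tmp))
    print(written.read_text(encoding="utf-8"))
```

To create the real file, call `write_minetrc` with your own key and the directory from which you will run the scripts, for example `Path.cwd()`.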
All functions are commented, and the Python files in the docs directory show how to import and use every part of the processing chain. Soon, you will be able to use the framework through a command-line interface.
Guillaume Plique, Pauline Breteau, Jules Farjas, Héloïse Théro, Jean Descamps, Amélie Pellé, Laura Miguel, César Pichon, & Kelly Christensen. (2019). Minet, a webmining CLI tool & library for python. Zenodo. http://doi.org/10.5281/zenodo.4564399
Gensim. https://radimrehurek.com/gensim/models/ldamodel.html
Perspective API. https://current.withgoogle.com/the-current/toxicity/
spaCy. https://spacy.io/