The task in this contest is to tell whether an article was written by a human or by a machine. We won first place out of 240 teams.
Given an article, we need to build algorithms that identify the type of author (automatic summary, machine translation, robot writer or human writer). For more details, see SMP EUPT 2018.
- tensorflow >= 1.4.0
- keras >= 1.2.0
- gensim
- scikit-learn
You may also need `keras.utils.vis_utils` for model visualization.
- `my_utils/`: data preprocessing
- `my_utils/data`: converts the original data to CSV files
- `my_utils/data_preprocess`: creates data sequences and batches as input for the deep learning models
- `my_utils/w2v_process`: builds the vocabularies and loads pre-trained embeddings for words and characters
- `my_utils/metrics`: calculates the precision, recall and F1 score for each category of author
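
The per-category scoring described for `my_utils/metrics` can be illustrated with a minimal, dependency-free sketch. This is not the repository's code: the function name `per_class_prf` and the label strings are hypothetical, chosen only for illustration.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs.

    Hypothetical helper for illustration; not taken from my_utils/metrics.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1          # correct prediction for this label
        else:
            fp[pred] += 1           # predicted label was wrong
            fn[truth] += 1          # true label was missed

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Example with made-up labels for two of the author categories.
y_true = ["human", "human", "robot", "robot"]
y_pred = ["human", "robot", "robot", "robot"]
scores = per_class_prf(y_true, y_pred)
```

If scikit-learn is available, `sklearn.metrics.precision_recall_fscore_support` computes the same quantities.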
There are 12 models in total, each combining word representations and character representations.
The best model we devised, word_rcnn_char_cgru, is inspired by two papers:
- A Hybrid Framework for Text Modeling with Convolutional RNN
- A C-LSTM Neural Network for Text Classification
Here are the scores of the different models:
model | off-line | on-line |
---|---|---|
word_char_cnn | 0.9888 | 0.9849 |
word_char_rnn | 0.9894 | 0.9863 |
deep_word_char_cnn | 0.9887 | 0.9828 |
word_rcnn_char_rnn | 0.9899 | 0.9879 |
word_rnn_char_rcnn | 0.9902 | 0.9872 |
word_char_cgru | 0.9896 | 0.9861 |
word_cgru_char_rcnn | 0.9904 | untested |
word_rcnn_char_cgru | 0.9910 | 0.9882 |
word_cgru_char_rnn | 0.9887 | untested |
word_rnn_char_cgru | 0.9899 | untested |
word_rnn_char_cnn | 0.9897 | 0.9862 |
word_char_rcnn | 0.9894 | 0.9884 |
- Note that rcnn comes from A Hybrid Framework for Text Modeling with Convolutional RNN, while cgru comes from A C-LSTM Neural Network for Text Classification.
The source code is derived from https://github.com/fuliucansheng/360
We use `model` to create the model architectures and `train` to train them.
We use LightGBM to ensemble the 12 models together with extra statistical features; the code is in `ensemble`. For more details, see https://github.com/TFknight/SMP-2018-Ensemble-Guide
On the test dataset, we adopt only a simple but efficient voting mechanism for ensembling, which is in `evaluate/predict`.
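
A voting ensemble of this kind can be sketched as plain majority voting over the labels predicted by each model. The sketch below is an assumption about the general technique, not the code in `evaluate/predict`: the function name `majority_vote`, the tie-breaking rule (the earliest model in the list wins) and the label strings are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    predictions: list of lists, one inner list of labels per model,
    all of the same length. Ties go to the label predicted by the
    earliest model in the list (an assumed rule, for illustration).
    """
    voted = []
    for labels in zip(*predictions):      # labels for one article, one per model
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(next(label for label in labels if counts[label] == top))
    return voted


# Three hypothetical models voting on two articles.
model_outputs = [
    ["human", "robot"],
    ["human", "summary"],
    ["robot", "summary"],
]
final_labels = majority_vote(model_outputs)
```

Ordering the models from strongest to weakest makes the tie-breaking rule favour the most reliable model.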
- `my_utils/`: data preprocessing
- `my_utils/data`: converts the original data to CSV files
- `my_utils/data_preprocess`: creates data sequences and batches as input for the deep learning models
- `my_utils/w2v_process`: builds the vocabularies and loads pre-trained embeddings for words and characters
- `my_utils/metrics`: calculates the precision, recall and F1 score for each category of author
- `models/`: creates the deep learning models
- `deepzoo`: keeps all the models
- `init/config.py`: stores the paths of models, data and so on
- `train`: trains the models
- `figure`: stores the visualizations of the models
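
The sequence and batch creation step listed for `my_utils/data_preprocess` can be sketched roughly as follows. This is a minimal illustration under assumed conventions, not the repository's implementation: the names `pad_sequence` and `make_batches`, the padding id `0`, and truncation from the right are all hypothetical choices.

```python
def pad_sequence(ids, max_len, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly max_len."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


def make_batches(sequences, labels, batch_size, max_len, pad_id=0):
    """Yield (padded_sequences, labels) pairs of at most batch_size items."""
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        padded = [pad_sequence(seq, max_len, pad_id) for seq in chunk]
        yield padded, labels[start:start + batch_size]


# Three toy id sequences of different lengths, batched two at a time.
batches = list(make_batches([[1, 2, 3], [4], [5, 6]], [0, 1, 0],
                            batch_size=2, max_len=2))
```

In a word-plus-character model, the same routine would be run twice per article: once on word ids and once on character ids, typically with different `max_len` values.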
Thanks to all my teammates in GDUFS-iiip for their efforts.
We hope that more people will join our lab: the Data Mining Lab at GDUFS (广外数据挖掘实验室).