Datadays 2019 Challenge 2 - Deep Learning Based Persian NLP

Programming by Mojtaba Valipour @ SUTest-V1.0.0, vpcom.ir

Copyright 2019

Title: Deep Hierarchical Persian Text Classification based on HDLTex

Information about the conda environment (Anaconda)

  • Environment: hdlTex, vpcomDesk -> hdlTex.yml (conda list --explicit > hdlTex.yml)

  • Python: 3.5.6

  • TensorFlow: 1.10.0

  • Keras: 2.2.2

  • Pandas: 0.23.4

  • nltk: 3.3.0

  • numpy: 1.15.2

  • CUDA: 9.0

  • GPU: GeForce GTX 1080

  • CPU: Intel® Core™ i7-2600K CPU @ 3.40GHz × 8

  • RAM: 12GB

  • OS: Ubuntu 16.04 LTS 64-bit
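A quick way to sanity-check that the activated environment matches these versions is a minimal sketch like the one below (the package names are the standard import names; the expected versions are simply the ones listed above):

import sys

import keras
import nltk
import numpy
import pandas
import tensorflow

expected = {"tensorflow": "1.10.0", "keras": "2.2.2",
            "pandas": "0.23.4", "nltk": "3.3.0", "numpy": "1.15.2"}

print("Python:", sys.version.split()[0])  # expect 3.5.6
for module in (tensorflow, keras, pandas, nltk, numpy):
    # Compare the installed version against the one this repo was tested with.
    print(module.__name__, module.__version__,
          "(expected %s)" % expected[module.__name__])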

Main Objective

The main objective is given below, translated from the original Persian:

Challenge2_DivarDataset_DataDays, Sharif University

Part One

Title: Category prediction

Score: 3,000 points

Skills: Machine learning and text analysis

Problem: Predict an ad's category from its other attributes.

Description: In this part you download a dataset of 200,000 rows, where each row contains the information for one ad. You must determine the hierarchical category of each ad and upload it as a csv file with 200,000 rows and the three columns cat1, cat2, cat3.

Important note: The structure of the answer must be exactly as described. In addition, every category must be written in the same form it has in the Train dataset. A sample of the desired answer is attached in this file.
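Since the submission format is strict, here is a minimal sketch of writing a file in that shape with pandas (the column names cat1, cat2, cat3 and the 200,000-row count come from the task statement; the placeholder label strings are hypothetical and must be replaced by labels copied verbatim from the Train set):

import pandas as pd

N = 200000  # one row per test ad
# Hypothetical placeholder predictions; real ones come from the trained models.
cat1_pred = ["for-sale"] * N
cat2_pred = ["electronic-devices"] * N
cat3_pred = ["mobile-phones"] * N

submission = pd.DataFrame({"cat1": cat1_pred, "cat2": cat2_pred, "cat3": cat3_pred},
                          columns=["cat1", "cat2", "cat3"])
assert submission.shape == (N, 3)
submission.to_csv("resultsChallenge2.csv", index=False)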

Hints:

  1. dataDaysChallenge2-Github.ipynb is only for reference, to show how I prepared the main code. Some parts are not compatible with the recent changes!
  2. dataDaysChallenge_BIGNet.py is the main code; all the other files are here for your reference only!
  3. Make sure you have enough permissions and free storage!
  4. preProcessFlag = True should be set for the first run; later runs can reuse the exported processed files (see the sketch after this list).

Configuration Example:

You have to change the config vars based on your needs:

Results for this config on the test set:

  • All Categories Acc: 0.9433751213541007
  • Cat1 Acc: 0.98198683044194
  • Cat2 Acc: 0.9710966189692288
  • Cat3 Acc: 0.9520493014224811
epochs = 15  # Number of epochs to train the main model
level2Epochs = 25  # Number of epochs to train the level 2 models
level3Epochs = 40  # Number of epochs to train the level 3 models
MAX_SEQUENCE_LENGTH = 100  # Maximum sequence length, in words
MAX_NB_WORDS = 55000  # Maximum number of unique words
EMBEDDING_DIM = 300  # Embedding dimension; you can change it to one of {25, 100, 150, 300}, but you need the matching fastText version in your directory
batch_size_L1 = int(3048/2)  # Batch size in Level 1
batch_size_L2 = int(3048/2)  # Batch size in Level 2
batch_size_L3 = int(3048/2)  # Batch size in Level 3
L1_model = 2  # Model type for Level 1: 0 is DNN, 1 is CNN, and 2 is RNN
L2_model = 2  # Model type for Level 2: 0 is DNN, 1 is CNN, and 2 is RNN
L3_model = 2  # Model type for Level 3: 0 is DNN, 1 is CNN, and 2 is RNN
rnnType = 4  # RNN model: 0: GRU, 1: ConvLSTM, 2: RNN+DNN, 3: Attention, 4: Big
trainingBigNetFlag = True  # One model for all levels (allInONE ;P); the other flags will be False automatically
testBigNetFlag = True  # One model for all levels; the other flags will be False automatically
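For reference, per-level and all-level numbers like the ones above can be reproduced with a few lines of pandas. This is a sketch under the assumption that "Acc" means plain exact-match accuracy and that a ground-truth file with the same cat1/cat2/cat3 columns is available (labels.csv below is a hypothetical name):

import pandas as pd

cols = ["cat1", "cat2", "cat3"]
true = pd.read_csv("labels.csv")  # hypothetical ground-truth file
pred = pd.read_csv("resultsChallenge2.csv")

for col in cols:
    print("%s Acc:" % col.capitalize(), (true[col] == pred[col]).mean())

# "All Categories Acc": all three levels must match on the same row.
print("All Categories Acc:", (true[cols] == pred[cols]).all(axis=1).mean())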

Run:

source activate hdlTex;
python dataDaysChallenge_BIGNet.py

Inputs:

  1. "./data/divar_posts_dataset.csv" # original dataset path, train set
  2. "./data/phase_2_dataset.csv" # phase 2 dataset path, test set
  3. './fastText/*.vec' # pre-trained fastText embedding vectors
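The .vec input follows the standard fastText text format: a "num_words dim" header line, then one "word v1 ... v300" line per word. A minimal sketch of turning it into an embedding matrix (the file name and the tiny word_index dictionary are hypothetical; in the real pipeline the dictionary is the one saved to ./wordDict.json):

import numpy as np

EMBEDDING_DIM = 300
word_index = {"خودرو": 1, "موبایل": 2}  # hypothetical word -> id mapping

embeddings = {}
with open("./fastText/cc.fa.300.vec", encoding="utf-8") as f:  # hypothetical name
    next(f)  # skip the "num_words dim" header
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Row i of the matrix holds the vector for the word with id i; row 0 and
# out-of-vocabulary words stay all-zero.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in embeddings:
        embedding_matrix[i] = embeddings[word]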

Outputs:

  1. './wordDict.json' # where to save the extracted words dictionary
  2. './dataset/' # where to export processed files for later usage
  3. './dataChallenge/' # where to save processed files for phase2 Dataset
  4. './resultsChallenge2.csv' # where to save results
  5. './resultsChallenge2Inputs.csv' # where to save results and inputs
  6. './resultsChallenge2FixLevels.csv' # where to save fixed results, generally better performance (see the sketch after this list)
  7. './resultsChallenge2FixLevelsALL.csv' # Check all the samples hierarchy (L1,L2,L3)
  8. './table.html' # where to save all the results alongside inputs for visual judgments
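"Fixed results" in output 6 is, on my reading of the file name, about forcing each predicted (cat1, cat2, cat3) triple to be one that actually occurs in the Train set. A sketch under that assumption (the fallback rule, snapping to the most frequent Train triple under the predicted cat1, is hypothetical, not the repository's actual logic):

import pandas as pd

cols = ["cat1", "cat2", "cat3"]
train = pd.read_csv("./data/divar_posts_dataset.csv")
pred = pd.read_csv("./resultsChallenge2.csv")

# Triples seen in the Train set define the legal hierarchy.
valid_triples = set(map(tuple, train[cols].dropna().values))

def fix_row(row):
    if (row["cat1"], row["cat2"], row["cat3"]) in valid_triples:
        return row
    # Hypothetical fallback: keep cat1, snap the lower levels to the
    # most frequent Train triple under that cat1.
    candidates = train.loc[train["cat1"] == row["cat1"], cols].dropna()
    if len(candidates):
        row["cat1"], row["cat2"], row["cat3"] = (
            candidates.groupby(cols).size().idxmax())
    return row

fixed = pred.apply(fix_row, axis=1)  # sketch of the fixed-results output
fixed.to_csv("./resultsChallenge2FixLevels.csv", index=False)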

Resources:

  1. HDLTex
  2. GloVe
  3. CafeBazaar Persian Divar Dataset
  4. HDLTex: Hierarchical Deep Learning for Text Classification
  5. FastText Embedding Vectors
  6. Persian NLP
  7. Datadays 2019
  8. Keras Attention Mechanism