Implemented a state-of-the-art word segmenter model in TensorFlow/Keras that operates on Chinese characters. The image below shows a summary of the project: the input and the corresponding output. The BIES format is a way to encode the output of a word segmenter model: every character is assigned one of 4 classes, and the model has to predict that class for each character.
Input: This is a NLP project
Output: BIIE BE S BIE BIIIIIE
- B means Beginning of a word
- I means Inside (in the middle of) a word
- E means End of a word
- S means Single, a word made of one character, for example ".", "a", or ","
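The tagging scheme above can be sketched in a few lines of Python. This is only an illustration of the BIES format, not the code from this repository; the function names `bies_tags`, `encode`, and `decode` are hypothetical, and the English sentence is used in place of Chinese text to match the example above.

```python
def bies_tags(word):
    """Return the BIES tag string for one word (hypothetical helper)."""
    if len(word) == 1:
        return "S"  # a one-character word is tagged Single
    # Begin, then one I per inner character, then End
    return "B" + "I" * (len(word) - 2) + "E"


def encode(sentence):
    """Tag a whitespace-segmented sentence, one tag group per word."""
    return " ".join(bies_tags(word) for word in sentence.split())


def decode(chars, tags):
    """Rebuild the words from unsegmented text and one tag per character."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # both tags close the current word
            words.append(current)
            current = ""
    if current:  # keep a trailing word that was never closed
        words.append(current)
    return words


print(encode("This is a NLP project"))  # BIIE BE S BIE BIIIIIE
print(decode("ThisisaNLPproject", "BIIEBESBIEBIIIIIE"))
```

The `decode` direction is what a segmenter needs at inference time: the model predicts one tag per character, and a word boundary is placed after every E or S.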
I made use of the datasets which can be downloaded here. The full dataset contains four smaller datasets:
- AS (Traditional Chinese)
- CITYU (Traditional Chinese)
- MSR (Simplified Chinese)
- PKU (Simplified Chinese)
Note that you need to convert the Traditional Chinese datasets to Simplified Chinese yourself: install HanziConv and run the following command:
hanzi-convert -s infile > outfile
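If you would rather do the conversion from Python than from the shell, a minimal sketch could look like the following. The helper `convert_file` and its parameters are hypothetical and not part of this repository; the conversion function is passed in as an argument so the helper itself does not depend on any particular library.

```python
def convert_file(src_path, dst_path, convert):
    """Apply `convert` to every line of src_path and write the result to dst_path.

    `convert` is any function mapping a string to a string.
    Returns the number of lines written.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(convert(line))
            count += 1
    return count
```

With HanziConv installed, passing `HanziConv.toSimplified` (imported via `from hanziconv import HanziConv`) as `convert` should give the same result as the command above. This assumes the dataset files are UTF-8 encoded.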
- code # this folder contains all the code related to this project
- resources # this folder contains the best saved model and its weights
- README.md # this file
- Homework_2_nlp.pdf # the slides for the course homework instruction
- report.pdf # my report, which analyzes the code and the results obtained
Link to paper: https://www.aclweb.org/anthology/D18-1529/