GitHub - oluwayetty/Chinese-Word-Segmentation: Chinese-Word-Segmentation using wang2vec embeddings

Problem

This project implements a state-of-the-art Chinese word segmentation model in TensorFlow/Keras. The image below shows a summary of the project, the input, and the corresponding output. The BIES format is a way to encode the output of a word segmenter: for every input character, the model has to predict one of 4 classes (B, I, E or S).

Example in English

Input: This is a NLP project
Output: BIIE BE S BIE BIIIIIE

B means Beginning of a word
I means Inside (in the middle of) a word
E means End of a word
S means Single, i.e. a word made of exactly one character, such as ".", "a" or ","
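
The labelling rule above can be sketched in a few lines of Python. This is only an illustration of the BIES scheme, not the code from this repository: the function names `bies_encode` and `bies_decode` are hypothetical, and the sketch assumes the input is already split into words.

```python
from typing import List


def bies_encode(words: List[str]) -> List[str]:
    """Return one BIES label string per word of an already segmented sentence."""
    labels = []
    for word in words:
        if not word:
            continue  # skip empty tokens, e.g. from repeated separators
        if len(word) == 1:
            labels.append("S")  # single-character word
        else:
            # first character is B, last is E, everything in between is I
            labels.append("B" + "I" * (len(word) - 2) + "E")
    return labels


def bies_decode(chars: str, labels: str) -> List[str]:
    """Rebuild the words from unsegmented characters and one label per character."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    words, current = [], ""
    for ch, label in zip(chars, labels):
        current += ch
        if label in ("E", "S"):  # both labels close the current word
            words.append(current)
            current = ""
    if current:
        words.append(current)  # tolerate a sequence that does not end in E or S
    return words


words = "This is a NLP project".split()
labels = bies_encode(words)
print(" ".join(labels))
print(bies_decode("".join(words), "".join(labels)))
```

For the English sentence above, the first `print` reproduces the output shown in the example, and `bies_decode` recovers the original word list from the unsegmented characters. The same rule applies unchanged to Chinese text, where each character receives exactly one label.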

Dataset Description

I used the datasets that can be downloaded here. The full dataset contains four smaller datasets:

AS (Traditional Chinese)
CITYU (Traditional Chinese)
MSR (Simplified Chinese)
PKU (Simplified Chinese)

Note that you are responsible for converting the Traditional Chinese datasets to Simplified Chinese. To do so, install HanziConv and run the following command: hanzi-convert -s infile > outfile

Repository skeleton

- code               # this folder contains all the code related to this project
- resources          # this folder contains the best saved model and its weights 
- README.md          # this file
- Homework_1_nlp.pdf # the slides with the course homework instructions
- report.pdf         # my report, which analyzes the code and the results obtained

Link to paper - (https://www.aclweb.org/anthology/D18-1529/)
