
JIANG-Wu-19/NLP_project


This is my NLP project, which includes several sub-projects, all written in Python.


Text classification and keyword extraction based on abstracts

relative link: Text classification and keyword extraction based on abstracts

This is my first NLP project: not perfect, but interesting.

  • note contains my Markdown notes.

  • baseline1 is the traditional baseline of the project. It runs on Baidu AI Studio (relative link); this is the local version.

  • NLP_baseline is a series of baselines that try different classifiers, including Logistic Regression, Support Vector Machine, and Random Forest. Based on the classifiers above, the parameters are fine-tuned with parameter_tuning.py and baseline_tuning.py.

    According to the score given by the platform, the fine-tuned Logistic Regression model (AKA the fine-tuned baseline) performs best so far, reaching 0.99401.

    The organizers provided another dataset, testB.csv, on July 24th. It removes the Keywords column, so I updated baseline2 to baseline3 to handle the new dataset.

  • NLP_upper is the upper project. It uses the BERT model from transformers to solve the classification problem.

    Regretfully, my local environment couldn't support the project (my poor GTX 1650 with 4 GB of VRAM).

    SOLUTION: run the project on Alibaba Cloud (not successful yet, but it's still a good option).

    However, this project ran for 26 epochs before I stopped the interpreter, and the score was unsatisfactory (maybe overfitting).

    With epoch=10, the model works well, with accuracy reaching 0.9850 (for task 1).

    The latest version of NLP_upper is complete. It uses the BERT model to solve two tasks, compared with only one in the last version. The result is quite good, if a bit late :).

  • NLP_chatGLM is the project using an LLM, leveraging ChatGLM when the connection is stable. However, using the API can cause a problem: input containing sensitive words stops the program, which emphasizes the value of training and running the LLM locally.
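
The classifier comparison and tuning described for NLP_baseline can be sketched roughly as follows. This is a minimal stand-in, not the repo's actual scripts: the toy abstracts, labels, and parameter grid are invented for illustration.

```python
# Sketch of the NLP_baseline idea: compare several classifiers on TF-IDF
# features, then tune the best one with a grid search. The toy corpus and
# parameter grid below are illustrative stand-ins, not the project's data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

abstracts = [
    "deep learning for protein structure prediction",
    "convolutional networks for image classification",
    "stock market price forecasting with regression",
    "bank credit risk models and loan default rates",
] * 5
labels = [1, 1, 0, 0] * 5  # 1 = computer science, 0 = finance (toy labels)

classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "rf": RandomForestClassifier(n_estimators=50),
}

# Compare the three classifiers on the same TF-IDF features.
for name, clf in classifiers.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    pipe.fit(abstracts, labels)
    print(name, pipe.score(abstracts, labels))

# Fine-tune the Logistic Regression baseline, in the spirit of
# parameter_tuning.py (the grid here is a guess at typical values).
tuned = GridSearchCV(
    Pipeline([("tfidf", TfidfVectorizer()),
              ("clf", LogisticRegression(max_iter=1000))]),
    {"clf__C": [0.1, 1.0, 10.0]},
    cv=2,
)
tuned.fit(abstracts, labels)
print("best C:", tuned.best_params_["clf__C"])
```

Keeping the vectorizer inside the pipeline matters during grid search: it ensures TF-IDF is refit on each training fold, so the cross-validation scores aren't inflated by test-fold vocabulary.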

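The NLP_upper fine-tuning loop can be sketched minimally in PyTorch. To keep the sketch self-contained (no model download, no GPU), random vectors stand in for the real BERT pooled [CLS] embeddings; hidden=768 mirrors bert-base, and in the real project the encoder itself is updated too.

```python
# Minimal sketch of fine-tuning a classification head on top of BERT pooled
# embeddings. Random vectors stand in for the real BERT [CLS] outputs.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden, n_classes, n_samples = 768, 2, 64

# Stand-in for BERT pooled outputs and their labels.
embeddings = torch.randn(n_samples, hidden)
labels = torch.randint(0, n_classes, (n_samples,))

head = nn.Linear(hidden, n_classes)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

losses = []
for epoch in range(10):  # epoch=10 worked well in the project
    optimizer.zero_grad()
    loss = loss_fn(head(embeddings), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Capping the epoch count, as the note about 26 epochs vs. epoch=10 suggests, is the simplest guard against overfitting; early stopping on a validation split is the usual refinement.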

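The sensitive-word failure mode that stops the NLP_chatGLM program can be handled defensively with a small wrapper. The sketch below uses a hypothetical `call_api` callable and a fake client in place of the real ChatGLM API, which is not reproduced here.

```python
# Sketch of handling the failure mode described above: an LLM API call can
# abort when the input trips a sensitive-word filter. Instead of crashing,
# return a fallback label. `call_api` is a hypothetical stand-in client.
def classify_with_llm(call_api, text, fallback="unknown"):
    """Return the API's label, or a fallback label if the call is rejected."""
    try:
        return call_api(text)
    except RuntimeError:  # e.g. provider rejects "sensitive" input
        return fallback

# Fake client, for demonstration only.
def fake_chatglm(text):
    if "forbidden" in text:
        raise RuntimeError("input blocked by content filter")
    return "science" if "neural" in text else "other"

print(classify_with_llm(fake_chatglm, "neural networks for vision"))  # science
print(classify_with_llm(fake_chatglm, "a forbidden topic"))           # unknown
```

A fallback label keeps a batch job running, but as noted above, a locally hosted model is the only way to avoid the filter entirely.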
ChatGPT-generated Text Tester

relative link: ChatGPT-generated Text Tester

This is a program that identifies whether a given text was generated by GPT.

  • note contains my Markdown notes.

  • baseline is the baseline of this sub-project. It achieves an average level using Logistic Regression.

  • upper is the upper project, using TF-IDF features to classify the content.

  • bert is another solution, using the BERT model, and it's the best model up to now.

  • chatGLM_api is a failed project,but it's not meaningless.

    For one thing, the LLM performs well at classification; for another, using the API is not a good idea. From my point of view, the solution is to build a training set and fine-tune the LLM on a GPU.

  • ernie performs best. It uses the Ernie model in the Paddle environment and runs on AI Studio. Set epochs=100 and run all cells.
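
The TF-IDF approach of the upper sub-project can be sketched as below. The four training snippets and their labels are invented placeholders; the real project trains on the competition dataset.

```python
# Sketch of the `upper` idea: TF-IDF features + Logistic Regression to
# separate human-written from GPT-generated text. The snippets are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "As an AI language model, I can provide a comprehensive overview.",
    "In conclusion, it is important to note that both options have merits.",
    "lol that movie was great, we should go again next week",
    "meeting moved to 3pm, bring the printouts pls",
] * 5
labels = [1, 1, 0, 0] * 5  # 1 = GPT-generated, 0 = human (toy labels)

# Word unigrams and bigrams pick up stock GPT phrasings like
# "important to note" alongside single-word cues.
vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

print("train accuracy:", clf.score(X, labels))
print("prediction:", clf.predict(vec.transform(
    ["It is important to note that this approach has merits."]))[0])
```

This kind of lexical model is cheap and surprisingly strong on formulaic generated text, which is why it makes a sensible step between the baseline and the BERT/Ernie solutions.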

To be continued...
