Data Analysis of Lagou Job
This repository holds the code for analyzing job data from Lagou. Its main functions are:
- Crawl job data from Lagou to get the latest listings for Internet-industry jobs.
- Collect proxies from XiCiDaiLi.
- Analyze and visualize the data.
- Crawl job detail pages and generate a word cloud as a "Job Impression".
- Store interviewees' comments in MongoDB so they can later be used to train a machine-learning NLP model.
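
As a rough illustration of the hot-word counting behind the "Job Impression" word cloud, here is a minimal sketch. It is not the code from hot_words_generator.py: the names `tokenize`, `top_hot_words` and `STOP_WORDS` are hypothetical, and the regex tokenizer is only a stand-in for a proper Chinese word segmenter, which real Lagou job descriptions would need.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real run would use a much fuller one.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "with", "for"}


def tokenize(text):
    """Stand-in tokenizer: lowercase, then split on anything that is not
    a digit, an ASCII letter, '+' or '#' (so 'C++' and 'C#' survive)."""
    return [t for t in re.split(r"[^0-9a-z+#]+", text.lower()) if t]


def top_hot_words(descriptions, n=30):
    """Return the n most frequent non-stop-word tokens as (word, count) pairs."""
    counts = Counter()
    for text in descriptions:
        counts.update(t for t in tokenize(text) if t not in STOP_WORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up sample descriptions, for illustration only.
    sample = [
        "Familiar with Python and machine learning",
        "Experience with Python, C++ and SQL",
    ]
    for word, count in top_hot_words(sample, n=5):
        print(word, count)
```

The (word, count) pairs can be turned into a dict and handed to a word cloud renderer; swapping in a real segmenter only requires replacing `tokenize`.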
- Install third-party libraries:
  sudo pip3 install -r requirements.txt
- Install MongoDB and start the MongoDB service [optional]:
  sudo service mongod start
- Clone this project from GitHub.
- Lagou's anti-spider strategy has been upgraded frequently of late. I suggest running proxy_crawler.py to collect IP proxies and executing the code with PhantomJS.
- Run m_lagou_spider.py to crawl job data; it generates a collection of Excel files in the ./data directory.
- Run hot_words_generator.py to segment the text into words; it returns the top 30 hot words and a word cloud figure.
- For technical details, please refer to my answer at Zhihu.
- The PDF report can be downloaded from here.
- [V2.0] - 2019.04. Upgraded to PhantomJS and IP proxies.
- [V1.2] - 2017.05. Rewrote WordCloud visualization module.
- [V1.0] - 2017.04. Upgraded to mobile Lagou.
- [V0.8] - 2016.05. Finished Lagou PC web spider.
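
The usage notes above suggest collecting IP proxies with proxy_crawler.py before crawling. The sketch below shows one way such a list might be rotated between requests. It is an assumption-laden illustration, not the project's implementation: the helpers `make_proxy_cycle` and `build_opener_for` are hypothetical, the standard-library `urllib` stands in for PhantomJS, and the addresses are documentation placeholders rather than working proxies.

```python
import itertools
import urllib.request


def make_proxy_cycle(proxies):
    """Return an endless round-robin iterator over 'host:port' proxy strings."""
    if not proxies:
        raise ValueError("proxy list is empty")
    return itertools.cycle(proxies)


def build_opener_for(proxy):
    """Build a urllib opener that routes HTTP and HTTPS traffic through one proxy."""
    handler = urllib.request.ProxyHandler({
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    })
    return urllib.request.build_opener(handler)


if __name__ == "__main__":
    # Placeholder addresses from the RFC 5737 documentation range.
    cycle = make_proxy_cycle(["192.0.2.10:8080", "192.0.2.11:3128"])
    for _ in range(3):
        proxy = next(cycle)
        opener = build_opener_for(proxy)  # opener.open(url) would use this proxy
        print("next request would go through", proxy)
```

Round-robin rotation is the simplest policy; a crawler facing frequent blocks would likely also drop proxies that fail and retry the request through the next one.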