CloudProject

Term Project for COMP7305 Cluster and Cloud Computing.

Environment

Title: Realtime Twitter Stream Analysis System

Developed By:

Project Structure

.
├── CloudWeb
├── Collector
├── HBaser
├── StreamProcessorFlink
├── StreamProcessorSpark
└── pom.xml (Maven parent POM)

  • CloudWeb:
    • Display statistics and sentiment analysis results.
  • Collector:
    • Collect real-time data via the Twitter Python API.
    • Transport the data to Kafka via Flume.
  • StreamProcessorFlink:
    • Compute statistics across different dimensions.
  • StreamProcessorSpark:
    • Train and generate the Naive Bayes model.
    • Analyze the sentiment of tweets.

Cluster Website

Requires a connection to the CS VPN.

Spring Boot WebSite

Ganglia Cluster Monitor

Namenode INFO

Hadoop Application

Hadoop JobHistory

Spark History Server

Flink Server

Project Documents

Proposal

Meeting Record

Presentation PPT

Geo Query API

Environment Information

Related Projects

Spark-MLlib-Twitter-Sentiment-Analysis

flume_kafka_spark_solr_hive

corenlp-scala-examples

deeplearning4j

canvas-barrage

Data

Training data

Data flow:

Flume -> Kafka -> Spark Streaming -> Kafka
               -> Flink           -> Kafka
               -> DL4J            -> Kafka

Flume transports the Twitter data into topic alex1, which Spark, Flink, and DL4J subscribe to.

Spark Streaming reads topic alex1, performs sentiment analysis, and stores the results in topic twitter-result1 for the web front end to subscribe to.

The DL4J component inside CloudWeb reads topic alex1 and performs DL4J sentiment analysis; the results are not stored but pushed directly to the route watched by the WebSocket, for the web front end to subscribe to.

Flink reads topic alex1 and performs statistical analysis (a consumer sketch for checking these result topics follows this list):

  • Twitter language statistics are stored in topic twitter-flink-lang for the web front end to subscribe to.
  • Twitter user fans statistics are stored in topic twitter-flink-fans for the web front end to subscribe to.
  • Twitter user geo statistics are stored in topic twitter-flink-geo for the web front end to subscribe to.
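
To spot-check that the pipeline is producing data, the result topics can be tailed from the command line. This is only a minimal sketch: the script location and the broker address localhost:9092 are assumptions and should be adjusted to the actual cluster.

# Assumed broker address; replace with the cluster's Kafka bootstrap servers.
BROKER=localhost:9092
# Tail the Spark sentiment results.
kafka-console-consumer.sh --bootstrap-server $BROKER --topic twitter-result1
# Tail the Flink statistics topics.
kafka-console-consumer.sh --bootstrap-server $BROKER --topic twitter-flink-lang
kafka-console-consumer.sh --bootstrap-server $BROKER --topic twitter-flink-fans
kafka-console-consumer.sh --bootstrap-server $BROKER --topic twitter-flink-geo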

Data formats:

  • Twitter metadata: twitter4j.Status
  • Sentiment analysis result: ID¦Name¦Text¦NLP¦MLlib¦DeepLearning¦Latitude¦Longitude¦Profile¦Date (see the sketch after this list)
  • Language statistics: {"pt":2,"ot":26,"ja":3,"en":453,"fr":12,"es":4,}
  • Fans statistics: 200|800|500~1000|above 1000
  • Map statistics: Latitude|Longitude|time
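
The sentiment record is a single line of fields separated by the broken-bar character. A quick way to pull out individual fields from a consumed message is sketched below; the sample values are hypothetical, and splitting on a multi-byte delimiter assumes gawk running in a UTF-8 locale.

# Print the ID and the NLP / MLlib / DeepLearning labels from one sentiment record (fields split on '¦').
echo '123¦alice¦hello world¦POSITIVE¦POSITIVE¦NEGATIVE¦22.28¦114.16¦http://img¦2020-04-01' \
  | awk -F'¦' '{print $1, $4, $5, $6}'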

Operation & Conf

  • Kafka operations (example commands follow this list)

  • Flume conf

  • HDFS configuration paths

    • Naive Bayes model path: /tweets_sentiment/NBModel/

    • Naive Bayes training/test file path: /data/training.1600000.processed.noemoticon.csv

    • Stanford CoreNLP model path: the stanford-corenlp-models artifact among the Maven dependencies

    • Deep Learning model & word vector path: /tweets_sentiment/dl4j/
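
The commands below sketch typical Kafka operations and HDFS checks for the paths above. They are assumptions rather than the project's exact commands: the broker address, partition and replication counts need to match the cluster, and older Kafka versions require --zookeeper instead of --bootstrap-server for kafka-topics.sh.

# Create the input topic (partition and replication counts are assumptions).
kafka-topics.sh --create --bootstrap-server localhost:9092 --topic alex1 --partitions 3 --replication-factor 2
# List topics to confirm creation.
kafka-topics.sh --list --bootstrap-server localhost:9092
# Smoke-test produce and consume on the topic.
echo "hello" | kafka-console-producer.sh --broker-list localhost:9092 --topic alex1
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic alex1 --from-beginning --max-messages 1
# Verify the model and data paths exist on HDFS.
hdfs dfs -ls /tweets_sentiment/NBModel/
hdfs dfs -ls /data/training.1600000.processed.noemoticon.csv
hdfs dfs -ls /tweets_sentiment/dl4j/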

Precondition

  1. Prepare the Kafka environment. We have set up Kafka on 9 machines in total. Create the Kafka topics and make sure we can produce to and consume from them.

  2. Prepare the Flume environment. We have set up Flume on GPU7.

  3. Generate the model for the machine learning lib: run the command below and put the model on HDFS.

spark-submit --class "hk.hku.spark.mllib.SparkNaiveBayesModelCreator" --master local[3] /opt/spark-twitter/7305CloudProject/StreamProcessorSpark/target/StreamProcessorSpark-jar-with-dependencies.jar

  4. The model for CoreNLP is stored in NLP-jars, so we only need to make sure the Maven dependency is complete.

  5. The model for DL4J is stored in the project, so we only need to make sure the word vectors (stored on HDFS) are complete.

  6. The CloudWeb Spring Boot project includes DL4J, which makes it need about 6 GB of memory. Keep this in mind.

  7. Log in to the GPU machine, cd /opt/spark-twitter/7305CloudProject, git pull the code, and compile the project with Maven (see the sketch after this list).
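
A minimal sketch of the update-and-build step. The exact Maven goal is an assumption (the parent POM may call for package or install), and skipping tests is only a convenience for redeployment.

cd /opt/spark-twitter/7305CloudProject
git pull
# Build all modules from the parent POM.
mvn clean package -DskipTests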

Run

  1. Start Flume to collect Twitter data and transport it into Kafka.
# read boot_flume_sh
nohup flume-ng agent -f /opt/spark-twitter/7305CloudProject/Collector/TwitterToKafka.conf -Dflume.root.logger=DEBUG,console -n a1 >> flume.log 2>&1 &

# make sure data has been produced into kafka topic successfully

We can start 3 or more Flume processes to test the cluster performance.
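
One way to confirm that messages are actually landing in topic alex1 is to watch the partition offsets grow. A sketch only; it assumes the standard Kafka tool scripts and a broker at localhost:9092 (newer Kafka versions ship kafka-get-offsets.sh instead).

# Print the latest offset of every partition of alex1; rerun and compare to see whether data is flowing.
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic alex1 --time -1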

  2. Start Spark Streaming to analyze tweet text sentiment using Stanford NLP & Naive Bayes.
Local mode:
spark-submit --class "hk.hku.spark.TweetSentimentAnalyzer" --master local[3] /opt/spark-twitter/7305CloudProject/StreamProcessorSpark/target/StreamProcessorSpark-jar-with-dependencies.jar

Cluster mode:
spark-submit --class "hk.hku.spark.TweetSentimentAnalyzer" --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 4g --executor-cores 4 --driver-memory 4g --conf spark.kryoserializer.buffer.max=2048 --conf spark.yarn.executor.memoryOverhead=2048 /opt/spark-twitter/7305CloudProject/StreamProcessorSpark/target/StreamProcessorSpark-jar-with-dependencies.jar

Watch the Spark Status on the Spark History Server website.
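
The running application can also be checked from the command line. A sketch, assuming the job was submitted to YARN in cluster mode:

# List running YARN applications and look for TweetSentimentAnalyzer.
yarn application -list -appStates RUNNING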

  3. Start CloudWeb to show the results on the website.
cd /opt/spark-twitter/7305CloudProject/CloudWeb/target
nohup java -Xmx3072m -jar /opt/spark-twitter/7305CloudProject/CloudWeb/target/CloudWeb-1.0-SNAPSHOT.jar & 

  4. Start Flink to compute the statistics.
flink run /opt/spark-twitter/7305CloudProject/StreamProcessorFlink/target/StreamProcessorFlink-1.0-SNAPSHOT.jar

Watch the Flink status on the Flink History Server website.
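
The Flink CLI can also confirm that the job is running; a sketch assuming the same flink binary used above:

# Show running and scheduled Flink jobs.
flink list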

Common Issues and Solutions

  1. Modify Zsh Environment
# When using zsh, custom environment variables need to be modified here:
vi ~/sh/env_zsh 
# then restart the shell
  2. CloudWeb fails to start

The most likely reason is that the machine does not have enough free memory. Use the top command to watch the status (a sketch follows).
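
A quick memory check before restarting CloudWeb, using standard Linux tools:

# Show total/used/free memory in human-readable units.
free -h
# Watch per-process CPU and memory usage interactively.
top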
