CloudProject

Term Project for COMP7305 Cluster and Cloud Computing.

Environment

Title: Realtime Twitter Stream Analysis System

Developed By:

Project Structure

.
├── CloudWeb
├── Collector
├── HBaser
├── StreamProcessorFlink
├── StreamProcessorSpark
└── pom.xml (Maven parent POM)

  • CloudWeb:
    • Display statistics and sentiment analysis results.
  • Collector:
    • Collect real-time data via the Twitter Python API.
    • Transport the data to Kafka via Flume.
  • StreamProcessorFlink:
    • Compute statistics across different dimensions.
  • StreamProcessorSpark:
    • Train and generate the Naive Bayes model.
    • Analyze the sentiment of tweets.

Cluster Website

Requires a connection to the CS VPN.

Spring Boot WebSite

Ganglia Cluster Monitor

Namenode INFO

Hadoop Application

Hadoop JobHistory

Spark History Server

Flink Server

Project Documents

Proposal

Meeting Record

Presentation PPT

Geo Query API

Environment Information

Related Projects

Spark-MLlib-Twitter-Sentiment-Analysis

flume_kafka_spark_solr_hive

corenlp-scala-examples

deeplearning4j

canvas-barrage

Data

Training data

Data flow:

Flume -> Kafka -> Spark Streaming -> Kafka
               -> Flink           -> Kafka
               -> DL4J            -> Kafka

Flume transports the Twitter data into topic alex1, which Spark, Flink, and DL4J subscribe to.

Spark Streaming reads topic alex1, performs sentiment analysis, and stores the results in topic twitter-result1 for the web front end to subscribe to.

The DL4J component inside CloudWeb reads topic alex1 and performs DL4J sentiment analysis; the results are not stored but pushed directly to the route watched by the WebSocket, for the web front end to subscribe to.

Flink reads topic alex1 and performs statistical analysis (a consumer sketch for checking these result topics follows this list):

  • Twitter language statistics are stored in topic twitter-flink-lang for the web front end to subscribe to.
  • Twitter user fans statistics are stored in topic twitter-flink-fans for the web front end to subscribe to.
  • Twitter user geo statistics are stored in topic twitter-flink-geo for the web front end to subscribe to.
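
To spot-check that the pipeline is producing data, the result topics can be tailed from the command line. This is only a minimal sketch: the script location and the broker address localhost:9092 are assumptions and should be adjusted to the actual cluster.

# Assumed broker address; replace with the cluster's Kafka bootstrap servers.
BROKER=localhost:9092
# Tail the Spark sentiment results.
kafka-console-consumer.sh --bootstrap-server $BROKER --topic twitter-result1
# Tail the Flink statistics topics.
kafka-console-consumer.sh --bootstrap-server $BROKER --topic twitter-flink-lang
kafka-console-consumer.sh --bootstrap-server $BROKER --topic twitter-flink-fans
kafka-console-consumer.sh --bootstrap-server $BROKER --topic twitter-flink-geo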

Data formats:

  • Twitter metadata: twitter4j.Status
  • Sentiment analysis result: ID¦Name¦Text¦NLP¦MLlib¦DeepLearning¦Latitude¦Longitude¦Profile¦Date (see the sketch after this list)
  • Language statistics: {"pt":2,"ot":26,"ja":3,"en":453,"fr":12,"es":4,}
  • Fans statistics: 200|800|500~1000|above 1000
  • Map statistics: Latitude|Longitude|time
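
The sentiment record is a single line of fields separated by the broken-bar character. A quick way to pull out individual fields from a consumed message is sketched below; the sample values are hypothetical, and splitting on a multi-byte delimiter assumes gawk running in a UTF-8 locale.

# Print the ID and the NLP / MLlib / DeepLearning labels from one sentiment record (fields split on '¦').
echo '123¦alice¦hello world¦POSITIVE¦POSITIVE¦NEGATIVE¦22.28¦114.16¦http://img¦2020-04-01' \
  | awk -F'¦' '{print $1, $4, $5, $6}'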

Operation & Conf

  • Kafka operations (example commands follow this list)

  • Flume conf

  • HDFS configuration paths

    • Naive Bayes model path: /tweets_sentiment/NBModel/

    • Naive Bayes training/test file path: /data/training.1600000.processed.noemoticon.csv

    • Stanford CoreNLP model path: the stanford-corenlp-models artifact among the Maven dependencies

    • Deep Learning model & word vector path: /tweets_sentiment/dl4j/
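
The commands below sketch typical Kafka operations and HDFS checks for the paths above. They are assumptions rather than the project's exact commands: the broker address, partition and replication counts need to match the cluster, and older Kafka versions require --zookeeper instead of --bootstrap-server for kafka-topics.sh.

# Create the input topic (partition and replication counts are assumptions).
kafka-topics.sh --create --bootstrap-server localhost:9092 --topic alex1 --partitions 3 --replication-factor 2
# List topics to confirm creation.
kafka-topics.sh --list --bootstrap-server localhost:9092
# Smoke-test produce and consume on the topic.
echo "hello" | kafka-console-producer.sh --broker-list localhost:9092 --topic alex1
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic alex1 --from-beginning --max-messages 1
# Verify the model and data paths exist on HDFS.
hdfs dfs -ls /tweets_sentiment/NBModel/
hdfs dfs -ls /data/training.1600000.processed.noemoticon.csv
hdfs dfs -ls /tweets_sentiment/dl4j/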

Precondition

  1. Prepare the Kafka environment. We have set up Kafka on 9 machines in total. Create the Kafka topics and make sure we can produce to and consume from them.

  2. Prepare the Flume environment. We have set up Flume on GPU7.

  3. Generate the model for the machine learning lib: run the command below and put the model on HDFS.

spark-submit --class "hk.hku.spark.mllib.SparkNaiveBayesModelCreator" --master local[3] /opt/spark-twitter/7305CloudProject/StreamProcessorSpark/target/StreamProcessorSpark-jar-with-dependencies.jar

  4. The model for CoreNLP is stored in NLP-jars, so we only need to make sure the Maven dependency is complete.

  5. The model for DL4J is stored in the project, so we only need to make sure the word vectors (stored on HDFS) are complete.

  6. The CloudWeb Spring Boot project includes DL4J, which makes it need about 6 GB of memory. Keep this in mind.

  7. Log in to the GPU machine, cd /opt/spark-twitter/7305CloudProject, git pull the code, and compile the project with Maven (see the sketch after this list).
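
A minimal sketch of the update-and-build step. The exact Maven goal is an assumption (the parent POM may call for package or install), and skipping tests is only a convenience for redeployment.

cd /opt/spark-twitter/7305CloudProject
git pull
# Build all modules from the parent POM.
mvn clean package -DskipTests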

Run

  1. Start Flume to collect Twitter data and transport it into Kafka.
# read boot_flume_sh
nohup flume-ng agent -f /opt/spark-twitter/7305CloudProject/Collector/TwitterToKafka.conf -Dflume.root.logger=DEBUG,console -n a1 >> flume.log 2>&1 &

# make sure data has been produced into kafka topic successfully

We can start 3 or more Flume processes to test the cluster performance.
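
One way to confirm that messages are actually landing in topic alex1 is to watch the partition offsets grow. A sketch only; it assumes the standard Kafka tool scripts and a broker at localhost:9092 (newer Kafka versions ship kafka-get-offsets.sh instead).

# Print the latest offset of every partition of alex1; rerun and compare to see whether data is flowing.
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic alex1 --time -1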

  2. Start Spark Streaming to analyze tweet text sentiment using Stanford NLP & Naive Bayes.
Local mode:
spark-submit --class "hk.hku.spark.TweetSentimentAnalyzer" --master local[3] /opt/spark-twitter/7305CloudProject/StreamProcessorSpark/target/StreamProcessorSpark-jar-with-dependencies.jar

Cluster mode:
spark-submit --class "hk.hku.spark.TweetSentimentAnalyzer" --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 4g --executor-cores 4 --driver-memory 4g --conf spark.kryoserializer.buffer.max=2048 --conf spark.yarn.executor.memoryOverhead=2048 /opt/spark-twitter/7305CloudProject/StreamProcessorSpark/target/StreamProcessorSpark-jar-with-dependencies.jar

Watch the Spark Status on the Spark History Server website.
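
The running application can also be checked from the command line. A sketch, assuming the job was submitted to YARN in cluster mode:

# List running YARN applications and look for TweetSentimentAnalyzer.
yarn application -list -appStates RUNNING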

  3. Start CloudWeb to show the results on the website.
cd /opt/spark-twitter/7305CloudProject/CloudWeb/target
nohup java -Xmx3072m -jar /opt/spark-twitter/7305CloudProject/CloudWeb/target/CloudWeb-1.0-SNAPSHOT.jar & 

  4. Start Flink to compute the statistics.
flink run /opt/spark-twitter/7305CloudProject/StreamProcessorFlink/target/StreamProcessorFlink-1.0-SNAPSHOT.jar

Watch the Flink status on the Flink History Server website.
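
The Flink CLI can also confirm that the job is running; a sketch assuming the same flink binary used above:

# Show running and scheduled Flink jobs.
flink list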

Common Issues and Solutions

  1. Modify Zsh Environment
# When using zsh, custom environment variables need to be modified here:
vi ~/sh/env_zsh 
# then restart the shell
  2. CloudWeb fails to start

The most likely reason is that the machine does not have enough free memory. Use the top command to watch the status (a sketch follows).
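
A quick memory check before restarting CloudWeb, using standard Linux tools:

# Show total/used/free memory in human-readable units.
free -h
# Watch per-process CPU and memory usage interactively.
top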
