This demo was part of the technical webinar workshops "Real Time Monitoring with Hadoop" and "Search Workshop"
Slides and webinar recording are available at http://hortonworks.com/partners/learn/
The updated version of this demo that works with the HDP 2.2 sandbox is available here
Monitor the Twitter stream for S&P 500 companies to identify and act on unexpected increases in tweet volume
-
Ingest: Listen for Twitter streams related to S&P 500 companies
-
Processing:
- Monitor tweets for unexpected volume
- Volume thresholds managed in HBase
-
Persistence:
- HDFS (for future batch processing)
- Hive (for interactive query)
- HBase (for realtime alerts)
- Solr/Banana (for search and reports/dashboards)
-
Refine:
- Update threshold values based on historical analysis of tweet volumes
-
Demo setup:
- Either download and start the prebuilt VM, or
- start the HDP 2.1 sandbox and run the provided scripts to set up the demo
- Prebuilt VM option
- Setup demo manually option
- Kafka basics - optional
- Run demo to monitor Tweets about S&P 500 securities in realtime
- Observe results in HDFS, Hive, Solr/Banana, HBase
- Import data into BI tools - optional after copying into ORC table
- Other things to try - optional
- Reset demo
There is a prebuilt VM available with the Twitter streaming/search demo ready to run, which can be downloaded from here
-
Download the .ova file, import it into VMware Fusion, and start the VM
-
Make sure HDFS, Hive, HBase, and Storm are up via Ambari
-
Run the below to start Solr and submit the Storm topology to listen for Tweets
~/hdp21-twitter-demo/start_demo.sh
-
Run the below to start the Kafka producer that generates the Tweets using Twitter4J
~/hdp21-twitter-demo/kafkaproducer/runkafkaproducer.sh
-
Observe the results by following the steps here
These setup steps are only needed the first time
- Download HDP 2.1 sandbox VM image (Hortonworks_Sandbox_2.1.ova) from Hortonworks website
- Import Hortonworks_Sandbox_2.1.ova into VMware and configure its memory size to be at least 8 GB of RAM
- Find the IP address of the VM and add an entry to your machine's hosts file, e.g.
192.168.191.241 sandbox.hortonworks.com sandbox
- Connect to the VM via SSH (password hadoop)
ssh root@sandbox.hortonworks.com
- Pull latest code/scripts
git clone https://github.com/abajwa-hw/hdp21-twitter-demo.git
- This starts Ambari/HBase and installs Maven, Kafka, Solr, Banana, and Phoenix (may take 10 minutes)
/root/hdp21-twitter-demo/setup-demo.sh
source ~/.bashrc
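- Optional spot check that the installs landed where the later steps expect them (the Kafka and Phoenix paths below are simply the ones used later in this guide):
#verify kafka and phoenix are in the expected locations
ls /opt/kafka/latest/bin/kafka-server-start.sh
ls /root/phoenix-4.1.0-bin/hadoop2/bin/sqlline.py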
- Open Ambari at http://sandbox.hortonworks.com:8080 using admin/admin, make the below changes under HBase > Config, and then restart HBase
zookeeper.znode.parent=/hbase
hbase.regionserver.wal.codec=org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
- Start Storm via Ambari
- Twitter4J requires you to have a Twitter account and obtain developer keys by registering an "app". Create a Twitter account and an app, then get your consumer key/secret and access token/secret: https://apps.twitter.com > # > create new app > fill anything > create access tokens
- Then enter the 4 values into the file below in the sandbox
vi /root/hdp21-twitter-demo/kafkaproducer/twitter4j.properties
oauth.consumerKey=
oauth.consumerSecret=
oauth.accessToken=
oauth.accessTokenSecret=
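- Optionally, sanity check that all four values were filled in (this just greps the file you edited above):
#the four oauth properties should no longer be empty
grep oauth /root/hdp21-twitter-demo/kafkaproducer/twitter4j.properties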
#check if kafka already started
ps -ef | grep kafka
#if not, start kafka
nohup /opt/kafka/latest/bin/kafka-server-start.sh /opt/kafka/latest/config/server.properties &
#create topic
/opt/kafka/latest/bin/kafka-topics.sh --create --zookeeper sandbox.hortonworks.com:2181 --replication-factor 1 --partitions 1 --topic test
#list topic
/opt/kafka/latest/bin/kafka-topics.sh --zookeeper sandbox.hortonworks.com:2181 --list | grep test
#start a producer and enter text on few lines
/opt/kafka/latest/bin/kafka-console-producer.sh --broker-list sandbox.hortonworks.com:9092 --topic test
#start a consumer in a new terminal; your text appears in the consumer
/opt/kafka/latest/bin/kafka-console-consumer.sh --zookeeper sandbox.hortonworks.com:2181 --topic test --from-beginning
#delete topic
/opt/kafka/latest/bin/kafka-run-class.sh kafka.admin.DeleteTopicCommand --zookeeper sandbox.hortonworks.com:2181 --topic test
-
Review the list of stock symbols whose Twitter mentions we will be tracking http://en.wikipedia.org/wiki/List_of_S%26P_500_companies
-
Generate the securities CSV from the above page and review the generated securities.csv. The last field is the generated tweet volume threshold
/root/hdp21-twitter-demo/fetchSecuritiesList/rungeneratecsv.sh
cat /root/hdp21-twitter-demo/fetchSecuritiesList/securities.csv
- Optional step for future runs: you can add other stocks/trending topics to the CSV to speed up tweet volume (no trailing spaces). Find these at http://mobile.twitter.com/trends
sed -i '1i$HDP,Hortonworks,Technology,Technology,Santa Clara CA,0000000001,5' /root/hdp21-twitter-demo/fetchSecuritiesList/securities.csv
sed -i '1i#mtvstars,MTV Stars,Entertainment,Entertainment,Hollywood CA,0000000001,40' /root/hdp21-twitter-demo/fetchSecuritiesList/securities.csv
- Open a connection to HBase via Phoenix and check that you can list tables
/root/phoenix-4.1.0-bin/hadoop2/bin/sqlline.py sandbox.hortonworks.com:2181:/hbase
!tables
!q
- Make sure HBase is up via Ambari and create the HBase tables using the CSV data, with placeholder tweet volume thresholds
/root/hdp21-twitter-demo/fetchSecuritiesList/runcreatehbasetables.sh
- Notice the securities data was imported and the alerts table is empty
/root/phoenix-4.1.0-bin/hadoop2/bin/sqlline.py sandbox.hortonworks.com:2181:/hbase
select * from securities;
select * from alerts;
!q
- Create the Hive table where we will store the tweets for later analysis
hive -f /root/hdp21-twitter-demo/stormtwitter-mvn/twitter.sql
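- Optionally confirm the table was created before starting the topology (a quick check, assuming the table lives in the default database under the name shown in the Hue link later in this guide):
#inspect the columns of the tweets table
hive -e 'describe tweets_text_partition'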
- Ensure Storm is started, and then start the Storm topology to generate alerts into an HBase table for stocks whose tweet volume is higher than the threshold; this will also write tweets into Hive/HDFS/local disk/Solr/Banana. The first time you run the below, Maven will take about 15 minutes to download dependent jars
cd /root/hdp21-twitter-demo/stormtwitter-mvn
./runtopology.sh
- Other modes the topology can be started in on future runs, if you want to clean the setup or run it locally (not on the Storm instance running on the sandbox):
./runtopology.sh runOnCluster clean
./runtopology.sh runLocally skipclean
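- Either way, you can also confirm from the command line that the topology is running (the topology name below is the one used by the kill command later in this guide):
#list running topologies; Twittertopology should show as ACTIVE
storm list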
-
Open the Storm UI and confirm the topology was created: http://sandbox.hortonworks.com:8744/
-
In a new terminal, compile and run the Kafka producer to generate tweets containing the first 400 stock symbol values from the CSV
/root/hdp21-twitter-demo/kafkaproducer/runkafkaproducer.sh
-
Open the Storm UI and drill into the topology to view statistics for each bolt; the 'Acked' columns should start increasing: http://sandbox.hortonworks.com:8744/
-
Open HDFS via Hue and see the tweets getting stored (note: not all tweets have longitude/latitude): http://sandbox.hortonworks.com:8000/filebrowser/#/tweets/staging
- Open the Hive table via Hue. Notice that tweets are being streamed into the Hive table that was created: http://sandbox.hortonworks.com:8000/beeswax/table/default/tweets_text_partition
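- Both can also be checked from the terminal (the HDFS path and table name are taken from the Hue links above; the count assumes the default Hive database):
#list the raw tweet files written by the topology
hdfs dfs -ls /tweets/staging
#count the tweets streamed into the Hive table so far
hive -e 'select count(*) from tweets_text_partition'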
- Open Banana UI and view/search tweet summary and alerts: http://sandbox.hortonworks.com:8983/banana
-
For more details on how the Banana dashboard panels are built, refer to the underlying JSON file that defines all the panels
-
In case you don't see any tweets, try changing to a different timeframe on the timeline (e.g. by clicking 24 hours, 7 days, etc.). If there is a time mismatch between the VM and your machine, the tweets may appear at a different place on the timeline than expected.
-
Run a query in Solr to look at tweets/hashtags/alerts. Click on 'Query' and enter a query under 'q'. Examples are doctype_s:tweet and text_t:AAPL. You can choose an output format under 'wt' and click 'Execute Query'. http://sandbox.hortonworks.com:8983/solr/#/tweets
-
You can also search using Solr's APIs. The below displays all alerts in JSON format http://sandbox.hortonworks.com:8983/solr/tweets/select?q=*%3A*&df=id&wt=json&fq=doctype_s:alert
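The same endpoint can be hit with curl from the sandbox; for example, the query below (using the same doctype_s field as above) pulls a few tweet documents rather than alerts, in JSON format:
#fetch 5 tweet documents from the tweets collection as JSON
curl "http://sandbox.hortonworks.com:8983/solr/tweets/select?q=doctype_s:tweet&wt=json&rows=5"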
- The SolrBolt code showing how the data gets into Solr is shown here
-
Open a connection to HBase via Phoenix and notice that alerts were generated
/root/phoenix-4.1.0-bin/hadoop2/bin/sqlline.py sandbox.hortonworks.com:2181:/hbase
select * from alerts;
- Notice the tweets written to the sandbox filesystem via the FileSystem bolt
vi /tmp/Tweets.xls
- Kill the Storm topology to stop processing tweets
storm kill Twittertopology
- To stop producing tweets, press Control-C in the terminal you ran runkafkaproducer.sh
- Create ORC table and copy the tweets over:
hive -f /root/hdp21-twitter-demo/stormtwitter-mvn/createORC.sql
- View the contents of the ORC table created: http://sandbox.hortonworks.com:8000/beeswax/table/default/tweets_orc_partition_single
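- As an alternative to Hue, a quick command-line check (assuming the default database):
#count the rows copied into the ORC table
hive -e 'select count(*) from tweets_orc_partition_single'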
- Grant SELECT access on the ORC table to user hive
hive -e 'grant SELECT on table tweets_orc_partition_single to user hive'
- On a Windows VM, create an ODBC data source called sandbox with the below settings:
Host=<IP address of sandbox VM>
port=10000
database=default
Hive Server type=Hive Server 2
Mechanism=User Name
UserName=hive
- Import data from the tweets_orc_partition_single table into Excel over ODBC: Data > From Other Sources > From Data Connection Wizard > ODBC DSN > sandbox > tweets_orc_partition_single > Finish > Yes > OK
- Instead of filtering on tweets for certain stocks/hashtags, you can also consume all tweets returned by the TwitterStream API and re-run runkafkaproducer.sh. Note that in this mode a large volume of tweets is generated, so you should stop the Kafka producer after 20-30 seconds to avoid overloading the system. It also may take a few minutes after stopping the Kafka producer before all the tweets show up in Banana/Hive.
mv /root/hdp21-twitter-demo/fetchSecuritiesList/securities.csv /root/hdp21-twitter-demo/fetchSecuritiesList/securities.csv.bak
- To filter tweets based on geography, open the below file, uncomment this line, and re-run runkafkaproducer.sh
vi /root/hdp21-twitter-demo/kafkaproducer/TestProducer.java
/root/hdp21-twitter-demo/kafkaproducer/runkafkaproducer.sh
- This empties out the demo-related HDFS folders, Hive table, Solr core, and Banana webapp, and stops the Storm topology
/root/hdp21-twitter-demo/reset-demo.sh
- If Kafka keeps sending your topology old tweets, you can also clear the Kafka queue
zookeeper-client
rmr /group1