Find Twitter Trends using Hashtags
In our last Tikal Fuseday, we split into several groups to build data analytics applications. As in previous Fusedays, the idea was to learn and use many technologies by creating a working “crash application” from scratch in just a few hours of work.
Our team had six software developers plus two DevOps developers, and we set our goal in the morning: “Create a simple application that analyzes Twitter hashtags”. We processed a big file filled with tweets (simulating a live tweet stream) by extracting the date and hashtags from each tweet, accumulating the hashtag counts over short intervals (1 second each), and emitting the aggregations into an analytics store. A separate web application then provided an API for displaying the results in a bar chart, as in the figure below.
As the graph above shows, the application reveals the highest-trending hashtags of the last few minutes.
We decided to create two decoupled services: a Hashtags Processor, which holds the logic for handling the stream, and a web application service, which reads the stored results upon user requests. Here is the context view diagram:
We chose Storm as our real-time processor. The following diagram shows the Storm topology we created:
We created a TweetsFileSpout that reads the tweets from the file. Then we created a tokenizer bolt that extracts the date and hashtags from every tweet. The next bolt is the aggregator bolt, which subscribes to the tokenizer bolt using a “fields grouping” on the hashtag, making sure all identical hashtags are accumulated in the same bolt “task” (instance). The aggregator bolt accumulates hashtag counts in a map keyed by tweet interval, and upon a “tick” tuple emitted by Storm (configured to every 3 seconds), it emits all accumulated hashtag counts to the last bolt, the persister bolt.
The persister bolt saves all hashtag counts to Redis. Redis is a key-value store whose values can be one of several data types. In our Redis database, the key is the time interval, while the value is a Redis “hash” that maps each hashtag to its accumulated count in that interval (see the next figure for the data model in Redis):
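To make the tokenizer and aggregator steps more concrete, here is a minimal, Storm-free sketch of that logic in plain Java. It is not the code we wrote on the Fuseday: the class name, the method names, and the `#(\w+)` hashtag pattern are assumptions made for illustration, and `flush()` merely stands in for what the aggregator bolt would do when a tick tuple arrives.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the tokenizer + aggregator logic, without any Storm dependency. */
class HashtagAggregatorSketch {
    // Assumed hashtag syntax: '#' followed by one or more word characters.
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    // interval (epoch seconds) -> hashtag -> count, mirroring the Redis key/hash model
    private final Map<Long, Map<String, Integer>> counts = new TreeMap<>();

    /** Tokenizer step: extract the hashtags (without '#') from the tweet text. */
    static List<String> extractHashtags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            tags.add(m.group(1));
        }
        return tags;
    }

    /** Aggregator step: count each hashtag under the tweet's 1-second interval. */
    void accept(long tweetEpochMillis, String text) {
        long interval = tweetEpochMillis / 1000; // truncate to whole seconds
        for (String tag : extractHashtags(text)) {
            counts.computeIfAbsent(interval, k -> new HashMap<>())
                  .merge(tag, 1, Integer::sum);
        }
    }

    /** Tick step: return everything accumulated so far and start over. */
    Map<Long, Map<String, Integer>> flush() {
        Map<Long, Map<String, Integer>> snapshot = new TreeMap<>(counts);
        counts.clear();
        return snapshot;
    }

    public static void main(String[] args) {
        HashtagAggregatorSketch aggregator = new HashtagAggregatorSketch();
        aggregator.accept(1000L, "learning #storm and #redis");
        aggregator.accept(1500L, "more #storm");
        aggregator.accept(2100L, "#redis again");
        // Each outer entry corresponds to one Redis key (the interval) and its hash.
        System.out.println(aggregator.flush());
    }
}
```

Each entry of the flushed map lines up with one Redis key and its hash, so a persister could write it field by field. Because one task owns a given hashtag under fields grouping, no cross-task merging is needed in this sketch.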
We created a simple Spring Boot application that exposes a single REST API: get the most trending hashtags of the last X seconds. The result is a JSON array with the latest hashtag counts.
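The query behind that API can be sketched as follows. This is an illustration under assumptions rather than the service's actual code: an in-memory map stands in for Redis, the window is taken to be the last X one-second intervals ending at "now" (inclusive), and ties are broken alphabetically. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the "last X seconds" query over the interval -> hash model. */
class TrendingQuerySketch {

    /**
     * Sums the per-interval counts for the last `lastSeconds` intervals ending at
     * `nowSeconds` (inclusive) and returns the hashtags sorted by total count, descending.
     */
    static List<Map.Entry<String, Integer>> trending(
            Map<Long, Map<String, Integer>> store, long nowSeconds, int lastSeconds) {
        Map<String, Integer> totals = new HashMap<>();
        for (long t = nowSeconds - lastSeconds + 1; t <= nowSeconds; t++) {
            // A missing interval simply contributes nothing.
            Map<String, Integer> hash =
                    store.getOrDefault(t, Collections.<String, Integer>emptyMap());
            hash.forEach((tag, count) -> totals.merge(tag, count, Integer::sum));
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(totals.entrySet());
        result.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // highest count first
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Map<String, Integer>> store = new HashMap<>();
        store.put(9L, Collections.singletonMap("redis", 3));
        store.put(10L, Collections.singletonMap("storm", 2));
        System.out.println(trending(store, 10L, 5));
    }
}
```

In the real service, each loop iteration would correspond to reading one interval's hash from Redis, and the sorted list would be serialized to the JSON array returned to the UI.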
This was a JS application implemented with Angular and D3. It calls the hashtag service and displays the results in a bar chart, showing the most popular hashtags in the last X seconds.
Running the application
After downloading and building both the processor and the service applications, follow these steps:
- Run Redis
- Run the Processor service with the following command:
java -jar thashtags-1.0.0-jar-with-dependencies.jar <YOUR TWEETS FILE> <REDIS IP>
- Run the web application service:
java -jar ./target/demo-0.0.1-SNAPSHOT.jar
- Query the REST API at http://localhost:8080/lastTweets/seconds/5
- See the displayed graph at http://localhost:8080/index.html
The code for the Hashtag Processor and Service applications was committed to GitHub. We enjoyed playing with data analytics and streaming technologies, and were happy to deliver a very initial, but working, application.