Apache Storm – A More Real Streaming Engine than Apache Spark Streaming

Apache Storm Vs Apache Spark Streaming:

Apache Storm – real time up to a sub-second level and is event based
Apache Spark Streaming – real time only up to a second level and is micro-batch processing based.

Apache Storm Architecture and terminologies:

Its used very specific terminologies – spout and bolt
Spout is the stream receiver and bolt is the stream processor.
Exposes 2 kinds of APIs – Storm Core and Trident and works very nicely with Kafka.
Unlike Spark Streaming it provides something called “tumbling window”(they don’t overlap and e.g. if we have 3 separate windows of 5s then they are 3 different 5 sec windows ) which is unlike “sliding window”(sliding windows can overlap)

Lets play with Apache Storm:

Apache Storm comes pre-installed on Hortonworks so lets start Storm and Kafka.
Now lets login to the Ambari Console.
This files take a stream of sentences and counts the word by updating a map.
location of Apache Storm client : cd /usr/hdp/current/storm-client
Sample Examples location – cd contrib/storm-starter/src/jvm/org/apache/storm/starter
lets refer to this file – https://testbucket786786.s3.amazonaws.com/WordCountTopology.java
Lets run the file – storm jar /usr/hdp/current/storm-client/contrib/storm-starter/storm-starter-topologies-*.jar org.apache.storm.starter.WordCountTopology wordcount
lets it run and lets go to the browser to see the Storm UI(rungs on port 8744) – http://127.0.0.1:8744
Now lets see if what dis it process in WordTopology.
Let check the logs now at : cd /usr/hdp/current/storm-client/logs/workers-artifacts
In the exercise the data was dumped to the log. We could have even written this to HDFS or some other source