Big Data

Apache Storm – A More Real Streaming Engine than Apache Spark Streaming

Laptop showing charts and KPIs
Photo: Carlos Muza / Unsplash · Royalty-free

Apache Storm Vs Apache Spark Streaming:

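  • Apache Storm – event based: each event is processed as it arrives, so latency can go down to the sub-second level.
  • Apache Spark Streaming – micro-batch based: events are collected into small batches before processing, so it is real time only down to roughly the second level.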
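
To make the difference concrete, here is a minimal sketch in plain Java. It uses no Storm or Spark APIs; the class name, method names and arrival times are made up for illustration, and processing cost is ignored. It only shows when an event becomes eligible for processing under each model, assuming a 1-second batch interval.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: plain Java, no Storm or Spark classes.
class ProcessingModelSketch {

    // Event based: each event is handled at the moment it arrives,
    // so the "processed at" time equals the arrival time.
    static List<Long> eventAtATime(long[] arrivalMillis) {
        List<Long> processedAt = new ArrayList<>();
        for (long t : arrivalMillis) {
            processedAt.add(t);
        }
        return processedAt;
    }

    // Micro-batch: an event waits until the end of the batch interval
    // it falls into, and the whole batch is handled together.
    static List<Long> microBatch(long[] arrivalMillis, long batchIntervalMillis) {
        List<Long> processedAt = new ArrayList<>();
        for (long t : arrivalMillis) {
            long batchEnd = (t / batchIntervalMillis + 1) * batchIntervalMillis;
            processedAt.add(batchEnd);
        }
        return processedAt;
    }

    public static void main(String[] args) {
        long[] arrivals = {100, 450, 990, 1200}; // made-up arrival times in ms
        System.out.println("event-at-a-time:  " + eventAtATime(arrivals));
        System.out.println("micro-batch (1s): " + microBatch(arrivals, 1000));
    }
}
```

In this toy model the event that arrives at 100 ms waits 900 ms in the micro-batch case, which is where the "second level" latency comes from.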

Apache Storm Architecture and Terminology:

  • It uses very specific terminology – spout and bolt.
  • A spout is the stream receiver (the source of the stream) and a bolt is the stream processor.
  • Exposes two kinds of APIs, Storm Core and Trident, and works very nicely with Kafka.
  • It provides something called a “tumbling window”: the windows don’t overlap, so e.g. three windows of 5s are three separate, back-to-back 5 sec windows. This is unlike a “sliding window”, where consecutive windows can overlap.
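
The two window types can be sketched in a few lines of plain Java. This is not Storm's windowing API; the class and method names are hypothetical and the numbers are chosen to match the 5-second example above. A tumbling window is treated here as the special case where the slide interval equals the window length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: computes window boundaries, no Storm classes.
class WindowSketch {

    // Returns the [start, end) boundaries, in seconds, of every full window of
    // the given length that fits in [0, horizon), advancing by `slide` each time.
    static List<long[]> windows(long length, long slide, long horizon) {
        List<long[]> result = new ArrayList<>();
        for (long start = 0; start + length <= horizon; start += slide) {
            result.add(new long[] {start, start + length});
        }
        return result;
    }

    // Formats the boundaries as "[0,5) [5,10) ..." for printing.
    static String describe(List<long[]> ws) {
        StringBuilder sb = new StringBuilder();
        for (long[] w : ws) {
            sb.append("[").append(w[0]).append(",").append(w[1]).append(") ");
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Tumbling: slide == length, so windows never overlap.
        System.out.println("tumbling: " + describe(windows(5, 5, 15)));
        // Sliding: slide < length, so consecutive windows share events.
        System.out.println("sliding:  " + describe(windows(5, 2, 15)));
    }
}
```

With a 15-second horizon the tumbling case yields exactly three windows, while the sliding case with a 2-second slide yields six overlapping ones.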

Let's play with Apache Storm:

  • Apache Storm comes pre-installed on Hortonworks, so let's start Storm and Kafka.
  • Now let's log in to the Ambari Console.
  • The example we will run, WordCountTopology, takes a stream of sentences and counts the words by updating a map.
  • Location of the Apache Storm client: cd /usr/hdp/current/storm-client
  • Location of the sample examples – cd contrib/storm-starter/src/jvm/org/apache/storm/starter
  • Let's refer to this file – https://testbucket786786.s3.amazonaws.com/WordCountTopology.java
  • Let's run the file – storm jar /usr/hdp/current/storm-client/contrib/storm-starter/storm-starter-topologies-*.jar org.apache.storm.starter.WordCountTopology wordcount
  • Let it run, then go to the browser to see the Storm UI (it runs on port 8744) – http://127.0.0.1:8744
  • Now let's see what the WordCountTopology processed.
  • Let's check the logs now at: cd /usr/hdp/current/storm-client/logs/workers-artifacts
  • In this exercise the data was dumped to the log. We could just as well have written it to HDFS or some other sink.
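
For readers who want to see the word count logic without opening the sample file, here is a rough stand-alone sketch in plain Java. It is not the code from WordCountTopology.java and it does not use the Storm API. The class, the method names and the sample sentences are assumptions for illustration only. The three pieces mirror the roles described above: a source of sentences (the spout's job), a step that splits sentences into words, and a step that updates a map of counts (the bolts' job).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: mimics the spout -> split -> count flow
// with ordinary method calls instead of Storm spouts and bolts.
class WordCountSketch {

    // Stand-in for a spout: supplies the stream of sentences (made-up data).
    static List<String> sentenceSource() {
        return List.of("the cow jumped over the moon", "the man went to the store");
    }

    // Stand-in for a splitting bolt: turns one sentence into words.
    static List<String> split(String sentence) {
        List<String> words = new ArrayList<>();
        for (String w : sentence.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Stand-in for a counting bolt: keeps a running count per word in a map.
    static class Counter {
        private final Map<String, Integer> counts = new HashMap<>();

        // Increments the count for the word and returns the updated count.
        int add(String word) {
            return counts.merge(word, 1, Integer::sum);
        }

        Map<String, Integer> snapshot() {
            return new HashMap<>(counts);
        }
    }

    // Wires the pieces together the way a topology wires spout to bolts.
    static Map<String, Integer> run(List<String> sentences) {
        Counter counter = new Counter();
        for (String sentence : sentences) {
            for (String word : split(sentence)) {
                counter.add(word);
            }
        }
        return counter.snapshot();
    }

    public static void main(String[] args) {
        // Here the result goes to stdout; it could equally go to a file or another sink.
        System.out.println(run(sentenceSource()));
    }
}
```

In a real topology the counts are emitted continuously as tuples flow through, rather than returned once at the end as in this sketch.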