Apache NIFI – A Graphical Streaming Tool with Workflow Features

Apache Nifi:

It is a data streaming and transformation tool
It has a nice Web based UI where we can configure the workflow.
To understand the power of Nifi lets play with it directly.

Lets play with Nifi:

Lets stream live twitter feed from the twitter hose.
Prerequisites –
- A twitter developer account – if you don’t have follow these steps:
  - Go to – https://apps.twitter.com
  - Now login using yr existing Twitter account ( if you even don’t have Twitter account then you have to create one)
  - After logging in – click on “Create New App”
  - Once the app is created, go to “Keys and Access Tokens” tab and copy the “Consumer Key (API Key)” and “Consumer Secret (API Secret)”.
  - You will need this for authentication in the NIFI UI in the twitter hose.
Install and Configure Nifi:
- First connect to the Sandbox console using ‘maria_dev’ credentials and elevate your permissions to root.
- create a folder for NIfi – mkdir opt/nifitest
- Download the latest Nifi tar ball from – wget https://archive.apache.org/dist/nifi/1.4.0/nifi-1.4.0-bin.tar.gz
- Extract the file – tar xvfz nifi-1.4.0-bin.tar.gz
- change the folder to – cd nifi-1.4.0
- Update this file – nano conf/nifi.properties
  - Update the following:
    - nifi.web.http.port=8090
- Now start Nifi – ./bin/nifi.sh start
- Now login to the NiFi UI – http://127.0.0.1:8090/nifi or http://hostname:8090/nifi or http://IPAddress:8090/nifi
- There are 100s of templates available to setup and configure a workflow in Nifi. Here is the link – https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates
- Lets download the one related to Twitter Hose – https://cwiki.apache.org/confluence/download/attachments/57904847/Pull_from_Twitter_Garden_Hose.xml?version=1&modificationDate=1433234009000&api=v2
- Import the downloaded template into Nifi –
- Verify If Template Upload/Install was sucessful
- Now click and drag the template icon button to the work area as shown in the snapshot below:
- You will see the template like below:
- Click on “Grab Garden Hoze” to select the box first, then right click to open context menu and click ‘Configure’.
- In the ‘Configure Processor’ pop up that opens, enter the following
  - Twitter Endpoint = Filter Endpoint
  - Consumer Key, Consumer Secret, Access Token and Access Token secret from the twitter app you created.
- Now create a directory and give permissions in HDFS where we can save the tweets.
  - hadoop fs -mkdir /user/maria_dev/nifitest
  - hadoop fs -chmod 777 /user/maria_dev/nifitest
- Now come back to Nifi UI and click and drag the processor button as in the snapshot below – and filter HDFS Processor
- Configure the processor and integrate it with workflow to do this, click
  “Find only Tweets” to “PutHDFS” by dragging relationship arrow.
- Now Configure the PutHDFS Processor (right click and “Configure” and then set the directory to – /user/maria_dev/nifitest
- Right click on “matched” queue and configure the following:
- Now select all the components (Ctrl + A) and press the process arrow button as shown below.
- Now lets verify if data was written in HDFS
- Check – hadoop fs -ls /user/maria_dev/nifitest
- Now that the data is pulled in HDFS, lets use any tool to visualize the data.
- Lets use Hive for now –
  - Create an external table pointing to the tweets folder in HDFS:
  - CREATE EXTERNAL TABLE tweetshosedata( tweet_id bigint, created_unixtime bigint, created_time string, lang string, displayname string, time_zone string, message string) ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe’ WITH SERDEPROPERTIES( ‘field.delim’=’|’, ‘serialization.format’=’|’) STORED AS INPUTFORMAT ‘org.apache.hadoop.mapred.TextInputFormat’ OUTPUTFORMAT ‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’ LOCATION ‘/user/maria_dev/nifitest’;
  - Now query to view the data
  - select displayname, count(*) as NumberOfTweets
    from tweetshosedata
    group by displayname
    order by NumberOfTweets desc;
  - So we streamed data from Twitter Hose and upload in HDFS and then created an external table in Hive to visualize the data using a Hive Query.