Big Data

How to Process Data with Pig with MapReduce and TEZ

Circuit board macro detail
Photo: Alexandre Debiève / Unsplash · Royalty-free
  • As we know that HDFS is the distributed storage system of Hadoop. Similarly MapReduce is the core processing engine.
  • Recently TEZ is also becoming a lot popular as it is much faster than MapReduce. the reason for its speed is its nature of interpreting the relationship between mapping, sorting, shuffling, sorting and it creates a execution plan based on the interpretation.
  • Pig sits on the top of either MapReduce or TEZ and uses very nicely structured language called as ‘Pig Latin’ which uses a syntax quite similar to SQL.
  • Lets try to process these 2 things using Pig on the top of MapReduce/TEZ
    • Top Rated Oldest Movies – those oldies goldies movies  – top rated means 4.0 and above on scale of 5.0
    • Max Rated Worst movies – those movies which are worst rated but  have got maximum ratings by users that they are worst – lol  – Worst rated means 2.0 and below ratings.
  • Okay, so login to Ambari Dashboard and open “Pig View” as below:
  • 019 - Pig dashboard
    Click ‘Pig View’ and then on ‘New script’
  • Now write the following script on the script editor window and Execute once using TEZ checkbox unchecked and then with TEZ checkbox checked and see the difference in speed of execution.
  • Here is the link , you can download the script – https://s3.amazonaws.com/testbucket786786/OldiesButGoldies.txt
  • 020 - Pig Script Oldies but goldies
  • 020a - Pig Script Oldies but goldies process
    Checking “Execute with TEZ will use TEZ else MapReduce
  • 020b - Pig Script Oldies but goldies result
    Result – The Thin Man 1934 is highest rated oldy ….another big data analysis…lol
  • Pig with TEZ is getting hotter these days due to its sheer speed.
  • Now lets interpret the Pig Latin script

# PIG uses relations to process data# PIG uses relations to process data # Lets create a relation ‘ratings’ using the LOAD command to load the file from given path, # Specify column names and data types# By default the delimiter id TAB so we are good ratings = LOAD ‘/user/maria_dev/ml-100k/u.data’  AS (userID:int,             movieID:int,                 rating:int,                 ratingTime:int);
# Lets create another relation ‘metaData’ using the LOAD command to load the file from given path, # Specify column names and data types # This file has a pipe delimiter so we have to use another command to specify the delimiter – USING PigStorage(‘|’)
metaData = LOAD ‘/user/maria_dev/ml-100k/u.item’  USING PigStorage(‘|’)             AS (movieID:int,             movieTitle:chararray,                 releaseDate:chararray,                 videoRelease:chararray,                 imdbLink:chararray); # Now lets do some transformation to make the release date in proper format # use ToUnixTime and ToDate function to converts date strings to dates # To create another relation from an exisitng relation, use command – ForEach relationname GENERATE # Generate command generates a new relation from exisitng one nameLookup = FOREACH metaData  GENERATE movieID,                      movieTitle,                          ToUnixTime(ToDate(releaseDate,’dd-MMM-yyyy’)) AS releaseTime;
#now lets group the movies by movieId ratingsByMovie = GROUP ratings by movieID;
#now lets group the relation by average of the ratings #use AVG command to calculate average # and ForEach relationname GENERATE to generate a new relation avgRatings = FOREACH ratingsByMovie GENERATE group as movieID, AVG(ratings.rating) AS avgRating;
# now filter (FILTER command) movies with ratings 4.0 and above fiveStarMovies = FILTER avgRatings BY avgRating>4.0;
#now join the filtered relation with the movies metadata relation fiveStarMoviesDetails = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;
#now ORDER by release date –the oldest movies will be on top with ratings 4 and above oldestfiveStarMoviesDetails = ORDER fiveStarMoviesDetails BY nameLookup::releaseTime;
# dump the relation to show the data(in prod world you will write the data into a file) DUMP oldestfiveStarMoviesDetails

  • To find worst rated movies with maximum ratings you have yo change the above code at 2 places – filter woth rating less than equal to 2.0 and order by ratingcount descending
  • Here is the link – https://s3.amazonaws.com/testbucket786786/WorstMovies.txt
  • 020 - Pig Script worst movies
    filter only those movies which are bad(rating 2.0 or below…while grouping add one more column for count of ratings and while ordering use this numofratings descending as the order by column
     
  • 020b - Pig Script Oldies but goldies result final
  • Hurray you have now used PIG latin to get the old good movies as well as worst movies