Big Data - Advanced

Apache Spark – A Deep Dive – Series 6 of N – Analysis of a Super Heroes Social Network Graph

  • If you have used social apps like Facebook or Twitter, you will have noticed that each user has a network of friends, then friends of friends, and so on.
  • There is always a degree of separation between you and everyone in that network: yourself (degree of separation is 0), your direct friends (degree of separation is 1), your friends’ friends (degree of separation is 2), and so on.
  • Based on the degree of separation and the count of friends, we can work out who is most popular, who has the most friends, who has the most friends of friends, etc.
  • In the example below we will try to find the “Most Popular Super Hero”.
  • Marvel Network maintains a list of such super heroes and their network.
  • A similar algorithm can be applied to any social network.
  • There are 2 source files for the exercise: marvel-heroes.txt and marvel-network.txt, the two files the scripts read.
  • Let's check out exercise 1
    • Here is the Python script; download it from https://testbucket786786.s3.amazonaws.com/spark/sparkMostPopularHero.py
    • Here is the source code
      • # First import SparkConf and SparkContext from the pyspark module
        from pyspark import SparkConf, SparkContext

        # Then, set SparkConf with master as local (standalone local mode) and an app name
        sConf = SparkConf().setMaster("local").setAppName("MostPopularHero")
        # Then, create the SparkContext based on the SparkConf
        sContext = SparkContext(conf=sConf)

        # Python function to return a (heroID, heroName) key-value pair
        def processHeros(line):
            fields = line.split('"')
            heroID = int(fields[0])
            # on Python 3 the name is already a Unicode string, so no .encode() is needed
            heroName = fields[1]
            return (heroID, heroName)

        # Python function to return (heroID, number of other IDs on the line)
        def processHeroCounts(line):
            fields = line.split()
            heroID = int(fields[0])
            occurrenceCount = len(fields) - 1
            return (heroID, occurrenceCount)

        # read the marvel heroes file
        heroData = sContext.textFile("/home/user/bigdata/datasets/Otherdata/marvel-heroes.txt")
        # map the data to create (heroID, heroName) pairs
        herosRdd = heroData.map(processHeros)
        # read the marvel networks file
        networkData = sContext.textFile("/home/user/bigdata/datasets/Otherdata/marvel-network.txt")
        # map the data to create (heroID, count) pairs
        networks = networkData.map(processHeroCounts)
        # now reduce by key to sum the counts of a hero that appears on several lines
        networksByKey = networks.reduceByKey(lambda x, y: x + y)
        # now flip each pair to (count, heroID) so that max() compares by count
        mostpopularHeroId = networksByKey.map(lambda pair: (pair[1], pair[0])).max()
        # look up the name for the winning hero ID
        mostpopularHeroName = herosRdd.lookup(mostpopularHeroId[1])[0]
        print("The most popular hero is %s with %d as number of friends" % (mostpopularHeroName, mostpopularHeroId[0]))
    • To see the output, run the script with spark-submit
      • execute: spark-submit sparkMostPopularHero.py
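To make the steps of exercise 1 easier to follow, here is a minimal plain-Python sketch of the same pipeline (map, reduce by key, flip, max) that runs without Spark. The sample lines and hero names are made up for illustration and are not rows from the Marvel files; the snake_case function names are local stand-ins for the script's functions.

```python
from collections import defaultdict

# Made-up sample lines in the format the script expects (not real dataset rows).
hero_lines = ['1 "HERO ALPHA"', '2 "HERO BETA"', '3 "HERO GAMMA"']
network_lines = ["1 2 3", "2 1", "1 3", "3 1 2"]

def process_heros(line):
    # Splitting on the double quote leaves the ID first and the name second.
    fields = line.split('"')
    return int(fields[0]), fields[1]

def process_hero_counts(line):
    # The first field is the hero ID; every other field counts as one occurrence.
    fields = line.split()
    return int(fields[0]), len(fields) - 1

# "map" step: one (heroID, count) pair per network line.
pairs = [process_hero_counts(line) for line in network_lines]

# "reduceByKey" step: sum the counts of heroes that appear on several lines.
totals = defaultdict(int)
for hero_id, count in pairs:
    totals[hero_id] += count

# Flip to (count, heroID) so that max() compares by count first.
best_count, best_id = max((count, hero_id) for hero_id, count in totals.items())

# "lookup" step: translate the winning ID back to a name.
names = dict(process_heros(line) for line in hero_lines)
summary = "The most popular hero is %s with %d as number of friends" % (names[best_id], best_count)
print(summary)
```

Hero 1 appears on two lines in this sample, which is why the per-line counts have to be summed before taking the maximum.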
  • Let's check out exercise 2, which extends exercise 1 by listing the top 25 heroes with their occurrence counts
    • Download this: https://testbucket786786.s3.amazonaws.com/spark/sparkAllPopularHero.py
    • Here is the code:
    • # First import SparkConf and SparkContext from the pyspark module
      from pyspark import SparkConf, SparkContext

      # Then, set SparkConf with master as local (standalone local mode) and an app name
      sConf = SparkConf().setMaster("local").setAppName("MostPopularHero")
      # Then, create the SparkContext based on the SparkConf
      sContext = SparkContext(conf=sConf)

      # Python function to load a {heroID: heroName} dictionary from the heroes file
      def loadHeros():
          heroes = {}
          with open("/home/user/bigdata/datasets/Otherdata/marvel-heroes.txt") as heroFile:
              for line in heroFile:
                  fields = line.split('"')
                  heroes[int(fields[0])] = fields[1]
          return heroes

      # Python function to return (heroID, number of other IDs on the line)
      def processHeroCounts(line):
          fields = line.split()
          heroID = int(fields[0])
          occurrenceCount = len(fields) - 1
          return (heroID, occurrenceCount)

      # Python function to print a list of (heroName, count) results
      def printRDD(results):
          for hero in results:
              heroName = str(hero[0])
              occurrenceCount = int(hero[1])
              print("Hero Name: %s, Occurrence Count: %d" % (heroName, occurrenceCount))

      # broadcast the hero dictionary so every executor gets a read-only copy
      heroesDict = sContext.broadcast(loadHeros())
      # read the marvel networks file
      networkData = sContext.textFile("/home/user/bigdata/datasets/Otherdata/marvel-network.txt")
      # map the data to create (heroID, count) pairs
      networks = networkData.map(processHeroCounts)
      # now reduce by key to sum the counts of a hero that appears on several lines
      networksByKey = networks.reduceByKey(lambda x, y: x + y)
      # replace each hero ID with its name (tuple-unpacking lambdas are not valid on Python 3)
      networksByKeySorted = networksByKey.map(lambda pair: (heroesDict.value[pair[0]], pair[1]))
      # take the 25 pairs with the highest counts and print them
      printRDD(networksByKeySorted.top(25, key=lambda x: x[1]))
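The last two steps of exercise 2 (swap each ID for its name, then keep the highest counts) can be sketched in plain Python as follows. This is only an analogue, assuming that top(n, key=...) returns the n largest elements in descending order, which is also what heapq.nlargest does; the IDs, names, and counts are made up.

```python
import heapq

# Hypothetical (heroID, total count) pairs, standing in for networksByKey.
counts_by_id = [(1, 3), (2, 1), (3, 2), (4, 7)]
# Hypothetical ID-to-name dictionary, standing in for heroesDict.value.
hero_names = {1: "HERO ALPHA", 2: "HERO BETA", 3: "HERO GAMMA", 4: "HERO DELTA"}

# Same shape as the map step: replace each ID with its name.
named = [(hero_names[hero_id], count) for hero_id, count in counts_by_id]

# Plain-Python analogue of top(2, key=lambda x: x[1]):
# the 2 pairs with the largest counts, largest first.
top_two = heapq.nlargest(2, named, key=lambda x: x[1])

for name, count in top_two:
    print("Hero Name: %s, Occurrence Count: %d" % (name, count))
```

Because the name dictionary is small, looking names up in a dictionary inside the map step avoids a second RDD and a lookup per result, which is the reason exercise 2 broadcasts it.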

    • To see the output, execute: spark-submit sparkAllPopularHero.py
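The degrees of separation described at the start of this post are not computed by either exercise, which only count connections. As a rough illustration of the idea, here is a small breadth-first search in plain Python. The friendship graph and the names in it are hypothetical, and the function is a sketch rather than part of the scripts above.

```python
from collections import deque

# Hypothetical friendship graph: each person maps to their direct friends.
friends = {
    "you": ["ann", "bob"],
    "ann": ["you", "cat"],
    "bob": ["you"],
    "cat": ["ann", "dan"],
    "dan": ["cat"],
}

def degrees_of_separation(graph, start):
    """Return {person: degree of separation from start} using breadth-first search."""
    degrees = {start: 0}  # you are 0 degrees from yourself
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in graph.get(person, []):
            if friend not in degrees:  # first visit is the shortest path
                degrees[friend] = degrees[person] + 1
                queue.append(friend)
    return degrees

degrees = degrees_of_separation(friends, "you")
print(degrees)
```

Direct friends come out at degree 1 and friends of friends at degree 2, matching the definition given earlier.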