Install Spark on Linux or Windows as Standalone Setup without Hadoop Ecosystem

Windows

Install JDK (Java Development Kit)
- Visit Java site – http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
  - Select your environment ( Windows x86 or x64)
  - Accept license and download it
  - Install JDK, but make sure your installation folder should not have spaces in path name e.g d:\jdk8
  - The same with the JRE folder –
Install Spark:
- Visit Spark site – https://spark.apache.org/downloads.html
- Lets select Spark version 2.3.0 and click on the download link
- Now lets unzip the tar file using WinRar or 7Z and copy the content of the unzipped folder to a new folder D:\Spark
- Rename file conf\log4j.properties.template file to log4j.properties
- Edit the file to change log level to ERROR – for log4j.rootCategory
Download WinUtils :
- Its an artifacts for mimicking Hadoop
- Visit GitHub at – https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe
- Copy this file to D:\WinUtils\bin
- execute command – winutils.exe chmod 777 \tmp\hive from that folder
Setup Environmental variables:
- Right-click Windows menu –> select Control Panel –> System and Security –> System –> Advanced System Settings –> Environment Variables
  - Add these USER variables:
    - SPARK_HOME as D:\SPARK
    - JAVA_HOME as D\JDK8
    - HADOOP_HOME as D:\WINUTILS
  - Append the below to PATH variable:
    - %SPARK_HOME%\bin
    - %JAVA_HOME%\bin
Install Enthought Canopy:
- Visit Enthought canopy site at – https://store.enthought.com/downloads/#default
- Select environment for Windows(32 bit or 64 bit) and download 3.5 version canopy and install.
Now lets test and play:
- Open Enthought Canopy
- Tools –> Canopy Command Prompt.
- Go to D:\spark folder
- Look for README.md or CHANGES.txt in that folder
- Type and Enter pyspark
- On this “>>>” prompt.
  - Type and Enter myRDD= sc.textFile(“README.md”)
  - Then Type and enter myRDD.count()
- If you get successful count then you succeeded in installing Spark with Python on Windows
- Type and Enter quit() to exit the spark.

Linux

Install JDK (Java Development Kit)
- To install JRE8- yum install -y java-1.8.0-openjdk
- To install JDK8- yum install -y java-1.8.0-openjdk-devel
- execute – javac -version
  - It should return a version as 1.8
Install Python
- To install Python :
- sudo yum -y install yum-utils
  sudo yum -y groupinstall development
  sudo yum -y install https://centos7.iuscommunity.org/ius-release.rpm
  sudo yum -y install python36u
- Setup alias for python command and update the ~/.bashrc
  - echo “alias python=python36” >> ~/.bashrc
  - source ~/.bashrc
- execute – python -version
  - It should return a version as 3.6
- Install pip –
  - curl “https://bootstrap.pypa.io/get-pip.py” -o “get-pip.py”
    python get-pip.py
    pip -V
Install Spark:
- First move to opt folder – cd /opt
- Now download proper version of Spark(First go to https://spark.apache.org/downloads.html and then copy the link address) – wget https://www.apache.org/dyn/closer.lua/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
- Unzip the tar – tar xvfz spark-2.3.0-bin-hadoop2.7.tgz
- Rename spark-2.3.0-bin-hadoop2.7 to spark – mv spark-2.3.0-bin-hadoop2.7 spark
- Rename file conf\log4j.properties.template file to log4j.properties
- Edit the file to change log level to ERROR – for log4j.rootCategory
- Install PySpark – pip install pyspark
Install Scala:
- cd /opt
- wget https://scala-lang.org/files/archive/scala-2.11.6.tgz
- Unzip and rename –
  - tar -xfz scala-2.11.6.tgz
  - mv scala-2.11 scala
- execute – scala -version
  - It should return a version as 2.11
Update PATHS by updating file ~/.bashrc:
- nano ~/.bashrc
- then add these lines and save
  - alias python=python3.6
    alias pip=pip3 export SPARK_HOME=/opt/spark
    export PATH=$PATH:/opt/spark/bin
    export PATH=$PATH:/opt/scala/bin
- then reload bash file – source ~/.bashrc
Now if you run
- pyspark – it should show spark version
- spark-shell – it should run scala version
Lets test
- run pyspark
  - go to \opt\spark folder
  - run pyspark
  - On this “>>>” prompt.
    - Type and Enter myRDD= sc.textFile(“README.md”)
    - Then Type and enter myRDD.count()
  - Yay!!!, you tested by running word count on file README.md
- Now One more Test
  - Download Movielens data-set –
    - wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
    - unzip ml-100k.zip
  - Download python script:
    - wget https://testbucket786786.s3.amazonaws.com/spark/sparkFirst.py
    - correct the path of the u.data file in ml-100k folder in the script:
    - from pyspark import SparkConf, SparkContext
      import collections
      sConf = SparkConf().setMaster(“local”).setAppName(“RatingsRDDApp”)
      sContext = SparkContext(conf = sConf)
      alllinesRDD = sContext.textFile(“/home/user/bigdata/datasets/ml-100k/u.data”)
      allratingsRDD =alllinesRDD.map(lambda line: line.split()[2])
      resultRDD= allratingsRDD.countByValue()
      sortedResultsRDD = collections.OrderedDict(sorted(resultRDD.items()))
      for rddKey, rddValue in sortedResultsRDD.items():
      print(“%s %i” % (rddKey, rddValue))
  - Now run the python script:
    - spark-submit sparkFirst.py
    - Yay!! you read the ratings count for each movie in Movielens data base using a python script.