An introduction to large-scale data processing: an installation guide for Apache Spark 2.3.1 and Hadoop 3

As reporting tools and related technologies evolve, the need to process data in near real time keeps growing. The evolution of mobile technology, the heavy use of social media applications, the development of location-based services and the ever-increasing rate of daily data production all show that large-scale data processing is gaining in importance.

One of the most widely used platforms for big data processing is Apache Hadoop. In short, Hadoop is a distributed computing framework that consists of two main components, HDFS and MapReduce. HDFS handles the storage of data across multiple machines, whereas MapReduce handles the processing of that data across the machines of the cluster. Because a MapReduce job can take a considerable amount of time to execute, it is ill-suited for real-time data processing. This gap is filled by streaming analytics platforms.

Within the Apache ecosystem, Spark is one of the most popular streaming analytics platforms. This is because Spark supports both real-time and batch data processing, and its latest versions support applications written in several languages, such as Java, Scala, Python, R and SQL. Alongside Spark, other widely used Apache projects for large-scale and real-time data processing are Storm, Flink, Kafka and Samza, each with its own strengths and weaknesses.

This article focuses on Spark and describes the steps for installing the latest standalone version of Apache Spark at the time of writing (version 2.3.1, released 08.06.2018) on the Windows 10 operating system. The installation consists of four steps:

Step 1: Download and install JRE

Navigate to Oracle’s official website, download the latest Java Runtime Environment (JRE) for Windows x64 and install it by running the downloaded .exe file. The installer sets the PATH variable automatically.
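
To confirm that Java is installed and reachable from the command line, open a new Command Prompt and run the following (the exact output format varies with the JRE version):

java -version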

Step 2: Download and install Hadoop 3.0.0

Navigate to Steve Loughran’s GitHub page and download the hadoop-3.0.0 folder, which contains the Windows binaries (winutils.exe and related libraries) for Hadoop 3.0.0. Create a new folder at a preferred location on your local machine (for example C:\hadoop) and paste the downloaded hadoop-3.0.0 folder into it. Add a new environment variable by setting its name and value:

HADOOP_HOME=C:\hadoop\hadoop-3.0.0
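
If you prefer the command line over the System Properties dialog, the built-in setx command can set the variable as well; note that setx writes to the user environment and only takes effect in newly opened command prompts:

setx HADOOP_HOME "C:\hadoop\hadoop-3.0.0"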

After setting the environment variable, create a temp folder on your local machine at C:\tmp\hive for HDFS usage and change the folder’s permissions. Open the Windows Command Prompt (run as administrator) and execute the following:

mkdir C:\tmp\hive
cd C:\hadoop\hadoop-3.0.0\bin
winutils.exe chmod 777 C:\tmp\hive
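
To verify that the permissions were applied, winutils also provides an ls command (a quick check, assuming the standard winutils command set):

winutils.exe ls C:\tmp\hive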


Step 3: Download and install Spark 2.3.1

Navigate to Spark’s official download page and download Spark 2.3.1, pre-built for Apache Hadoop 2.7. Create a new folder at a preferred location on your local machine (for example C:\spark). The download is a .tgz archive, so it needs to be extracted twice (once for the gzip compression, once for the tar archive); paste the resulting spark-2.3.1-bin-hadoop2.7 folder into C:\spark. Add a new environment variable by setting its name and value:

SPARK_HOME=C:\spark\spark-2.3.1-bin-hadoop2.7

Append the following value to the Path variable:

%SPARK_HOME%\bin
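
To confirm that both variables took effect, open a new Command Prompt (already-open prompts do not pick up the changes) and check that SPARK_HOME resolves and that the pyspark launcher is found through the Path:

echo %SPARK_HOME%
where pyspark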

Step 4: Validate installation

Open the command line and execute the following to check that Spark has been set up correctly, using the Python Spark API, pyspark:

pyspark --version

If the installation is successful, the command responds with a welcome message and basic information about the Spark installation. PySpark is also published on PyPI, so it can be installed as a regular Python package. As with all pip installations, execute:

pip install pyspark
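
A quick way to confirm that the pip installation worked is to import the package and print its version from any Python-enabled prompt:

python -c "import pyspark; print(pyspark.__version__)"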

For getting started with PySpark and its modules (streaming, SQL and machine learning), detailed documentation can be found on Spark’s official documentation pages.
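
As a minimal first PySpark program, the sketch below creates a SparkSession, distributes a small list across the cluster (here just the local machine) and runs a trivial action on it; the application name is arbitrary:

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession; "local[*]" uses all available cores.
spark = SparkSession.builder.master("local[*]").appName("GettingStarted").getOrCreate()

# Distribute a small Python list as an RDD and count its elements.
rdd = spark.sparkContext.parallelize(range(100))
print(rdd.count())  # expected output: 100

spark.stop()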
