How to set up a Spark multi-node cluster on AWS-EC2

Context
Our, Timothy Humphrey and I, aim was to benchmark the performance of Apache Spark against HPCC.  Tim had already a set up HPCC on AWS and had all the worker nodes on the same placement group so that the network latency is low. The easiest way to start a Spark multi-node cluster in AWS is to use AWS EMR. There are various resources like this, which explain  how to Spark multi-node cluster.  But, EMR service does not support placement group. Since, it only supports a small number of EC2 instance types, as compared to the EMR service. Hence, we decided to set up Spark on EC2 rather than using AWS EMR. The rest of the article will document the process.

Disclaimer
Most of the content of this document is from here with some changes for example, the tutorial talks about install Spark on Ubuntu system and for our setup we wanted to use Centos (for legacy reasons).

Steps
1. Spin N nodes on EC2. For this article, we assume N=5. Please make sure that the ports (7077, 8080, etc.) are opened (refer to SPARK_WORKER_PORT or here).

2. Add entries in hosts file (in master node)

>  sudo nano /etc/hosts 
172.31.38.51 master
172.31.42.210 slave01
172.31.32.212 slave02
172.31.39.214 slave03
172.31.39.135 slave04
172.31.36.40 slave05

3. Install Java 8 (in master node)

 sudo yum remove java7; cd /opt/; sudo wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz"; sudo tar xzf jdk-8u131-linux-x64.tar.gz; cd /opt/jdk1.8.0_131/; sudo alternatives --install /usr/bin/java java /opt/jdk1.8.0_131/bin/java 2; sudo alternatives --config java; sudo alternatives --install /usr/bin/jar jar /opt/jdk1.8.0_131/bin/jar 2; sudo alternatives --install /usr/bin/javac javac /opt/jdk1.8.0_131/bin/javac 2; sudo alternatives --set jar /opt/jdk1.8.0_131/bin/jar; sudo alternatives --set javac /opt/jdk1.8.0_131/bin/javac; cd ~

4. Install Scala  (in master node)

wget http://www.scala-lang.org/files/archive/scala-2.12.0-M5.rpm; sudo rpm -ihv scala-2.12.0-M5.rpm; scala -version

5. Configure passwordless SSH

5.1 Generate KeyPair in the master node (in master node)
ssh-keygen -t rsa -P "" 
This will save the key in $HOME/.ssh/id_rsa.pub.

5.2 Append the content  of  $HOME/.ssh/id_rsa.pub to $HOME/.ssh/authorized_keys of both master as well as worker nodes.
Verify passwordless ssh by  (in master node)
ssh slave01
ssh slave02 
ssh slave03 
ssh slave04 
ssh slave05

6. Install Spark

6.1 Download and Extract Spark  (in master node)

cd ~; wget https://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.6.tgz; tar xzf spark-2.0.0-bin-hadoop2.6.tgz;

6.2 Edit environment variables  (in master node)
Append these lines to .bashrc
export JAVA_HOME=/opt/jdk1.8.0_131/
export SPARK_HOME=/home/ec2-user/spark-2.0.0-bin-hadoop2.6/
export PATH=$PATH:$SPARK_HOME/bin
Reload the terminal with updated .bashrc file (in master node)
source ~/.bashrc

6.3 Edit spark-env.sh (in master node)
cp $SPARK_HOME/conf/spark-env.sh.template  $SPARK_HOME/conf/spark-env.sh 

6.4 Add the following lines to $SPARK_HOME/conf/spark-env.sh (in master node)
export JAVA_HOME=/opt/jdk1.8.0_131/
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=1800M
export SPARK_WORKER_INSTANCES=2

SPARK_EXECUTOR_CORES defines the number of cores to use on each executor and SPARK_EXECUTOR_MEMORY defines the memory to use on each executor.  SPARK_WORKER_INSTANCES defines the number of workers on each instance.
Note that these settings is specific to cluster and changes with cluster.

6.5 Add the slaves by creating a file called 'slaves' in $SPARK_HOME/conf  (in master node)

slave01
slave02
slave03
slave04
slave05

7. Setup for slave nodes

7.1. Add Entries in hosts file in every slave node (refer to step 2)  (in slave nodes)

7.2 Install Java 8 (refer to step 3) but, in slave nodes instead of master node

7.3 Install Scala (refer to step 4) but, in slave nodes instead of master node

7.4 Create tarball of configured setup (in master node)

 cd ~; tar czf spark.tar.gz spark-2.0.0-bin-hadoop2.6 

7.5 Copy the configured tarball on all the slaves  (in master node)

scp spark.tar.gz slave01:~;  scp spark.tar.gz slave02:~;  scp spark.tar.gz slave03:~;  scp spark.tar.gz slave04:~ ;  scp spark.tar.gz slave05:~  

7.6 Un-tar configured spark setup on all the slaves (in slave nodes)
tar xzf spark.tar.gz 

8. Start Spark Cluster

8.1 Start the Spark cluster  (in master node)
$SPARK_HOME/sbin/start-all.sh

8.2 To check the setup visit the spark master ui at http://master-node-public-ip:8080

8.3 Application UI is available at http://master-node-public-ip:4040

9. Run a Pyspark program
We use a this program, which generates 1,000,000 records of 100 characters (key, value) and sorts the records based on key. The program can be run using

spark-submit --master spark://172.31.38.51:7077 core_test.py -a

Comments

Popular posts from this blog

PySpark Error - EOFError

How to use REST based calls to submit spark jobs