How to set up a Spark multi-node cluster on AWS-EC2
Context
Our aim, Timothy Humphrey's and mine, was to benchmark the performance of Apache Spark against HPCC. Tim had already set up HPCC on AWS, with all the worker nodes in the same placement group so that network latency would be low. The easiest way to start a Spark multi-node cluster on AWS is to use AWS EMR, and there are various resources, like this one, which explain how to set up a Spark multi-node cluster that way. However, the EMR service does not support placement groups, and it supports only a small number of EC2 instance types compared to what EC2 itself offers. Hence, we decided to set up Spark directly on EC2 rather than using AWS EMR. The rest of the article documents the process.
Disclaimer
Most of the content of this document is from here, with some changes. For example, the original tutorial describes installing Spark on an Ubuntu system, whereas for our setup we wanted to use CentOS (for legacy reasons).
Steps
1. Spin up N nodes on EC2. For this article, we assume N=5 worker nodes plus a master. Please make sure that the required ports (7077, 8080, etc.) are open (refer to SPARK_WORKER_PORT or here).
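The original cluster may well have been launched from the AWS console; if you prefer to script the launch, the minimal sketch below uses boto3 (the Python AWS SDK) to create a placement group, open the Spark ports, and start the instances. Everything specific in it is a placeholder assumption: the region, security group ID, AMI ID, instance type, and key pair name should be replaced with your own.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Keep all nodes close together to minimise network latency.
ec2.create_placement_group(GroupName="spark-cluster", Strategy="cluster")

# Open the Spark master port (7077) inside the VPC and the master web UI (8080) publicly.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder security group ID
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": 7077, "ToPort": 7077,
         "IpRanges": [{"CidrIp": "172.31.0.0/16"}]},
        {"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
)

# One master plus five slaves, all in the same placement group.
ec2.run_instances(
    ImageId="ami-xxxxxxxx",          # a CentOS AMI of your choice (placeholder)
    InstanceType="m4.large",         # placeholder instance type
    MinCount=6, MaxCount=6,
    KeyName="my-keypair",            # placeholder key pair name
    SecurityGroupIds=["sg-0123456789abcdef0"],
    Placement={"GroupName": "spark-cluster"},
)
The "cluster" placement strategy keeps the instances on the same low-latency network segment, which is what we wanted for the benchmark.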
2. Add entries in hosts file (in master node)
sudo nano /etc/hosts

172.31.38.51 master
172.31.42.210 slave01
172.31.32.212 slave02
172.31.39.214 slave03
172.31.39.135 slave04
172.31.36.40 slave05
3. Install Java 8 (in master node)
sudo yum remove java7
cd /opt/
sudo wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz"
sudo tar xzf jdk-8u131-linux-x64.tar.gz
cd /opt/jdk1.8.0_131/
sudo alternatives --install /usr/bin/java java /opt/jdk1.8.0_131/bin/java 2
sudo alternatives --config java
sudo alternatives --install /usr/bin/jar jar /opt/jdk1.8.0_131/bin/jar 2
sudo alternatives --install /usr/bin/javac javac /opt/jdk1.8.0_131/bin/javac 2
sudo alternatives --set jar /opt/jdk1.8.0_131/bin/jar
sudo alternatives --set javac /opt/jdk1.8.0_131/bin/javac
cd ~
4. Install Scala (in master node)
wget http://www.scala-lang.org/files/archive/scala-2.12.0-M5.rpm
sudo rpm -ihv scala-2.12.0-M5.rpm
scala -version
5. Configure passwordless SSH
5.1 Generate KeyPair in the master node (in master node)
ssh-keygen -t rsa -P ""
This will save the key in $HOME/.ssh/id_rsa.pub.
5.2 Append the content of $HOME/.ssh/id_rsa.pub to $HOME/.ssh/authorized_keys of both the master and the worker nodes.
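One way to do this from the master is the small sketch below. It makes assumptions about your environment: it relies on the EC2 launch key pair (here called mykey.pem, a placeholder) still working for the first connection to each node, and uses the hostnames defined in /etc/hosts above.
import subprocess

# Hostnames as defined in /etc/hosts above.
hosts = ["master", "slave01", "slave02", "slave03", "slave04", "slave05"]

with open("/home/ec2-user/.ssh/id_rsa.pub") as f:
    pubkey = f.read().strip()

for host in hosts:
    # The -i option uses the EC2 launch key pair for this first hop only.
    subprocess.run(
        ["ssh", "-i", "/home/ec2-user/mykey.pem", "-o", "StrictHostKeyChecking=no",
         host,
         "mkdir -p ~/.ssh && echo '%s' >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys" % pubkey],
        check=True,
    )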
Verify passwordless SSH by running the following (in master node)
ssh slave01
ssh slave02
ssh slave03
ssh slave04
ssh slave05
6. Install Spark
6.1 Download and Extract Spark (in master node)
cd ~
wget https://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.6.tgz
tar xzf spark-2.0.0-bin-hadoop2.6.tgz
6.2 Edit environment variables (in master node)
Append these lines to .bashrc
export JAVA_HOME=/opt/jdk1.8.0_131/
export SPARK_HOME=/home/ec2-user/spark-2.0.0-bin-hadoop2.6/
export PATH=$PATH:$SPARK_HOME/bin
Reload the terminal with the updated .bashrc file (in master node)
source ~/.bashrc
6.3 Edit spark-env.sh (in master node)
cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
6.4 Add the following lines to $SPARK_HOME/conf/spark-env.sh (in master node)
export JAVA_HOME=/opt/jdk1.8.0_131/
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=1800M
export SPARK_WORKER_INSTANCES=2
SPARK_EXECUTOR_CORES defines the number of cores to use on each executor, SPARK_EXECUTOR_MEMORY defines the memory to use on each executor, and SPARK_WORKER_INSTANCES defines the number of worker processes to run on each node. Note that these settings are specific to our cluster and will change from cluster to cluster.
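As a quick sanity check on such settings, the short sketch below (our own illustration, not part of the original tutorial) multiplies the values out; the node count and memory figures simply mirror the example configuration above.
# Back-of-the-envelope check of the standalone-cluster settings above.
slaves = 5                    # worker nodes in the cluster
worker_instances = 2          # SPARK_WORKER_INSTANCES
executor_cores = 1            # SPARK_EXECUTOR_CORES
executor_memory_mb = 1800     # SPARK_EXECUTOR_MEMORY

workers = slaves * worker_instances
print("total workers:        ", workers)
print("total executor cores: ", workers * executor_cores)
print("executor memory/node: ", worker_instances * executor_memory_mb, "MB")
The per-node total (here about 3.6 GB of executor memory) should fit comfortably within each instance's RAM.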
6.5 Add the slaves by creating a file called 'slaves' in $SPARK_HOME/conf (in master node)
slave01
slave02
slave03
slave04
slave05
7. Setup for slave nodes
7.1 Add entries in the hosts file on every slave node (refer to step 2) (in slave nodes)
7.2 Install Java 8 (refer to step 3), but in the slave nodes instead of the master node
7.3 Install Scala (refer to step 4), but in the slave nodes instead of the master node
7.4 Create tarball of configured setup (in master node)
cd ~
tar czf spark.tar.gz spark-2.0.0-bin-hadoop2.6
7.5 Copy the configured tarball to all the slaves (in master node)
scp spark.tar.gz slave01:~
scp spark.tar.gz slave02:~
scp spark.tar.gz slave03:~
scp spark.tar.gz slave04:~
scp spark.tar.gz slave05:~
7.6 Un-tar the configured Spark setup on all the slaves (in slave nodes)
tar xzf spark.tar.gz
8. Start Spark Cluster
8.1 Start the Spark cluster (in master node)
$SPARK_HOME/sbin/start-all.sh
8.2 To check the setup, visit the Spark master UI at http://master-node-public-ip:8080
8.3 The application UI is available at http://master-node-public-ip:4040 (while an application is running)
9. Run a PySpark program
We use this program, which generates 1,000,000 (key, value) records of 100 characters each and sorts them by key (a minimal sketch of such a program is shown after the command below). The program can be run using
spark-submit --master spark://172.31.38.51:7077 core_test.py -a
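Since the actual core_test.py is only linked above, here is a minimal PySpark sketch in the same spirit: generate (key, value) records totalling 100 characters each and sort them by key. It is an illustration only; the 10/90 key-value split, the file name sort_sketch.py, and the omission of the -a flag handling are assumptions, and the real program may differ.
# sort_sketch.py - a hypothetical stand-in for the linked core_test.py
import random
import string
from pyspark import SparkContext

NUM_RECORDS = 1000000
KEY_LEN = 10          # assumed split of the 100 characters per record
VALUE_LEN = 90

def make_record(i):
    # Seed per record so generation is deterministic and parallelisable.
    rnd = random.Random(i)
    key = "".join(rnd.choice(string.ascii_letters) for _ in range(KEY_LEN))
    value = "".join(rnd.choice(string.ascii_letters) for _ in range(VALUE_LEN))
    return (key, value)

if __name__ == "__main__":
    sc = SparkContext(appName="SortBenchmarkSketch")
    records = sc.parallelize(range(NUM_RECORDS)).map(make_record)
    sorted_records = records.sortByKey()
    # count() forces evaluation of the generation and the sort shuffle.
    print("sorted records:", sorted_records.count())
    sc.stop()
Such a sketch could be submitted the same way, e.g. spark-submit --master spark://172.31.38.51:7077 sort_sketch.py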