Context
Timothy Humphrey and I aimed to benchmark the performance of Apache Spark against HPCC. Tim had already set up HPCC on AWS, with all the worker nodes in the same placement group so that network latency would be low. The easiest way to start a multi-node Spark cluster on AWS is to use AWS EMR. There are various resources, like this one, that explain how to set up a multi-node Spark cluster. However, the EMR service does not support placement groups. It also supports only a small number of EC2 instance types compared to EC2 itself. Hence, we decided to set up Spark on EC2 rather than use AWS EMR. The rest of the article documents the process.

Disclaimer
Most of the content of this document is from here, with some changes. For example, the tutorial covers installing Spark on an Ubuntu system, whereas for our setup we wanted to use CentOS (for legacy reasons).

Steps
1. Spin up N nodes on EC2. For this article, we assume N=5. Please make sure that the ports (7077, 8080, etc.
How to use Livy: Livy (alpha) enables programmatic, fault-tolerant, multi-tenant submission of Spark jobs from web/mobile apps, which use REST calls to communicate with the Spark cluster. Note that at the time of writing this blog I have only tried Livy on a standalone PySpark setup, so I don't know the challenges involved in setting up Livy on a PySpark cluster. Since we will be using a local file, make sure to add the folder containing the PySpark scripts to the 'livy.file.local-dir-whitelist' parameter in the livy.conf file. Failing to do so results in the following error: requirement failed: Local path pi.py cannot be added to user sessions. The command used to submit a batch script is as follows:
curl -X POST --data '{"file": "file:/home/osboxes/spark-1.6.0/examples/src/main/python/pi.py"}' -H "Content-Type: application/json" localhost:8998/batches | python -m json.tool
This command runs the spark
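For callers that would rather not shell out to curl, the same batch submission can be sketched in Python using only the standard library. This is a minimal sketch, not part of the original setup: the function names build_batch_request and submit_batch are hypothetical, and the URL assumes Livy is listening on localhost:8998 as in the curl command above.

```python
import json
import urllib.request

# Assumption: Livy listens on localhost:8998, as in the curl command above.
LIVY_URL = "http://localhost:8998"


def build_batch_request(script_path, livy_url=LIVY_URL):
    """Build (but do not send) the POST /batches request for a local script.

    Hypothetical helper: it mirrors the curl call by sending a JSON body
    of the form {"file": "file:<path>"} with a JSON content type.
    """
    payload = {"file": "file:" + script_path}
    return urllib.request.Request(
        url=livy_url + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_batch(script_path, livy_url=LIVY_URL):
    """Send the request and return the JSON reply from Livy as a dict."""
    request = build_batch_request(script_path, livy_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires a running Livy server, so it is left commented out):
# reply = submit_batch("/home/osboxes/spark-1.6.0/examples/src/main/python/pi.py")
# print(json.dumps(reply, indent=4))
```

As with curl, the script's folder still has to be listed under 'livy.file.local-dir-whitelist', otherwise Livy rejects the request with the "requirement failed" error quoted above.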
Solution 1: Problem: randds (in line 9) can have repeated numbers.
Solution 2: Problem: There is a certain pattern to this sampled data.
Solution 3: Problem: This is slow.