Search This Blog

ECL-Summer-2017

PySpark installation

Get link
Facebook
X
Pinterest
Email
Other Apps

June 05, 2017

PySpark is very easy to install by just following the steps listed in this link. The only catch is compilation step takes close to 30 mins.

Get link
Facebook
X
Pinterest
Email
Other Apps

Comments

PySpark Error - EOFError

June 09, 2017

I send almost the whole afternoon trying to debug an EOFError while implementing KMeansStreaming which uses a streaming THOR file as input. ( code ). I found the reason why I was getting the error. There were two reasons: The first clue was “java.lang.IllegalArgumentException: requirement failed” – which means the training and the testing data are not of the same dimension. The second clue was actually the EOFError – which is actually an out of memory error. The workaround was to increase the memory allocated using the option –executor-memory.

How to set up a Spark multi-node cluster on AWS-EC2

July 19, 2017

Context Our, Timothy Humphrey and I, aim was to benchmark the performance of Apache Spark against HPCC. Tim had already a set up HPCC on AWS and had all the worker nodes on the same placement group so that the network latency is low. The easiest way to start a Spark multi-node cluster in AWS is to use AWS EMR. There are various resources like this , which explain how to Spark multi-node cluster. But, EMR service does not support placement group. Since, it only supports a small number of EC2 instance types, as compared to the EMR service. Hence, we decided to set up Spark on EC2 rather than using AWS EMR. The rest of the article will document the process. Disclaimer Most of the content of this document is from here with some changes for example, the tutorial talks about install Spark on Ubuntu system and for our setup we wanted to use Centos (for legacy reasons). Steps 1. Spin N nodes on EC2. For this article, we assume N=5. Please make sure that the port...

How to access the VM from the host machine

July 06, 2017

Context: To connect to Spark using the EMBED feature of ECL along with LIVY. Since, I am running my Spark on a standalone system inside a Ubuntu 16.10 VM and I still wanted to used ECLIDE, I decided that I should be able to connect to VM from the host (windows) machine. This video clearly describes how to set this up.