
Showing posts from June, 2017

How to use REST based calls to submit spark jobs

How to use Livy: Livy (alpha) enables programmatic, fault-tolerant, multi-tenant submission of Spark jobs from web/mobile apps, using REST calls to communicate with the Spark cluster. Note: at the time of writing this blog, I have only tried Livy on a standalone PySpark setup, so I don't know the challenges involved in setting up Livy on a PySpark cluster. Since we would be using local files, make sure to add the folder containing the PySpark scripts to the 'livy.file.local-dir-whitelist' parameter of the livy.conf file. Failing to do so would result in the following error: requirement failed: Local path pi.py cannot be added to user sessions. The command used to submit a batch script is as follows:

    curl -X POST --data '{"file": "file:/home/osboxes/spark-1.6.0/examples/src/main/python/pi.py"}' -H "Content-Type: application/json" localhost:8998/batches | python -m json.tool

This command runs the Spark Pi example (pi.py) as a batch job.
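The same submission can be made from Python itself. Below is a minimal sketch against Livy's /batches endpoint, reusing the host, port, and script path from the curl example above; it assumes the requests library is installed, and the polling interval is arbitrary:

    import json
    import time
    import requests  # assumes the requests library is installed

    host = "http://localhost:8998"
    payload = {"file": "file:/home/osboxes/spark-1.6.0/examples/src/main/python/pi.py"}
    headers = {"Content-Type": "application/json"}

    # submit the script as a batch job
    batch = requests.post(host + "/batches", data=json.dumps(payload), headers=headers).json()
    print("batch id: %d, state: %s" % (batch["id"], batch["state"]))

    # poll the batch until it finishes
    state = batch["state"]
    while state not in ("success", "dead"):
        time.sleep(5)
        state = requests.get("%s/batches/%d" % (host, batch["id"])).json()["state"]
    print("final state: %s" % state)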

Performance comparison between FUSE plugin based download and Browser based download

To measure the performance of HPCCFuseJ-based download against browser-based download, we measure the time taken to copy the mounted folder (using HPCCFuseJ) to a local folder and compare it with the download time using the browser. The graph below shows the ratio. In the figure, it can be seen that the time required to download using HPCCFuseJ increases as the file size increases. The exponential trend of the graph is due to the multiple-fetch nature of HPCCFuseJ. HPCCFuseJ works in two phases: data fetching and data processing. The data fetching phase uses web service calls to fetch data. The data processing phase takes the data fetched (during the data fetching phase) and converts it to JSON format; this JSON data is then consumed by the application. The second figure shows the ratio of the total time required (by both phases) to the time required by the data fetch phase.
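A minimal sketch of how such a timing comparison can be scripted is shown below. The paths are hypothetical (/mnt/hpcc stands in for the HPCCFuseJ mount point), and the browser-side time is measured separately by hand:

    import shutil
    import time

    def timed_copy(src, dst):
        # time a plain file copy; for HPCCFuseJ, src sits under the FUSE mount
        start = time.time()
        shutil.copyfile(src, dst)
        return time.time() - start

    fuse_secs = timed_copy("/mnt/hpcc/test_file.csv", "/tmp/fuse_copy.csv")
    browser_secs = 42.0  # placeholder: measured by hand for the same file via the browser
    print("ratio (HPCCFuseJ / browser): %.2f" % (fuse_secs / browser_secs))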

Downloading file from the landing zone

Dan and I decided that one way to document the performance of HPCCFuseJ is to compare the performance of HPCCFuseJ with browser-based download. Browser-based download involves despraying a Thor file and then downloading it from ECL Watch. However, we ran into a roadblock, since I could not find the login credentials for the Linux box (containing the landing zone). One possible way to circumvent this problem is to use the URL from ECL Watch and trigger the download from the command line. Side note: we wanted to precisely log the time required for each download, hence we wanted a command-line utility to perform the download. The following is the process for finding the download URL (using Chrome):
1. Open the landing zone in ECL Watch and select the file to download.
2. Go to Chrome > More Tools > Developer Tools > Console.
3. Trigger a download, and the URL will show up in the console.
This URL (from step 3) can be used to download the file, as in the sketch below. ( code )
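With the URL in hand, a timed command-line download can be scripted along the following lines. This is a sketch only: the URL and output filename are placeholders, and the requests library is assumed to be installed:

    import time
    import requests  # assumes the requests library is installed

    url = "http://eclwatch-host:8010/..."  # paste the URL captured from the Chrome console here
    start = time.time()
    response = requests.get(url, stream=True)
    with open("downloaded_file", "wb") as out:
        for chunk in response.iter_content(chunk_size=1 << 20):
            out.write(chunk)
    print("download took %.2f s" % (time.time() - start))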

How to test the performance of the fuse plugin?

Dan and I discussed at length how to test this project: first the functional properties, and second the non-functional properties. The functional properties are simple, since we can easily compare the contents of the (actual) file with the contents of the (FUSE-downloaded) file. Testing the non-functional properties of the plugin, however, is a challenge. The current approach is to compare the file copy time using HPCCFuseJ with the file copy time using tools like WinSCP. The code used to test HPCCFuseJ can be found here . This code has to be used together with the dataset generator (developed by Dan) to run.
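A sketch of the timing comparison, assuming /mnt/hpcc is the HPCCFuseJ mount point and scp access to the cluster node is set up (hostnames and paths are hypothetical):

    import shutil
    import subprocess
    import time

    def timeit(fn):
        start = time.time()
        fn()
        return time.time() - start

    # copy the file through the FUSE mount
    fuse_secs = timeit(lambda: shutil.copyfile("/mnt/hpcc/test_file.csv", "/tmp/fuse_copy.csv"))
    # copy the same file over scp for comparison
    scp_secs = timeit(lambda: subprocess.check_call(
        ["scp", "user@hpcc-node:/data/test_file.csv", "/tmp/scp_copy.csv"]))
    print("HPCCFuseJ: %.2fs, scp: %.2fs" % (fuse_secs, scp_secs))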

What happens if there are child records in the table?

Dan's code  can be used to generate child datasets. HPCCFuseJ currently cannot handle child datasets. Datasets containing child datasets have to be flattened before they are used with HPCCFuseJ.
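In practice the flattening would be done in ECL before the file is read through HPCCFuseJ, but the following sketch illustrates the kind of shape change required (the record layout is invented for illustration):

    def flatten(record, prefix=""):
        # recursively pull fields out of nested (child) records into one flat dict
        flat = {}
        for key, value in record.items():
            name = prefix + key
            if isinstance(value, dict):
                flat.update(flatten(value, name + "."))
            else:
                flat[name] = value
        return flat

    flatten({"id": 1, "address": {"city": "Atlanta", "zip": "30301"}})
    # -> {"id": 1, "address.city": "Atlanta", "address.zip": "30301"}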

PySpark - Classification, Regression and Clustering Job

This week, I was able to run Logistic Regression, Linear Regression, and k-means using data streamed into PySpark using HPCCFuseJ. The code can be found here . Note:
- PySpark + HPCCFuseJ currently treats the HPCC file as a local file.
- The implementation has been applied only to simple flat Thor files. It has not been tested on more complicated datasets, such as datasets with child datasets or records.
- The performance of HPCCFuseJ has not been assessed yet. A good idea would be to test the performance of HPCCFuseJ and compare it to simply scp'ing the file to the local machine.
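As an illustration of the approach, a k-means run over a file exposed by HPCCFuseJ might look like the sketch below (Spark 1.6 MLlib API; the mount path, file name, and CSV layout are assumptions):

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="hpcc-kmeans")
    # the HPCC logical file appears as a local file under the FUSE mount
    lines = sc.textFile("file:///mnt/hpcc/flat_thor_file.csv")
    points = lines.map(lambda line: [float(x) for x in line.split(",")])
    model = KMeans.train(points, k=3, maxIterations=10)
    print(model.clusterCenters)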

PySpark Error - EOFError

I spent almost the whole afternoon trying to debug an EOFError while implementing KMeansStreaming, which uses a streaming Thor file as input ( code ). I eventually found out why I was getting the error. There were two clues: the first was "java.lang.IllegalArgumentException: requirement failed", which means the training and testing data are not of the same dimension. The second was the EOFError itself, which is actually an out-of-memory error. The workaround was to increase the allocated memory using the --executor-memory option, as shown below.
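With spark-submit the option is passed as --executor-memory. The equivalent setting can also be applied from inside a PySpark script via SparkConf, as long as it is set before the SparkContext starts; a sketch (the 4g value is illustrative):

    from pyspark import SparkConf, SparkContext

    # raise executor memory to avoid the out-of-memory EOFError
    conf = SparkConf().setAppName("kmeans-streaming").set("spark.executor.memory", "4g")
    sc = SparkContext(conf=conf)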

Share folder between windows and ubuntu VM (virtual box)

This can be used to share data between the host machine and the VirtualBox VM. Another resource is this .

PySpark installation

PySpark is very easy to install by just following the steps listed in this link. The only catch is that the compilation step takes close to 30 minutes.

Objectives

The objective of my project is to build a connector between the HPCC world and the Spark world.

Motivation

HPCC is the heart of the LN business, and almost all of the important data is stored in HPCC systems. Hence, if an analyst/statistician wants to build a model of (some or all of) the data using Spark, she needs to download the data to either her local system or move it to a cluster. This can be time-consuming, and she needs to be very careful to stay compliant with strict data rules. A possible solution to this problem is a bridge between these two worlds (Spark and HPCC).

Expected Output

We expect our output to be a connector that, when installed, enables ECL programmers to use Spark algorithms on data stored in HPCC, as well as PySpark programmers to use HPCC data.

Github Repo