PySpark - Classification, Regression and Clustering Job

June 16, 2017

This week, I ran Logistic Regression, Linear Regression and k-means in PySpark on data streamed in through HPCCFuseJ. The code can be found here.
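For readers who want a feel for the shape of such a job, here is a minimal sketch. It is not the code linked above. It assumes the mounted file reads as comma-separated numeric text with the label in the first column, which may not match what HPCCFuseJ actually serves. The names `parse_row`, `load_dataframe` and `fit_models` are made up for illustration.

```python
def parse_row(line, sep=","):
    """Split one delimited text line into (label, [feature, ...]) as floats.

    Assumption: the label is the first column and every column is numeric.
    """
    values = [float(v) for v in line.strip().split(sep)]
    return values[0], values[1:]


def load_dataframe(spark, path):
    """Read a locally visible file into a DataFrame of (label, features)."""
    # PySpark is imported inside the functions so parse_row works without it.
    from pyspark.ml.linalg import Vectors

    # The mount makes the HPCC file look like a local file, hence file://
    lines = spark.sparkContext.textFile("file://" + path)
    rows = lines.map(parse_row).map(lambda r: (r[0], Vectors.dense(r[1])))
    return spark.createDataFrame(rows, ["label", "features"])


def fit_models(path, k=2):
    """Fit the three model types on the same file (illustrative only)."""
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("hpcc-fuse-sketch").getOrCreate()
    df = load_dataframe(spark, path).cache()

    # In practice the label would be a class index for logistic regression
    # and a continuous target for linear regression, so real jobs would
    # likely read different files for each model.
    return {
        "logistic": LogisticRegression(maxIter=10).fit(df),
        "linear": LinearRegression(maxIter=10).fit(df),
        "kmeans": KMeans(k=k, seed=1).fit(df),  # ignores the label column
    }
```

Calling `fit_models` with the absolute path of a file under the mount point would return the three fitted models. On a multi-node cluster, a `file://` path has to be visible at the same location on every worker.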

Note:
- PySpark + HPCCFuseJ currently treats an HPCC file as a local file.
- The implementation has only been used on simple flat Thor files. It has not been tested on more complicated datasets, such as datasets containing child datasets or records.
- The performance of HPCCFuseJ has not been assessed yet. A good next step would be to measure it and compare it with simply scp'ing the file to the local machine.
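One way to start on the performance comparison from the last note is to time a full sequential read of the same data through the mount and from a copy fetched beforehand with scp. The sketch below uses only the Python standard library. `time_read` and `compare_reads` are hypothetical helper names, and both paths are whatever applies on your machine.

```python
import time


def time_read(path, chunk_size=1 << 20):
    """Read a file from start to end; return (bytes_read, elapsed_seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def compare_reads(mounted_path, local_copy_path):
    """Time the mounted file against a local copy of the same data."""
    fuse_bytes, fuse_secs = time_read(mounted_path)
    local_bytes, local_secs = time_read(local_copy_path)
    return {
        "fuse": {"bytes": fuse_bytes, "seconds": fuse_secs},
        "local": {"bytes": local_bytes, "seconds": local_secs},
    }
```

A fair comparison would also add the time scp took to fetch the local copy, and would repeat each measurement a few times, since the operating system's file cache can make a second read much faster than the first.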
