PySpark - Classification, Regression and Clustering Job

This week, I ran logistic regression, linear regression, and k-means in PySpark on data streamed in through HPCCFuseJ. The code can be found here.

Notes:
- PySpark + HPCCFuseJ currently treats an HPCC file as a local file.
- The implementation has only been used with simple flat Thor files. It has not been tested on more complicated datasets, such as those with child datasets or nested records.
- The performance of HPCCFuseJ has not been assessed yet. A good next step would be to measure it and compare it against simply scp'ing the file to the local machine.
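
As a starting point for the comparison suggested in the last note, here is a minimal sketch that times a sequential read of the same file through two paths: one under the HPCCFuseJ mount point and one copied locally with scp. It uses only the Python standard library. The function names and the example paths are my own placeholders, not part of HPCCFuseJ, and the sketch measures raw read throughput only, not an end-to-end PySpark job.

```python
import time


def time_sequential_read(path, chunk_size=1024 * 1024):
    """Read the file at `path` from start to end.

    Returns a tuple (bytes_read, elapsed_seconds).
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def compare_read_paths(fuse_path, local_path):
    """Time both paths and return read throughput in MB/s for each.

    `fuse_path` is assumed to point into the FUSE mount and `local_path`
    at a copy of the same file on local disk.
    """
    results = {}
    for label, path in (("fuse", fuse_path), ("local", local_path)):
        n_bytes, secs = time_sequential_read(path)
        results[label] = (n_bytes / 1e6) / secs if secs > 0 else float("inf")
    return results


# Example call (both paths are hypothetical placeholders):
# print(compare_read_paths("/mnt/hpcc/mydata", "/tmp/mydata"))
```

Two caveats when reading the numbers: the operating system's page cache can make a second read of the local copy look much faster than a cold read, and for a fair comparison the time spent on the scp transfer itself should be added to the local figure.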
