PySpark - Classification, Regression and Clustering Job

This week I ran logistic regression, linear regression, and k-means in PySpark on data streamed in through HPCCFuseJ. The code can be found here.

Notes:
- PySpark + HPCCFuseJ currently treats an HPCC file as a local file.
- The implementation has only been used on simple, flat Thor files. It has not been tested on more complicated datasets, such as those containing child datasets or nested records.
- The performance of HPCCFuseJ has not been assessed yet. A good next step would be to measure it and compare it against simply scp'ing the file to the local machine.
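
One rough way to start on the performance comparison from the last note is to time a full sequential read of the same file from two places: the HPCCFuseJ mount and a local copy made with scp. The snippet below is only a sketch under that assumption, not a measurement I have run. The function names `time_sequential_read` and `throughput_mb_per_s` are made up for illustration, and the paths are whatever you pass on the command line.

```python
import sys
import time


def time_sequential_read(path, chunk_size=1024 * 1024):
    """Read the file at `path` from start to end in fixed-size chunks.

    Returns (bytes_read, elapsed_seconds). The data itself is discarded;
    only the size and the wall-clock time are kept.
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total, elapsed


def throughput_mb_per_s(total_bytes, elapsed_seconds):
    """Convert a byte count and a duration into MiB per second."""
    if elapsed_seconds <= 0:
        return float("inf")
    return total_bytes / (1024 * 1024) / elapsed_seconds


if __name__ == "__main__":
    # Usage (paths are placeholders you supply yourself):
    #   python read_timing.py <file-on-fuse-mount> <local-copy-of-same-file>
    for path in sys.argv[1:]:
        size, secs = time_sequential_read(path)
        rate = throughput_mb_per_s(size, secs)
        print(f"{path}: {size} bytes in {secs:.3f} s ({rate:.1f} MiB/s)")
```

Two caveats for a fair comparison: the time taken by scp itself has to be added to the local-copy number, and a file that was just copied may be served from the operating system's page cache, which makes the local read look faster than it would be from disk.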
