PySpark - Classification, Regression and Clustering Job
This week, I was able to run Logistic Regression, Linear Regression and k-means on data streamed into PySpark through HPCCFuseJ. The code can be found here.
Note:
- PySpark + HPCCFuseJ currently treats an HPCC file as a local file.
- The implementation has only been used on simple flat Thor files. It has not been tested on more complicated datasets, such as those with child datasets or nested records.
- The performance of HPCCFuseJ has not been assessed yet. A good next step would be to measure it and compare it against simply scp'ing the file to the local machine.
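One rough way to start on the performance comparison mentioned in the last point is to time a plain sequential read of the same file through both paths. The sketch below is only an illustration and is not part of the linked code: it uses the Python standard library alone, the function names `read_throughput_mb_s` and `compare_paths` are made up for this post, and the paths in the usage comment are placeholders for wherever the HPCCFuseJ mount and the scp'd copy actually live.

```python
import time


def read_throughput_mb_s(path, chunk_size=1 << 20):
    """Read the file at `path` sequentially and return throughput in MB/s."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small files.
    return (total_bytes / 1e6) / max(elapsed, 1e-9)


def compare_paths(paths, repeats=3):
    """Return {label: best observed throughput in MB/s} for each labelled path."""
    results = {}
    for label, path in paths.items():
        results[label] = max(read_throughput_mb_s(path) for _ in range(repeats))
    return results


# Hypothetical usage; both paths are placeholders, not real locations:
# print(compare_paths({
#     "fuse": "/path/to/hpccfusej/mount/some_file",
#     "local": "/path/to/scp/copy/some_file",
# }))
```

This only measures raw read speed, not the time PySpark takes to parse and train on the data. Repeated reads may also be served from the operating system's page cache, so the first run for each path is probably the more honest number for the local copy.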