How many iterations are you doing on the data? Like Jörn said, you don't necessarily need a billion samples for linear regression.

On Tue, Aug 22, 2017 at 6:28 PM, Sea aj <saj3saj@gmail.com> wrote:
Jörn,

My question is not about the model type, but about whether Spark can reuse an already trained ML model when training a new one.




On Tue, Aug 22, 2017 at 1:13 PM, Jörn Franke <jornfranke@gmail.com> wrote:
Is it really necessary to have one billion samples for just linear regression? Your model would probably do equally well with far fewer samples. Have you checked the bias and variance when you train on a much smaller random sample?
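One way to run such a check is sketched below in plain Python rather than Spark, on synthetic data, purely as an illustration. It fits a one-feature least-squares line on random samples of increasing size and compares the training error with the error on a held-out set. The helper names (fit_line, mse, make_point), the sample sizes, and the data-generating line are all assumptions made for this example, not anything from the thread.

```python
import random


def fit_line(points):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, points):
    """Mean squared error of a (slope, intercept) model on the given points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)


rng = random.Random(0)  # fixed seed so the run is repeatable


def make_point():
    """Synthetic data: y = 3x + 2 plus Gaussian noise (assumed for the demo)."""
    x = rng.uniform(0.0, 10.0)
    return x, 3.0 * x + 2.0 + rng.gauss(0.0, 0.5)


train_pool = [make_point() for _ in range(20000)]
held_out = [make_point() for _ in range(5000)]

# Map sample size -> (training MSE, held-out MSE).
results = {}
for size in (100, 1000, 20000):
    sample = rng.sample(train_pool, size)
    model = fit_line(sample)
    results[size] = (mse(model, sample), mse(model, held_out))
    print(size, results[size])
```

If the held-out error stops improving well before the full data size, and stays close to the training error, the extra rows are adding little for a model this simple.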

On 22. Aug 2017, at 12:58, Sea aj <saj3saj@gmail.com> wrote:

I have a large DataFrame of 1 billion rows of type LabeledPoint. I tried to train a linear regression model on it, but training failed for lack of memory even though I'm using 9 slaves, each with 100 GB of RAM and 16 CPU cores.

I decided to split my data into multiple chunks and train the model in multiple phases, but I learned that the linear regression model in the ml library does not have a "setInitialModel" function that would let me pass the model trained on one chunk on to the remaining chunks. In other words, each time I call the fit function on a chunk of my data, it overwrites the previous model.
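The idea of carrying a model from one chunk to the next can be sketched in plain Python, independent of Spark. This is only an illustration of the warm-start technique using batch gradient descent. The names fit_chunk and fit_in_chunks, the learning rate, the epoch count, and the toy data are assumptions made for this example and are not part of any Spark API.

```python
def fit_chunk(chunk, weights, bias, lr=0.5, epochs=200):
    """Run batch gradient descent on one chunk, starting from the given model."""
    n = len(chunk)
    w = list(weights)
    b = bias
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in chunk:
            # Prediction error for this row under the current model.
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def fit_in_chunks(chunks, n_features, lr=0.5, epochs=200):
    """Train on each chunk in turn, reusing the previous chunk's model."""
    w, b = [0.0] * n_features, 0.0
    for chunk in chunks:
        # Warm start: the model from the previous chunk is the initial model here.
        w, b = fit_chunk(chunk, w, b, lr=lr, epochs=epochs)
    return w, b


# Toy, noise-free data on the line y = 2x + 1, split into 4 interleaved chunks.
rows = [([i / 200.0], 2.0 * (i / 200.0) + 1.0) for i in range(200)]
chunks = [rows[k::4] for k in range(4)]

w, b = fit_in_chunks(chunks, n_features=1)
print(w, b)
```

With noisy data, a single pass like this lets the later chunks influence the final model more than the earlier ones, so several passes over the chunks, or a smaller learning rate, may be needed.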

So far, the only solution I have found is to use Spark Streaming to split the data into multiple DataFrames and then train on each one individually to overcome the memory issue.

Do you know if there's any other solution?




On Mon, Jul 10, 2017 at 7:57 AM, Jayant Shekhar <jayantbayarea@gmail.com> wrote:
Hello Mahesh,

We have built one. You can download it from here: https://www.sparkflows.io/download

Feel free to ping me with any questions.

Best Regards,
Jayant


On Sun, Jul 9, 2017 at 9:35 PM, Mahesh Sawaiker <mahesh_sawaiker@persistent.com> wrote:

Hi,


1) Is anyone aware of a workbench-style tool for running ML jobs in Spark? Specifically, the tool could be something like a web application that is configured to connect to a Spark cluster.


The user would be able to select input training sets, probably from HDFS, train a model, and then run predictions, without having to write any Scala code.


2) If there is no such tool, is there value in having one, and what would the challenges be?


Thanks,

Mahesh


--
Cheers!