Hi Niketan,

Thanks a lot again for the detailed clarification and example. I suggest mentioning explicitly in the documentation that X and y can be passed as a DataFrame/RDD in the case of Spark; it is not very clear from the documentation. Right now the documentation gives the impression that a Hadoop cluster is needed to execute this, whereas I'm looking for an end-to-end execution of SystemML using only Spark (without using Hadoop at all).

Next, the questions I have are:

a) How do I get back B after I execute GLM in Spark (ml.execute())? I need to use it as an input to GLM-predict to apply the model, and I don't want to incur additional I/O. Can I use something like ml.get() which will return B in matrix form?

b) What is the use of the parameter cmdLineParams? If I am anyway supplying X and y, the mandatory parameters, why do I need to pass this parameter again?

Regards,
Sourav

On Mon, Dec 7, 2015 at 11:11 PM, Niketan Pansare wrote:

> Hi Sourav,
>
> Your understanding is correct: X and Y can be supplied either as a file or
> as an RDD/DataFrame. Each of these two mechanisms has its own benefits. The
> former mechanism (i.e. passing a file) pushes the reading/reblocking into
> the optimizer, while the latter mechanism allows for preprocessing of the
> data (for example, using Spark SQL).
>
> Two use cases where X and Y are supplied as files on HDFS:
>
> 1. Command-line invocation:
>
> $SPARK_HOME/bin/spark-submit --master .... SystemML.jar -f GLM.dml
>   -nvargs dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.001 tol=0.00000001
>   disp=1.0 moi=100 mii=10 X=INPUT_DIR/X Y=INPUT_DIR/Y
>   B=OUTPUT_DIR/betas fmt=csv O=OUTPUT_DIR/stats Log=OUTPUT_DIR/log
>
> 2. Using MLContext, but without registering X and Y as inputs. Instead we
> pass the file names as command-line parameters:
>
> val ml = new MLContext(sc)
> val cmdLineParams = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y",
>   "dfam" -> "2", "link" -> "2", ...)
> ml.execute("GLM.dml", cmdLineParams)
>
> As mentioned earlier, X and Y can be provided as an RDD/DataFrame as well:
>
> val ml = new MLContext(sc)
> ml.registerInput("X", xDF)
> ml.registerInput("Y", yDF)
> val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ",
>   "dfam" -> "2", "link" -> "2", ...)
> ml.execute("GLM.dml", cmdLineParams)
>
> One important thing that I must point out is the concept of "ifdef". It is
> explained in the section
> http://apache.github.io/incubator-systemml/dml-language-reference.html#command-line-arguments.
>
> Here is a snippet from the DML script for GLM
> (https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml):
>
> fileX = $X;
> fileY = $Y;
> fileO = ifdef ($O, " ");
> fmtB = ifdef ($fmt, "text");
> distribution_type = ifdef ($dfam, 1);
>
> The above DML code essentially says that $X and $Y are required parameters (a
> design decision the GLM script writer made), whereas $fmt and $dfam are
> optional, as they are assigned default values when not explicitly provided.
> Both constructs are important tools in the arsenal of the DML script writer.
> By not guarding a dollar parameter with ifdef, the DML script writer ensures
> that the user has to provide its value (in this case, the file names for X
> and Y). This is why you will notice that I have provided a space for X, Y
> and B in the second MLContext snippet.
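>
> If you also register B as an output before calling execute(), you can read
> the fitted betas back from the returned MLOutput without writing them to
> disk. Here is a rough sketch based on the MLContext programming guide; the
> DML variable name registered below ("beta_out") and the sqlContext handle
> are assumptions to check against GLM.dml and your own driver code:
>
> val ml = new MLContext(sc)
> ml.registerInput("X", xDF)
> ml.registerInput("Y", yDF)
> // Register the DML variable that GLM.dml writes to $B so it is kept in memory
> ml.registerOutput("beta_out")
> val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ",
>   "dfam" -> "2", "link" -> "2")
> val outputs = ml.execute("GLM.dml", cmdLineParams)
> // Read the betas back as a DataFrame (getBinaryBlockedRDD gives the block form)
> val bDF = outputs.getDF(sqlContext, "beta_out")
>
> The same DataFrame can then be registered as an input to GLM-predict.dml
> without going through HDFS.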
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Sourav Mazumder
> To: dev@systemml.incubator.apache.org
> Date: 12/07/2015 07:30 PM
> Subject: Using GLM with Spark
> ------------------------------
>
> Hi,
>
> Trying to use GLM with Spark.
>
> I went through the documentation of the same at
> http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
> and I see that inputs like X and Y have to be supplied using a file, and
> the file has to be in HDFS.
>
> Is this understanding correct? Can't X and Y be supplied using a DataFrame
> from a Spark context (as in the LinearRegression example in
> http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html#train-using-systemml-linear-regression-algorithm)?
>
> Regards,
> Sourav