systemml-dev mailing list archives

From "Niketan Pansare" <>
Subject Re: Using GLM with Spark
Date Tue, 08 Dec 2015 07:11:12 GMT

Hi Sourav,

Your understanding is correct: X and Y can be supplied either as a file or
as an RDD/DataFrame. Each of these two mechanisms has its own benefits. The
former mechanism (i.e. passing as file) pushes the reading/reblocking into
the optimizer, while the latter mechanism allows for preprocessing of data
(for example: using Spark SQL).

Two use-cases when X and Y are supplied as files on HDFS:
1. Command-line invocation: $SPARK_HOME/bin/spark-submit --master ....
SystemML.jar -f GLM.dml -nvargs dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.001
tol=0.00000001 disp=1.0 moi=100 mii=10 X=INPUT_DIR/X Y=INPUT_DIR/Y
B=OUTPUT_DIR/betas fmt=csv O=OUTPUT_DIR/stats Log=OUTPUT_DIR/log

2. Using MLContext but without registering X and Y as input. Instead we
pass filenames as command-line parameters:
> val ml = new MLContext(sc)
> val cmdLineParams = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y", "dfam"
-> "2", "link" -> "2", ...)
> ml.execute("GLM.dml", cmdLineParams)

As mentioned earlier, X and Y can be provided as RDD/DataFrame as well.
> val ml = new MLContext(sc)
> ml.registerInput("X", xDF)
> ml.registerInput("Y", yDF)
> val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ", "dfam" -> "2",
"link" -> "2", ...)
> ml.execute("GLM.dml", cmdLineParams)

One important thing that I must point out is the concept of "ifdef". It is
explained in the section
Here is a snippet from the DML script for GLM:
fileX = $X;
fileY = $Y;
fileO = ifdef ($O, " ");
fmtB = ifdef ($fmt, "text");
distribution_type = ifdef ($dfam, 1);

The above DML code essentially says that $X and $Y are required parameters (a
design decision that the GLM script writer made), whereas $O, $fmt and $dfam
are optional, as they are assigned default values when not explicitly provided.
Both of these constructs are important tools in the arsenal of a DML script
writer. By not guarding a dollar parameter with ifdef, the DML script
writer ensures that the user has to provide its value (in this case, file
names for X and Y). This is why I have provided a single space for X, Y and B
in the second MLContext snippet.
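
To make the required-versus-optional distinction concrete, here is a rough
sketch in plain Scala of how the two constructs behave. The helpers
`required` and `ifdef` below are hypothetical stand-ins written only for
illustration; they are not part of SystemML or MLContext, and the real
resolution happens inside the DML compiler.

```scala
// Hypothetical stand-ins for DML's `$name` and `ifdef($name, default)`.
// Illustrative only: these helpers are not part of SystemML.
object IfdefSketch {
  // Unguarded `$name`: the caller must supply a value, otherwise we fail.
  def required(params: Map[String, String], name: String): String =
    params.getOrElse(name,
      throw new IllegalArgumentException(s"missing required parameter: $name"))

  // `ifdef($name, default)`: fall back to the default when the name is absent.
  def ifdef(params: Map[String, String], name: String, default: String): String =
    params.getOrElse(name, default)

  def main(args: Array[String]): Unit = {
    // X, Y and B are registered separately, so a single space is a placeholder.
    val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ", "dfam" -> "2")

    val fileX = required(cmdLineParams, "X")       // present, so this succeeds
    val fmtB  = ifdef(cmdLineParams, "fmt", "text") // absent, default is used
    val dfam  = ifdef(cmdLineParams, "dfam", "1")   // present, default is ignored
    println(s"fileX='$fileX' fmtB=$fmtB dfam=$dfam")

    // Dropping the placeholder for X makes the required lookup fail.
    val rejected = scala.util.Try(required(cmdLineParams - "X", "X")).isFailure
    println(s"missing X rejected: $rejected")
  }
}
```

The placeholder value itself carries no meaning in this sketch; it only has
to be present so that the unguarded lookup does not fail.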


Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At

From:	Sourav Mazumder <>
Date:	12/07/2015 07:30 PM
Subject:	Using GLM with Spark


Trying to use GLM with Spark.

I went through the documentation for it in

I see that inputs like X and Y have to be supplied as files, and the files
have to be in HDFS.

Is this understanding correct? Can't X and Y be supplied using a
DataFrame from a Spark context (as in the example of LinearRegression in

