systemml-dev mailing list archives

From "Niketan Pansare" <>
Subject Re: Using GLM with Spark
Date Tue, 08 Dec 2015 23:02:43 GMT

Hi Sourav,

I guess you found the answer to question (a) based on the recent email.

>> b) What is the use of the parameter cmdLineParams? If I am anyway
supplying X and y, the mandatory parameters, why do I need to pass this
parameter again?
Good question. While implementing MLContext, the key requirement was that
the DML script should remain the same irrespective of the invocation
mechanism or backend (i.e. MLContext, command line using Spark/Hadoop,
standalone mode, spark-shell, Jupyter, pyspark, etc.). This meant that we
had to provide at least two (or three, if you count JMLC) mechanisms for
supplying input matrices:
1. File name
2. RDD/DataFrame

Consider the following piece of DML code:
foo = read($bar)
fileFoo = $bar
foo = read(fileFoo)

Here, you can either call registerInput("foo", RDD) or registerInput("bar",
RDD). We decided to go with the former approach (I will skip the reasons
for now). To remain consistent with the semantics of dollar parameters, we
ought to throw an error if no value is provided for $bar; hence it needs to
be provided. I understand that in the above case we could avoid this,
because we know which variables are registered. But I think special-casing
such situations is a bad idea, as it can break the language semantics in
corner cases like the following:
fileFoo = $bar + ".bak"
foo = read(fileFoo)
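For completeness: if the script writer wants a dollar parameter to be
optional, the uniform way to do that is an ifdef guard inside the script,
as GLM itself does for its optional parameters. A minimal DML sketch (the
default file name "foo.csv" is purely illustrative):

fileFoo = ifdef ($bar, "foo.csv")
foo = read(fileFoo)

With such a guard, callers (including MLContext users) no longer have to
supply a value for $bar; without it, $bar stays required under the normal
dollar-parameter semantics.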


Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At

From:	Sourav Mazumder <>
Date:	12/08/2015 07:11 AM
Subject:	Re: Using GLM with Spark

Hi Niketan,

Thanks a lot again for detailed clarification and example.

I suggest mentioning explicitly in the documentation that X and y can be
passed as a DataFrame/RDD in the case of Spark. It is not very clear from
the documentation. Right now the documentation sort of gives the impression
that a Hadoop cluster is needed for execution, whereas I'm looking for an
end-to-end execution of SystemML using only Spark (without using Hadoop at
all).

Next, the questions I have are:

a) How do I get back B after I execute GLM in Spark (ml.execute())? I need
to use it as an input to GLM-predict for applying the model, and I don't
want to incur additional I/O. Can I use something like ml.get() that will
return B in matrix form?

b) What is the use of the parameter cmdLineParams? If I am anyway
supplying X and y, the mandatory parameters, why do I need to pass this
parameter again?


On Mon, Dec 7, 2015 at 11:11 PM, Niketan Pansare <>

> Hi Sourav,
> Your understanding is correct, X and Y can be supplied either as a file or
> as an RDD/DataFrame. Each of these two mechanisms has its own benefits. The
> former mechanism (i.e. passing a file) pushes the reading/reblocking into
> the optimizer, while the latter mechanism allows for preprocessing of the
> data (for example: using Spark SQL).
> Two use-cases when X and Y are supplied as files on HDFS:
> 1. Command-line invocation: $SPARK_HOME/bin/spark-submit --master ....
> SystemML.jar -f GLM.dml -nvargs dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.001
> tol=0.00000001 disp=1.0 moi=100 mii=10 X=INPUT_DIR/X Y=INPUT_DIR/Y
> B=OUTPUT_DIR/betas fmt=csv O=OUTPUT_DIR/stats Log=OUTPUT_DIR/log
> 2. Using MLContext but without registering X and Y as input. Instead we
> pass filenames as command-line parameters:
> > val ml = new MLContext(sc)
> > val cmdLineParams = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y",
> > "dfam" -> "2", "link" -> "2", ...)
> > ml.execute("GLM.dml", cmdLineParams)
> As mentioned earlier, X and Y can be provided as RDD/DataFrame as well.
> > val ml = new MLContext(sc)
> > ml.registerInput("X", xDF)
> > ml.registerInput("Y", yDF)
> > val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ", "dfam" ->
> > "2", "link" -> "2", ...)
> > ml.execute("GLM.dml", cmdLineParams)
> One important thing that I must point out is the concept of "ifdef". It
> is explained in the section
> Here is a snippet from the DML script for GLM:

> fileX = $X;
> fileY = $Y;
> fileO = ifdef ($O, " ");
> fmtB = ifdef ($fmt, "text");
> distribution_type = ifdef ($dfam, 1);
> The above DML code essentially says that $X and $Y are required parameters
> (a design decision that the GLM script writer made), whereas $fmt and $dfam
> are optional, as they are assigned default values when not explicitly
> provided. Both constructs are important tools in the arsenal of a DML
> script writer. By not guarding a dollar parameter with ifdef, the DML
> script writer ensures that the user has to provide its value (in this case,
> file names for X and Y). This is why you will notice that I have provided a
> space for X, Y and B in the second MLContext snippet.
> Thanks,
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At
> From: Sourav Mazumder <>
> To:
> Date: 12/07/2015 07:30 PM
> Subject: Using GLM with Spark
> ------------------------------
> Hi,
> Trying to use GLM with Spark.
> I have gone through its documentation in

> I see that inputs like X and Y have to be supplied using a file, and the
> file has to be in HDFS.
> Is this understanding correct? Can't X and Y be supplied using a Data
> Frame from a Spark context (as in the case of the LinearRegression example
> in
> )?
> Regards,
> Sourav
