Hi Sourav,

A couple of questions to make sure we are on the same page: does the "dependent variable (double)" represent the class labels? Are the values of the class labels from 1 to numClasses (i.e. one-based)?
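
To illustrate why the one-based question matters, here is a minimal sketch in plain Scala (not SystemML code). It mimics what `rowIndexMax` is described as doing later in this thread: for each row of the probability matrix it returns the one-based column index of the largest entry, so the result only lines up with the labels if they run from 1 to numClasses. The object name `RowIndexMaxSketch` and the helper `labelsAreOneBased` are hypothetical, and picking the first maximum on ties is an assumption of this sketch, not a statement about SystemML.

```scala
object RowIndexMaxSketch {
  // One-based index of the largest entry in each (non-empty) row.
  // Assumption of this sketch: ties resolve to the first maximum.
  def rowIndexMax(prob: Seq[Seq[Double]]): Seq[Int] =
    prob.map(row => row.indexOf(row.max) + 1)

  // Hypothetical helper: true if every label is an integer in 1..numClasses.
  def labelsAreOneBased(labels: Seq[Double], numClasses: Int): Boolean =
    labels.forall(l => l == math.rint(l) && l >= 1.0 && l <= numClasses.toDouble)
}
```

With labels 0..numClasses-1 the helper returns false, which would be the cue to shift the labels by one before training.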

Here are a few comments regarding correlating IDs:

To convert an unordered collection (i.e. a DataFrame) to an ordered collection (a "Matrix"), we add a special column "ID" which represents the __one-based row index__. Please perform the following steps:

1. Accept the recent changes from https://github.com/apache/incubator-systemml and use the generated jar.

2. Map the unique id in DF1 to an int (__1 to number of rows__) and call that column 'ID'.

3. Use the following variant of registerInput for both X (both for training and predicting) and Y:

registerInput(String varName, DataFrame df, boolean containsID)
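
Step 2, and the later join of predictions back to the original unique IDs, can be sketched as follows. This is a minimal sketch using plain Scala collections rather than Spark DataFrames; the object name `IdCorrelationSketch` and both method names are hypothetical. In Spark the same idea could be expressed with something like `RDD.zipWithIndex` followed by a join on the 'ID' column, but that variant is not shown here.

```scala
object IdCorrelationSketch {
  // Assign one-based row indices (1..n) to the unique string IDs, in input order.
  def assignRowIds(uniqueIds: Seq[String]): Map[Int, String] =
    uniqueIds.zipWithIndex.map { case (uid, i) => (i + 1) -> uid }.toMap

  // Join predictions keyed by row ID back to the original unique IDs.
  // Predictions whose row ID is unknown are dropped.
  def correlate(rowIdToUid: Map[Int, String],
                predictions: Seq[(Int, Double)]): Seq[(String, Double)] =
    predictions.flatMap { case (rowId, pred) =>
      rowIdToUid.get(rowId).map(uid => (uid, pred))
    }
}
```

Because the join is keyed on the row ID, it does not matter in which order the predictions come back.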

As a side note: instead of separate double columns, you can represent them using VectorUDT and use our converter:

JavaPairRDD<MatrixIndexes, MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame inputDF, MatrixCharacteristics mcOut, boolean containsID, String vectorColumnName)

Thanks,

Niketan Pansare

IBM Almaden Research Center

E-mail: npansar At us.ibm.com

http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar


From: Sourav Mazumder <sourav.mazumder00@gmail.com>

To: dev@systemml.incubator.apache.org

Date: 12/09/2015 11:15 AM

Subject: Re: Using GLM-predict

Hi Niketan,

The code you provided works fine. The use of getMatrixCharacteristics solves the basic execution problem.

However, question #3 is probably not yet resolved. Let me explain the use case scenario I'm trying to build.

1. Say I have a data frame (DF1) with a Unique Id (string), a bunch of columns (say 4) which are to be used as features (double), and a column for the dependent variable (double).

2. When I created the model, I created a data frame (DF2) from DF1 using only the feature vectors and passed that as X. The column with the dependent value is passed as Y.

3. For calling GLM-predict I'm using another data frame (DF3) of the same structure but with different Unique IDs (essentially different records/rows). From that data frame I'm first creating another data frame (DF4) containing the columns representing the features. Then I'm sending DF4, which has only feature vectors, to GLM-predict.

4. The response I get from GLM-predict is the 'means'. Then I'm using the inline predict script, which returns another data frame (DF5) with ID and predicted values.

The question is: how do I correlate the ID I'm getting from DF5 with the Unique ID of the data frame DF3?

Regards,

Sourav

On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <npansar@us.ibm.com> wrote:

> Hi Sourav,

>

> 1. In the GLM-predict.dml I could see 'means' is the output variable. In my

> understanding it is same as the probability matrix u have mentioned in your

> mail (to be used to compute the prediction). Am I right ?

> Yes, that's correct.

>

> 2. From GLM.dml I get the 'betas' as output using

> outputs.getBinaryBlockedRDD("beta_out"). The same I pass to GLM-predict.dml

> as B.

>

> Can you try this ?

> // Get output from GLM

> val beta = outputs.getBinaryBlockedRDD("beta_out")

> val betaMC = outputs.getMatrixCharacteristics("beta_out") // This way you

> don't have to worry about dimensions.

> // -----------------------------------------

> val Xin = DataFrame/RDD of values (or even text/csv file) you want to

> predict

> // -----------------------------------------

> // Execute GLM-predict

> ml.reset()

> // Please read

>

> // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial

> val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...") //

> ml.registerInput("X", Xin)

> ml.registerInput("B_full", beta, betaMC)

> ml.registerOutput("means")

> val outputsPredict =

> ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",

> cmdLineParamsPredict)

> val prob = outputsPredict.getBinaryBlockedRDD("means");

> val probMC = outputsPredict.getMatrixCharacteristics("means");

> // -----------------------------------------

> ml.reset()

> ml.registerInput("Prob", prob, probMC)

> ml.registerOutput("Prediction")

> val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "

> + "Prediction = rowIndexMax(Prob); "

> + "write(Prediction, \"tempOut\", \"csv\")")

> val pred = outputsLabels.getDF(sqlContext,

> "Prediction").withColumnRenamed("C1", "prediction")

> // -----------------------------------------

>

>

> 3. Say I get back prediction matrix as an output (from predictions =

> rowIndexMax(means);). Now can I add that as a column to my original

> data frame (the one from which I created the feature vector for the

> original model) ? My concern is whether adding back will ensure the right

> order so that the key for the feature vector and the predicted value remain

> same ? If not how to achieve the same ?

> In the above example 'pred' is a DataFrame with column 'ID' which provides the

> row ID.

>

> Thanks,

>

> Niketan Pansare

> IBM Almaden Research Center

> E-mail: npansar At us.ibm.com

>

>


>

> From: Sourav Mazumder <sourav.mazumder00@gmail.com>

> To: dev@systemml.incubator.apache.org, Niketan Pansare/Almaden/IBM@IBMUS

> Date: 12/08/2015 10:53 PM

> Subject: Re: Using GLM-predict

> ------------------------------

>

>

>

> Hi Niketan,

>

> Thanks again for the detailed inputs.

>

> Some more follow up Qs -

>

> 1. In the GLM-predict.dml I could see 'means' is the output variable. In my

> understanding it is same as the probability matrix u have mentioned in your

> mail (to be used to compute the prediction). Am I right ?

>

> 2. From GLM.dml I get the 'betas' as output using

> outputs.getBinaryBlockedRDD("beta_out"). The same I pass to GLM-predict.dml

> val beta = outputs.getBinaryBlockedRDD("beta_out")

> ml.registerInput("B", beta, 1, 4) // I have four feature vectors so I get 4

> coefficients

>

> However, when I execute GLM-predict.dml I get the following error.

>

> val outputs =

> ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",

> cmdLineParams)

>

> 15/12/09 05:32:47 WARN Expression: Metadata file: .mtd not provided

> 15/12/09 05:32:47 ERROR Expression: ERROR:

> /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete

> dimension information in read statement: .mtd

> com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR:

> /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 --

> Missing or incomplete dimension information in read statement: .mtd

>

> In line 117 we have the following statement: X = read (fileX);

>

> 3. Say I get back prediction matrix as an output (from predictions =

> rowIndexMax(means);). Now can I add that as a column to my original

> data frame (the one from which I created the feature vector for the

> original model) ? My concern is whether adding back will ensure the right

> order so that the key for the feature vector and the predicted value remain

> same ? If not how to achieve the same ?

>

> Regards,

> Sourav

>

>

>

>

> On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <npansar@us.ibm.com>

> wrote:

>

> > Hi Sourav,

> >

> > For some reason, I didn't get your email on "Tue, 08 Dec 2015 12:56:38

> > -0800"

> > (which I noticed in the archive).

> >

> > >> Not sure how exactly I can modify the GLM-predict.dml to get some

> > prediction to start with.

> > There are two options here:

> > 1. Modify GLM-predict.dml as suggested by Shirish (better approach with

> > respect to the SystemML optimizer) or

> >

> > 2. Run a new script on the output of GLM-predict. Please see:

> >

>

> > If you choose to go with option 2, you might also want to read the

> > documentation of the following two built-in functions:

> > a. rowIndexMax (See

> >

>

>

> >

> > )

> > b. ppred

> >

> > >> Can you give me some idea how from here I can calculate the predicted

> > value of the label using some value of probability threshold ?

> > Very simple way to predict the label given probability matrix:

> > Prediction = rowIndexMax(Prob) # predicts the label with highest

> > probability. This assumes one-based labels.

> >

> > Thanks,

> >

> > Niketan Pansare

> > IBM Almaden Research Center

> > E-mail: npansar At us.ibm.com

> >

> >


> >

> > From: Shirish Tatikonda <shirish.tatikonda@gmail.com>

> > To: dev@systemml.incubator.apache.org

> > Date: 12/08/2015 12:49 PM

> > Subject: Re: Using GLM-predict

> > ------------------------------

> >

> >

> >

> > Hi Sourav,

> >

> > Yes, GLM-predict.dml gives out only the probabilities. You can put a

> > threshold on the resulting probabilities to get the actual class labels --

> > for example, prob > 0.5 is positive and <= 0.5 is negative.

> >

> > The exact value of threshold typically depends on the data and the

> > application. Different thresholds yield different classifiers with

> > different performance (precision, recall, etc.). You can find the best

> > threshold for the given data set by finding a value that gives the desired

> > classifier performance (for example, a threshold that gives roughly equal

> > precision and recall). Such an optimization is obviously done during the

> > training phase using a held out test set.

> >

> > If you wish, you can also modify the DML script to perform this entire

> > process.

> >

> > Shirish

> >

> >

> > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <

> > sourav.mazumder00@gmail.com> wrote:

> >

> > > Hi,

> > >

> > > I have used GLM.dml to create a model using some sample data. It returns to

> > > me the matrix of Beta, B.

> > >

> > > Now I want to use this matrix of Beta on a new set of data points and

> > > generate predicted value of the dependent variable/observation.

> > >

> > > When I checked GLM-predict, I could see that one can pass feature vector

> > > for the new data set and also the matrix of beta.

> > >

> > > But I could not see any way to get the predicted value of the dependent

> > > variable/observation. The output parameter only supports matrix of

> > > predicted means/probabilities.

> > >

> > > Is there a way one can get the predicted value of the dependent

> > > variable/observation from GLM-predict ?

> > >

> > > Regards,

> > > Sourav

> > >

> >

> >

> >

>

>

>
