Hi Sourav,
> 1. In the GLMpredict.dml I could see 'means' is the output variable. In my
> understanding it is the same as the probability matrix you mentioned in your
> mail (to be used to compute the prediction). Am I right?
Yes, that's correct.
> 2. From GLM.dml I get the 'betas' as output using
> outputs.getBinaryBlockedRDD("beta_out"). I pass the same to GLMpredict.dml
> as B.
Can you try this?
// Get output from GLM
val beta = outputs.getBinaryBlockedRDD("beta_out")
// This way you don't have to worry about dimensions.
val betaMC = outputs.getMatrixCharacteristics("beta_out")
//
// Xin: the DataFrame/RDD of values (or even a text/csv file) you want to predict
val Xin = ...
//
// Execute GLMpredict
ml.reset()
// Please read
// https://github.com/apache/incubatorsystemml/blob/master/scripts/algorithms/GLM.dml
// dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
// Which family of distribution?
val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...")
ml.registerInput("X", Xin)
ml.registerInput("B_full", beta, betaMC)
ml.registerOutput("means")
val outputsPredict = ml.execute(
  "/home/systemml0.9.0SNAPSHOT/algorithms/GLMpredict.dml", cmdLineParamsPredict)
// Read the results from outputsPredict, the value returned by ml.execute above.
val prob = outputsPredict.getBinaryBlockedRDD("means")
val probMC = outputsPredict.getMatrixCharacteristics("means")
// 
// Get predicted label
ml.reset()
ml.registerInput("Prob",prob, probMC)
ml.registerOutput("Prediction")
val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
  + "Prediction = rowIndexMax(Prob); "
  + "write(Prediction, \"tempOut\", \"csv\")")
val pred = outputsLabels.getDF(sqlContext, "Prediction").withColumnRenamed("C1", "prediction")
// 
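To make the rowIndexMax step concrete, here is a small plain-Scala sketch (no Spark or SystemML involved) of what that call computes on a probability matrix: for each row, the one-based index of the largest entry. The object and function names are made up for illustration. Ties are broken here by taking the first maximum, which may not match how SystemML breaks ties. The second function sketches the thresholding alternative for the binary case discussed further down the thread. It assumes the first column holds the probability of the positive class and that labels are coded 1 = positive, 2 = negative; both are assumptions, not something the scripts guarantee.

```scala
// Illustration only: plain Scala collections standing in for the 'means' matrix.
object RowIndexMaxSketch {
  // For each (non-empty) row, return the one-based index of the largest entry.
  // On ties this picks the first maximum; SystemML may behave differently.
  def rowIndexMax(prob: Array[Array[Double]]): Array[Int] =
    prob.map(row => row.indexOf(row.max) + 1)

  // Binary case: label 1 when the first-column probability exceeds the
  // threshold, otherwise label 2 (assumed label coding).
  def thresholdLabels(prob: Array[Array[Double]], threshold: Double): Array[Int] =
    prob.map(row => if (row(0) > threshold) 1 else 2)
}
```

For example, a row (0.1, 0.7, 0.2) gets label 2 from rowIndexMax, because the largest probability sits in the second column.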
> 3. Say I get back the prediction matrix as an output (from predictions =
> rowIndexMax(means);). Now can I add that as a column to my original
> data frame (the one from which I created the feature vector for the
> original model)? My concern is whether adding it back will preserve the
> order, so that the key for the feature vector and the predicted value stay
> aligned. If not, how can I achieve that?
In the above example, 'pred' is a DataFrame with a column 'ID' that provides
the row ID, which you can use to match each prediction to its input row.
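As a rough illustration of matching by row ID instead of by position, here is a plain-Scala sketch using ordinary collections instead of DataFrames. The record types and the attach function are hypothetical, and it assumes the original rows carry the same ID that the prediction output uses. With DataFrames the analogous step would be a join on the 'ID' column.

```scala
// Illustration only: hypothetical record types, not part of SystemML or Spark.
object JoinByRowId {
  final case class Features(id: Long, values: Vector[Double])
  final case class Predicted(id: Long, label: Int)

  // Attach each prediction to its feature row by matching IDs, never by position.
  // Rows without a matching prediction are dropped (inner-join behaviour).
  def attach(rows: Seq[Features], preds: Seq[Predicted]): Seq[(Features, Int)] = {
    val labelById: Map[Long, Int] = preds.map(p => p.id -> p.label).toMap
    rows.flatMap(r => labelById.get(r.id).map(label => (r, label)))
  }
}
```

Because the lookup is keyed on the ID, the order in which predictions come back does not matter.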
Thanks,
Niketan Pansare
IBM Almaden Research Center
Email: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=usnpansar
From: Sourav Mazumder <sourav.mazumder00@gmail.com>
To: dev@systemml.incubator.apache.org, Niketan
Pansare/Almaden/IBM@IBMUS
Date: 12/08/2015 10:53 PM
Subject: Re: Using GLMpredict
Hi Niketan,
Thanks again for the detailed inputs.
Some more follow-up questions:
1. In the GLMpredict.dml I could see 'means' is the output variable. In my
understanding it is the same as the probability matrix you mentioned in your
mail (to be used to compute the prediction). Am I right?
2. From GLM.dml I get the 'betas' as output using
outputs.getBinaryBlockedRDD("beta_out"). I pass the same to GLMpredict.dml
as B. To register B, I use the following statements:
// I have four feature vectors, so I get 4 coefficients
val beta = outputs.getBinaryBlockedRDD("beta_out")
ml.registerInput("B", beta, 1, 4)
However, when I execute GLMpredict.dml I get the following error.
val outputs =
ml.execute("/home/systemml0.9.0SNAPSHOT/algorithms/GLMpredict.dml",
cmdLineParams)
15/12/09 05:32:47 WARN Expression: Metadata file: .mtd not provided
15/12/09 05:32:47 ERROR Expression: ERROR: /home/systemml0.9.0SNAPSHOT/algorithms/GLMpredict.dml  line 117, column 8  Missing or incomplete dimension information in read statement: .mtd
com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR: /home/systemml0.9.0SNAPSHOT/algorithms/GLMpredict.dml  line 117, column 8  Missing or incomplete dimension information in read statement: .mtd
Line 117 contains the following statement: X = read (fileX);
3. Say I get back the prediction matrix as an output (from predictions =
rowIndexMax(means);). Now can I add that as a column to my original
data frame (the one from which I created the feature vector for the
original model)? My concern is whether adding it back will preserve the
order, so that the key for the feature vector and the predicted value stay
aligned. If not, how can I achieve that?
Regards,
Sourav
On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <npansar@us.ibm.com> wrote:
> Hi Sourav,
>
> For some reason, I didn't get your email from "Tue, 08 Dec 2015 12:56:38
> 0800" (which I noticed in the archive):
> https://www.mailarchive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208
>
> >> Not sure how exactly I can modify the GLMpredict.dml to get some
> prediction to start with.
> There are two options here:
> 1. Modify GLMpredict.dml as suggested by Shirish (better approach with
> respect to the SystemML optimizer) or
>
> 2. Run a new script on the output of GLMpredict. Please see:
>
https://github.com/apache/incubatorsystemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> If you choose to go with option 2, you might also want to read the
> documentation of the following two built-in functions:
> a. rowIndexMax (see
> http://apache.github.io/incubatorsystemml/dmllanguagereference.html#matrixandorscalarcomparisonbuiltinfunctions
> )
> b. ppred
>
> >> Can you give me some idea how from here I can calculate the predicted
> value of the label using some value of probability threshold ?
> Very simple way to predict the label given probability matrix:
> Prediction = rowIndexMax(Prob) # predicts the label with the highest
> probability. This assumes one-based labels.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> Email: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=usnpansar
>
>
> From: Shirish Tatikonda <shirish.tatikonda@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/08/2015 12:49 PM
> Subject: Re: Using GLMpredict
> 
>
>
>
> Hi Sourav,
>
> Yes, GLMpredict.dml gives out only the probabilities. You can put a
> threshold on the resulting probabilities to get the actual class labels:
> for example, prob > 0.5 is positive and prob <= 0.5 is negative.
>
> The exact value of the threshold typically depends on the data and the
> application. Different thresholds yield different classifiers with
> different performance (precision, recall, etc.). You can find the best
> threshold for the given data set by finding a value that gives the desired
> classifier performance (for example, a threshold that gives roughly equal
> precision and recall). Such an optimization is obviously done during the
> training phase using a held-out test set.
>
> If you wish, you can also modify the DML script to perform this entire
> process.
>
> Shirish
>
>
> On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> sourav.mazumder00@gmail.com> wrote:
>
> > Hi,
> >
> > I have used GLM.dml to create a model using some sample data. It returns
> > to me the matrix of Beta, B.
> >
> > Now I want to use this matrix of Beta on a new set of data points and
> > generate predicted value of the dependent variable/observation.
> >
> > When I checked GLMpredict, I could see that one can pass the feature vector
> > for the new data set and also the matrix of beta.
> >
> > But I could not see any way to get the predicted value of the dependent
> > variable/observation. The output parameter only supports matrix of
> > predicted means/probabilities.
> >
> > Is there a way one can get the predicted value of the dependent
> > variable/observation from GLMpredict ?
> >
> > Regards,
> > Sourav
> >
>
>
>
