Hi Sourav,

A couple of questions to make sure we are on the same page: does the "dependent variable (double)" represent the class labels? Are the class labels numbered from 1 to numClasses (i.e. one-based)?

Here are a few comments regarding correlating IDs:

To map an unordered collection (i.e. a DataFrame) to an ordered collection (a "Matrix"), we add a special column "ID" that holds the __one-based row index__. Please perform the following steps:

1. Pull the recent changes from https://github.com/apache/incubator-systemml and use the jar generated from them.

2. Map the unique ID in DF1 to an int (__1 to number of rows__) and call that column 'ID'.

3. Use the following variant of registerInput for X (both for training and for predicting) and for Y:

registerInput(String varName, DataFrame df, boolean containsID)

As a side note: instead of separate double columns, you can represent the features using VectorUDT and use our converter "JavaPairRDD<MatrixIndexes, MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame inputDF, MatrixCharacteristics mcOut, boolean containsID, String vectorColumnName)".
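To illustrate step 2 and the later join of predictions back to the original unique IDs, here is a minimal sketch. It deliberately uses plain Scala collections as a stand-in for Spark DataFrames, so it is not SystemML or Spark API code; the names Record, assignIds and correlate are hypothetical and exist only for this example.

```scala
object IdCorrelation {
  // One scoring record, as in DF3: a unique string key plus its feature values.
  final case class Record(uniqueId: String, features: Vector[Double])

  // Step 2: attach a one-based row index (the 'ID' column) to every record.
  def assignIds(records: Seq[Record]): Seq[(Long, Record)] =
    records.zipWithIndex.map { case (r, i) => (i + 1L, r) }

  // Join (ID, prediction) pairs, as in DF5, back to the unique string key.
  // Predictions whose ID is unknown are dropped.
  def correlate(indexed: Seq[(Long, Record)],
                predictions: Seq[(Long, Double)]): Map[String, Double] = {
    val keyById = indexed.map { case (id, r) => id -> r.uniqueId }.toMap
    predictions.collect {
      case (id, p) if keyById.contains(id) => keyById(id) -> p
    }.toMap
  }
}
```

The important point is that the (ID, unique key) pairs are kept around after the feature columns are handed to the script, so the 'ID' column of the prediction output can be joined back regardless of the order in which rows come out. On a real RDD, something like zipWithIndex could play the role of assignIds, but that mapping is an assumption of this sketch rather than part of the steps above.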

Thanks,

Niketan Pansare

IBM Almaden Research Center

E-mail: npansar At us.ibm.com

http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar

From: Sourav Mazumder <sourav.mazumder00@gmail.com>

To: dev@systemml.incubator.apache.org

Date: 12/09/2015 11:15 AM

Subject: Re: Using GLM-predict

The code you provided works fine. The use of getMatrixCharacteristics

solves the basic execution problem.

However, question #3 is probably not yet resolved. Let me explain the use

case scenario I'm trying to build.

1. Say I have a data frame (DF1) with a Unique Id (string), a bunch of

columns (say 4) which are to be used as features (double), and a column for

the dependent variable (double).

2. When I created the model I created a data frame (DF2) from DF1 using

only the feature vectors and passed that as X. The column with the dependent

value was passed as Y.

3. For calling GLM-predict I'm using another data frame (DF3) of the same

structure but with different Unique ID (essentially different

records/rows). From that data frame I'm first creating another data frame

(DF4) containing the columns representing the features. Then I'm sending

DF4 to GLM-predict which has only feature vectors.

4. The response I get from GLM-predict is the 'means'. Then I'm using the

inline predict script which returns another data frame (DF5) with ID and

Predicted values.

The question is how do I correlate the ID I'm getting from DF5 with the

Unique ID of the data frame DF3 ?

Regards,

Sourav

On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <npansar@us.ibm.com> wrote:

> Hi Sourav,

>

> 1. In the GLM-predict.dml I could see 'means' is the output variable. In my

> understanding it is the same as the probability matrix you mentioned in your

> mail (to be used to compute the prediction). Am I right ?

> Yes, that's correct.

>

> 2. From GLM.dml I get the 'betas' as output using

> outputs.getBinaryBlockedRDD("beta_out"). The same I pass to GLM-predict.dml

> as B.

>

> Can you try this ?

> // Get output from GLM
> val beta = outputs.getBinaryBlockedRDD("beta_out")
> // This way you don't have to worry about dimensions.
> val betaMC = outputs.getMatrixCharacteristics("beta_out")
> // -----------------------------------------
> // val Xin = DataFrame/RDD of values (or even a text/csv file) you want to predict
> // -----------------------------------------
> // Execute GLM-predict
> ml.reset()
> // Please read
> // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...") // family of distribution ?
> ml.registerInput("X", Xin)
> ml.registerInput("B_full", beta, betaMC)
> ml.registerOutput("means")
> val outputsPredict = ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml", cmdLineParamsPredict)
> // Read the results from outputsPredict, the value returned by ml.execute above
> val prob = outputsPredict.getBinaryBlockedRDD("means")
> val probMC = outputsPredict.getMatrixCharacteristics("means")
> // -----------------------------------------
> // Get predicted label
> ml.reset()
> ml.registerInput("Prob", prob, probMC)
> ml.registerOutput("Prediction")
> val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
>   + "Prediction = rowIndexMax(Prob); "
>   + "write(Prediction, \"tempOut\", \"csv\")")
> val pred = outputsLabels.getDF(sqlContext, "Prediction").withColumnRenamed("C1", "prediction")
> // -----------------------------------------

> // -----------------------------------------

>

>

> 3. Say I get back prediction matrix as an output (from predictions =

> rowIndexMax(means);). Now can I add that as a column to my original

> data frame (the one from which I created the feature vector for the

> original model) ? My concern is whether adding back will ensure the right

> order so that the key for the feature vector and the predicted value remain

> same ? If not how to achieve the same ?

> In the above example, 'pred' is a DataFrame with a column 'ID' which provides the

> row ID.

>

> Thanks,

>

> Niketan Pansare

> IBM Almaden Research Center

> E-mail: npansar At us.ibm.com

>

>

>

> From: Sourav Mazumder <sourav.mazumder00@gmail.com>

> To: dev@systemml.incubator.apache.org, Niketan Pansare/Almaden/IBM@IBMUS

> Date: 12/08/2015 10:53 PM

> Subject: Re: Using GLM-predict

> ------------------------------

>

>

>

> Hi Niketan,

>

> Thanks again for the detailed inputs.

>

> Some more follow up Qs -

>

> 1. In the GLM-predict.dml I could see 'means' is the output variable. In my

> understanding it is the same as the probability matrix you mentioned in your

> mail (to be used to compute the prediction). Am I right ?

>

> 2. From GLM.dml I get the 'betas' as output using

> outputs.getBinaryBlockedRDD("beta_out"). The same I pass to GLM-predict.dml

> as B. For registering B following statements are used

> val beta = outputs.getBinaryBlockedRDD("beta_out")

> ml.registerInput("B", beta, 1, 4) // I have four feature vectors so I get 4 coefficients

>

> However, when I execute GLM-predict.dml I get following error.

>

> val outputs = ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml", cmdLineParams)

>

> 15/12/09 05:32:47 WARN Expression: Metadata file: .mtd not provided
> 15/12/09 05:32:47 ERROR Expression: ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement: .mtd
> com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement: .mtd

>

> In line 117 we have the following statement: X = read (fileX);

>

> 3. Say I get back prediction matrix as an output (from predictions =

> rowIndexMax(means);). Now can I add that as a column to my original

> data frame (the one from which I created the feature vector for the

> original model) ? My concern is whether adding back will ensure the right

> order so that the key for the feature vector and the predicted value remain

> same ? If not how to achieve the same ?

>

> Regards,

> Sourav

>

>

>

>

>

> On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <npansar@us.ibm.com>

> wrote:

>

> > Hi Sourav,

> >

> > For some reason, I didn't get your email of "Tue, 08 Dec 2015 12:56:38 -0800" (which I noticed in the archive).

> >

> > >> Not sure how exactly I can modify the GLM-predict.dml to get some

> > prediction to start with.

> > There are two options here:

> > 1. Modify GLM-predict.dml as suggested by Shirish (better approach with

> > respect to the SystemML optimizer) or

> >

> > 2. Run a new script on the output of GLM-predict. Please see:

> >

>

> > If you choose to go with option 2, you might also want to read the

> > documentation of the following two built-in functions:

> > a. rowIndexMax

> > b. ppred

> >

> > >> Can you give me some idea how from here I can calculate the predicted

> > value of the label using some value of probability threshold ?

> > A very simple way to predict the label given a probability matrix:

> > Prediction = rowIndexMax(Prob) # predicts the label with highest

> > probability. This assumes one-based labels.
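To make the behavior described above concrete, here is a small plain-Scala sketch of what "one-based index of the row maximum" means. It is only an illustration, not the DML built-in: the object name RowIndexMaxSketch is hypothetical, rows are assumed non-empty, and resolving ties to the first maximum is an assumption of this sketch.

```scala
object RowIndexMaxSketch {
  // For each row of probabilities, return the one-based column index of the
  // largest value. With one-based class labels, that index is the label.
  // Assumes every row is non-empty; ties go to the first maximum.
  def rowIndexMax(prob: Seq[Seq[Double]]): Seq[Int] =
    prob.map(row => row.indexOf(row.max) + 1)
}
```

For a row such as (0.1, 0.7, 0.2) the largest probability sits in the second column, so the predicted label is 2. If the labels were zero-based, the result would have to be shifted by one.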

> >

> > Thanks,

> >

> > Niketan Pansare

> > IBM Almaden Research Center

> > E-mail: npansar At us.ibm.com

> >

> >

> >

> > From: Shirish Tatikonda <shirish.tatikonda@gmail.com>

> > To: dev@systemml.incubator.apache.org

> > Date: 12/08/2015 12:49 PM

> > Subject: Re: Using GLM-predict

> > ------------------------------

> >

> >

> >

> > Hi Sourav,

> >

> > Yes, GLM-predict.dml gives out only the probabilities. You can put a threshold on the resulting probabilities to get the actual class labels: for example, prob > 0.5 is positive and prob <= 0.5 is negative.

> >

> > The exact value of the threshold typically depends on the data and the application. Different thresholds yield different classifiers with different performance (precision, recall, etc.). You can find the best threshold for a given data set by searching for a value that gives the desired classifier performance (for example, a threshold that gives roughly equal precision and recall). Such an optimization is done during the training phase using a held-out test set.

> >

> > If you wish, you can also modify the DML script to perform this entire

> > process.

> >

> > Shirish

> >

> >

> > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <

> > sourav.mazumder00@gmail.com> wrote:

> >

> > > Hi,

> > >

> > > I have used GLM.dml to create a model using some sample data. It returns to me the matrix of Beta, B.

> > >

> > > Now I want to use this matrix of Beta on a new set of data points and

> > > generate predicted value of the dependent variable/observation.

> > >

> > > When I checked GLM-predict, I could see that one can pass the feature vector for the new data set and also the matrix of beta.

> > >

> > > But I could not see any way to get the predicted value of the dependent

> > > variable/observation. The output parameter only supports matrix of

> > > predicted means/probabilities.

> > >

> > > Is there a way one can get the predicted value of the dependent

> > > variable/observation from GLM-predict ?

> > >

> > > Regards,

> > > Sourav

> > >

> >

> >

> >

>

>

>