From: "Niketan Pansare" <npansar@us.ibm.com>
To: dev@systemml.incubator.apache.org
Date: Wed, 9 Dec 2015 21:56:11 -0800
Subject: Re: Using GLM-predict
Mailing-List: dev@systemml.incubator.apache.org

Hi Sourav,

There are two possible options here:

1. If "unique_id" is a one-based integer column: rename the "unique_id" column to ID and use the registerInput("X", DF1, true) method.

2. If "unique_id" is anything else (for example, a String), there is no trivial way for SystemML to correlate a string-based unique id with a row index (which is required to interpret a DataFrame as a matrix). This means you have to explicitly add the column ID to DF1:

val dataset = RDDConverterUtilsExt.addIDToDataFrame(DF1, sqlContext, "ID")

When you get DF5 from GLM-predict.dml, you can use the following two lines of code, which guarantee the correct mapping:

val DF5 = outNew.getDF(sqlContext, "outPred").withColumnRenamed("C1", "prediction") // Note: there already is a column ID in DF5 which specifies the row index.
val output = dataset.join(DF5, dataset.col("ID").equalTo(DF5.col("ID")))

Note: once a DataFrame is passed to SystemML via registerInput, SystemML first converts the DataFrame into binary block format (i.e. JavaPairRDD<MatrixIndexes, MatrixBlock>) and executes GLM-predict.dml using the binary block. After execution, the output is present in MLOutput ( https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/MLOutput.java#L89 ) in binary block format. If the user chooses to, he/she may call getDF(...), which does the binary block to DataFrame conversion.

For DataFrame to binary block conversion, see https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L277 ... ordering specified by zipWithIndex (which is also used by RDDConverterUtilsExt.addIDToDataFrame)

For binary block to DataFrame conversion, see https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L364 ...
ordering specified by the internal binary block format, and hence we append an extra column ID to specify this ordering.

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar

From: Sourav Mazumder <sourav.mazumder00@gmail.com>
To: dev@systemml.incubator.apache.org
Date: 12/09/2015 06:20 PM
Subject: Re: Using GLM-predict

Hi Niketan,

Thanks again for such a detailed explanation. I see your last point and am in agreement with it. I also got your point on the use of "means" for Gaussian vs. other distributions.

However, I'm still not convinced about the approach you mentioned for correlating the unique id. I've already tried code similar to what you sent, where I used the VectorAssembler utility of Spark MLlib.

Let me try to explain the problem in more detail:

1. Say my original data frame DF1 is distributed across 3 slave nodes in a Spark cluster. Each has, say, 20 rows, for a total of 60 rows. DF1 also has a unique identifier column, say unique_id.
2. Now I used your code to create the feature vector from DF1 and pass it to GLM-predict. GLM-predict in turn returns another data frame (say DF5) of "means" (in this case, the prediction). However, the rows of DF5 may be distributed across 4 slave nodes, each having, say, 15 rows, for a total of 60 rows.
3. Now if I just add this new data frame (DF5) as two additional columns to DF1, where is the guarantee that for a specific unique_id of DF1 I'm getting the right mean/predicted value corresponding to that unique_id ?

Regards,
Sourav

On Wed, Dec 9, 2015 at 4:14 PM, Niketan Pansare <npansar@us.ibm.com> wrote:

> Hi Sourav,
>
> Please see below comments:
>
> >> I was basically hoping for some sort of API where one can pass the original
> data frame and from that dataframe can specify the columns to be used as
> feature and the column to be used for label. This model can work well for
> both creating the model and getting the prediction.
> Please use the most recent jar from git.
> To extract X and Y from your dataframe without IDs, use the following code:
> import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> val features = Array("lat", "height", "precipitation", "pressure")
> val Xmc = new MatrixCharacteristics() // SystemML will set them for you if the dimensions are unknown
> val Ymc = new MatrixCharacteristics()
> val X = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Xmc, features)
> val Y = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Ymc, Array("temperature"))
>
> If you want to add a specific ordering to your DataFrame rows (let's say for
> prediction ... in most cases it is not required), use the following method:
> import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> df = RDDConverterUtilsExt.addIDToDataFrame(df, sqlContext, "ID")
>
> >> 1. Yes dependent variables are nothing but labels
> 2. The values of the dependent variable are not 1 to totalNumOfClasses. The
> values can be any double number. For example say in a weather data set you
> have fields like lat, long, height (from sea level), precipitation,
> pressure, temperature. Now one way you can create a model where Temperature
> is the dependent variable and other are features (the hypothesis is
> Temperature is some function of pressure, precipitation, height, latitude
> and longitude).
> Sorry, in this case, please ignore my earlier suggestion of "Prediction =
> rowIndexMax(Prob)" as it applies only to classification.
> In your case, the returned values are the "means" of the distribution family
> which was used (see
> http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models ).
> If a Gaussian distribution was used (dfam=1, vpow=0.0), and if the problem
> was linear and if you expected a pointy-hat distribution (i.e. positive
> kurtosis), then you can simply return the mean as the predicted label.
> This is because, in the case of a Gaussian distribution, the mean is also
> the mode. In other cases, that might not necessarily be true.
>
> You may ask why we are making it so complicated and why we don't just return
> the predicted labels instead of probabilities.
> Well, the problem of labelling is not as simple as it appears, and it
> highly depends on the problem setting. Let's consider the problem of
> multi-class classification and my earlier suggestion "Prediction =
> rowIndexMax(Prob)". Also, let the labels be as follows = {cancer, sore
> throat, birth defect, fever, normal}. For a given test example, let's
> say GLM-predict.dml outputs the following probabilities = {cancer: 0.2, sore
> throat: 0.15, birth defect: 0.15, fever: 0.2, normal: 0.3}. Then according
> to "Prediction = rowIndexMax(Prob)", we should output the label "normal"
> and send the patient home ... right ? No. In this case, a 20% probability of
> cancer is just way too high for a doctor to send the patient home. In this
> setting, the doctor might then say to the data scientist: based on my domain
> knowledge of the prevalence of cancer in the general public, I suggest that
> a cancer probability over "threshold" should always be flagged as cancer.
> Otherwise, output the label with the highest probability.
> Using this suggestion, the data scientist modifies the DML as follows:
> zeroOneMat = ppred(prob[,cancerColID], threshold, ">")
> prediction = zeroOneMat*cancerColID + (1-zeroOneMat)*rowIndexMax(prob)
>
> This also shows the usefulness of "Declarative Machine Learning" :)
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Sourav Mazumder
> To: dev@systemml.incubator.apache.org
> Date: 12/09/2015 01:15 PM
> Subject: Re: Using GLM-predict
> ------------------------------
>
> Hi Niketan,
>
> Firstly to answer your Qs -
>
> 1. Yes dependent variables are nothing but labels
> 2. The values of the dependent variable are not 1 to totalNumOfClasses. The
> values can be any double number. For example say in a weather data set you
> have fields like lat, long, height (from sea level), precipitation,
> pressure, temperature. Now one way you can create a model where Temperature
> is the dependent variable and other are features (the hypothesis is
> Temperature is some function of pressure, precipitation, height, latitude
> and longitude).
>
> Not sure about the correlation between step 2 and step 3 in your mail. In
> step 3 does one have to pass the 'ID' column (created in step 2) to varName
> while calling registerInput(String varName, DataFrame df, containsID) ?
>
> However the unique Id in the typical case can be a string. Can't that be used
> as is instead ? This means one has to first convert the original unique id to
> an integer to create an additional unique id column, and then again later on
> that integer unique id has to be mapped back.
>
> I was basically hoping for some sort of API where one can pass the original
> data frame and from that dataframe can specify the columns to be used as
> feature and the column to be used for label. This model can work well for
> both creating the model and getting the prediction.
>
> Regards,
> Sourav
>
> On Wed, Dec 9, 2015 at 12:53 PM, Niketan Pansare wrote:
>
> > Hi Sourav,
> >
> > Couple of questions to make sure we are on the same page: does the "dependent
> > variable (double)" represent the class labels ? Are the values of the
> > class labels from 1 to numClasses (i.e. one-based) ?
> >
> > Here are a few comments regarding correlating IDs:
> >
> > To represent an unordered collection (i.e. DataFrame) as an ordered
> > collection ("Matrix"), we add a special column "ID" which represents the
> > *one-based row index*. Please perform the following steps:
> > 1. Accept recent changes from https://github.com/apache/incubator-systemml
> > and use the generated jar.
> >
> > 2. Map the unique id in DF1 to int (*1 to number of rows*) and call that
> > column 'ID'.
> >
> > 3. Use the variant of registerInput for both X (both for training and
> > predicting) and Y:
> > registerInput(String varName, DataFrame df, boolean containsID)
> >
> > As a side note: instead of separate double columns, you can represent them
> > using VectorUDT and use our converter "JavaPairRDD<MatrixIndexes,
> > MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame
> > inputDF, MatrixCharacteristics mcOut, boolean containsID, String
> > vectorColumnName)"
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> > From: Sourav Mazumder
> > To: dev@systemml.incubator.apache.org
> > Date: 12/09/2015 11:15 AM
> > Subject: Re: Using GLM-predict
> > ------------------------------
> >
> > Hi Niketan,
> >
> > The code you provided works fine. The use of getMatrixCharacteristics
> > solves the basic execution problem.
> >
> > However, question #3 is probably not yet resolved. Let me explain the use
> > case scenario I'm trying to build.
> >
> > 1. Say I have a data frame (DF1) with a Unique Id (string), a bunch of
> > columns (say 4) which are to be used as features (double), and a column for
> > the dependent variable (double).
> > 2. When I created the model I created a data frame (DF2) from DF1 using
> > only the feature vectors and passed that as X. And the column with the
> > dependent value is passed as Y.
> > 3. For calling GLM-predict I'm using another data frame (DF3) of the same
> > structure but with different Unique IDs (essentially different
> > records/rows). From that data frame I'm first creating another data frame
> > (DF4) containing the columns representing the features. Then I'm sending
> > DF4 to GLM-predict, which has only feature vectors.
> > 4. The response I get from GLM-predict is the 'means'. Then I'm using the
> > inline predict script, which returns another data frame (DF5) with ID and
> > Predicted values.
> >
> > The question is how do I correlate the ID I'm getting from DF5 with the
> > Unique ID of the data frame DF3 ?
> >
> > Regards,
> > Sourav
> >
> > On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare wrote:
> >
> > > Hi Sourav,
> > >
> > > 1. In the GLM-predict.dml I could see 'means' is the output variable. In my
> > > understanding it is same as the probability matrix u have mentioned in your
> > > mail (to be used to compute the prediction). Am I right ?
> > > Yes, that's correct.
> > >
> > > 2. From GLM.dml I get the 'betas' as output using
> > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to GLM-predict.dml
> > > as B.
> > >
> > > Can you try this ?
> > > // Get output from GLM
> > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > val betaMC = outputs.getMatrixCharacteristics("beta_out") // This way you don't have to worry about dimensions.
> > > // -----------------------------------------
> > > val Xin = DataFrame/RDD of values (or even text/csv file) you want to predict
> > > // -----------------------------------------
> > > // Execute GLM-predict
> > > ml.reset()
> > > // Please read https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> > > // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> > > val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...") // family of distribution ?
> > > ml.registerInput("X", Xin)
> > > ml.registerInput("B_full", beta, betaMC)
> > > ml.registerOutput("means")
> > > val outputsPredict = ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml", cmdLineParamsPredict)
> > > val prob = outputsPredict.getBinaryBlockedRDD("means")
> > > val probMC = outputsPredict.getMatrixCharacteristics("means")
> > > // -----------------------------------------
> > > // Get predicted label
> > > ml.reset()
> > > ml.registerInput("Prob", prob, probMC)
> > > ml.registerOutput("Prediction")
> > > val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
> > >   + "Prediction = rowIndexMax(Prob); "
> > >   + "write(Prediction, \"tempOut\", \"csv\")")
> > > val pred = outputsLabels.getDF(sqlContext, "Prediction").withColumnRenamed("C1", "prediction")
> > > // -----------------------------------------
> > >
> > > 3. Say I get back prediction matrix as an output (from predictions =
> > > rowIndexMax(means);).
> > > Now can I add that as a column to my original
> > > data frame (the one from which I created the feature vector for the
> > > original model) ? My concern is whether adding it back will ensure the right
> > > order so that the key for the feature vector and the predicted value remain
> > > the same ? If not, how to achieve the same ?
> > > In the above example 'pred' is a DataFrame with a column 'ID' which provides
> > > the row ID.
> > >
> > > Thanks,
> > >
> > > Niketan Pansare
> > > IBM Almaden Research Center
> > > E-mail: npansar At us.ibm.com
> > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > >
> > > From: Sourav Mazumder
> > > To: dev@systemml.incubator.apache.org, Niketan Pansare/Almaden/IBM@IBMUS
> > > Date: 12/08/2015 10:53 PM
> > > Subject: Re: Using GLM-predict
> > > ------------------------------
> > >
> > > Hi Niketan,
> > >
> > > Thanks again for the detailed inputs.
> > >
> > > Some more follow up Qs -
> > >
> > > 1. In the GLM-predict.dml I could see 'means' is the output variable. In my
> > > understanding it is same as the probability matrix u have mentioned in your
> > > mail (to be used to compute the prediction). Am I right ?
> > >
> > > 2. From GLM.dml I get the 'betas' as output using
> > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to GLM-predict.dml
> > > as B. For registering B the following statements are used:
> > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > ml.registerInput("B", beta, 1, 4) // I have four feature vectors so I get 4 coefficients
> > >
> > > However, when I execute GLM-predict.dml I get the following error.
> > >
> > > val outputs = ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml", cmdLineParams)
> > >
> > > 15/12/09 05:32:47 WARN Expression: Metadata file: .mtd not provided
> > > 15/12/09 05:32:47 ERROR Expression: ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement: .mtd
> > > com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement: .mtd
> > >
> > > In line 117 we have following statement : X = read (fileX);
> > >
> > > 3. Say I get back prediction matrix as an output (from predictions =
> > > rowIndexMax(means);). Now can I add that as a column to my original
> > > data frame (the one from which I created the feature vector for the
> > > original model) ? My concern is whether adding it back will ensure the right
> > > order so that the key for the feature vector and the predicted value remain
> > > the same ? If not, how to achieve the same ?
> > >
> > > Regards,
> > > Sourav
> > >
> > > On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare wrote:
> > >
> > > > Hi Sourav,
> > > >
> > > > For some reason, I didn't get your email on "Tue, 08 Dec 2015 12:56:38 -0800"
> > > > <https://www.mail-archive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208>
> > > > (which I noticed in the archive).
> > > >
> > > > >> Not sure how exactly I can modify the GLM-predict.dml to get some
> > > > prediction to start with.
> > > > There are two options here:
> > > > 1. Modify GLM-predict.dml as suggested by Shirish (better approach with
> > > > respect to the SystemML optimizer) or
> > > >
> > > > 2. Run a new script on the output of GLM-predict.
> > > > Please see:
> > > > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> > > > If you choose to go with option 2, you might also want to read the
> > > > documentation of the following two built-in functions:
> > > > a. rowIndexMax (see
> > > > http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions )
> > > > b. ppred
> > > >
> > > > >> Can you give me some idea how from here I can calculate the predicted
> > > > value of the label using some value of probability threshold ?
> > > > A very simple way to predict the label given the probability matrix:
> > > > Prediction = rowIndexMax(Prob) # predicts the label with highest probability. This assumes one-based labels.
> > > >
> > > > Thanks,
> > > >
> > > > Niketan Pansare
> > > > IBM Almaden Research Center
> > > > E-mail: npansar At us.ibm.com
> > > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > > >
> > > > From: Shirish Tatikonda
> > > > To: dev@systemml.incubator.apache.org
> > > > Date: 12/08/2015 12:49 PM
> > > > Subject: Re: Using GLM-predict
> > > > ------------------------------
> > > >
> > > > Hi Sourav,
> > > >
> > > > Yes, GLM-predict.dml gives out only the probabilities.
> > > > You can put a threshold on the resulting probabilities to get the actual
> > > > class labels -- for example, prob > 0.5 is positive and <= 0.5 is negative.
> > > >
> > > > The exact value of the threshold typically depends on the data and the
> > > > application. Different thresholds yield different classifiers with
> > > > different performance (precision, recall, etc.). You can find the best
> > > > threshold for the given data set by finding a value that gives the desired
> > > > classifier performance (for example, a threshold that gives roughly equal
> > > > precision and recall). Such an optimization is obviously done during the
> > > > training phase using a held out test set.
> > > >
> > > > If you wish, you can also modify the DML script to perform this entire
> > > > process.
> > > >
> > > > Shirish
> > > >
> > > > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <sourav.mazumder00@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I have used GLM.dml to create a model using some sample data. It returns
> > > > > to me the matrix of Beta, B.
> > > > >
> > > > > Now I want to use this matrix of Beta on a new set of data points and
> > > > > generate predicted value of the dependent variable/observation.
> > > > >
> > > > > When I checked GLM-predict, I could see that one can pass feature vector
> > > > > for the new data set and also the matrix of beta.
> > > > >
> > > > > But I could not see any way to get the predicted value of the dependent
> > > > > variable/observation. The output parameter only supports matrix of
> > > > > predicted means/probabilities.
> > > > >
> > > > > Is there a way one can get the predicted value of the dependent
> > > > > variable/observation from GLM-predict ?
> > > > >
> > > > > Regards,
> > > > > Sourav
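
The ID-based correlation described in the top message of this thread can be sketched in plain Scala, with no Spark or SystemML dependency. This is only an illustration under stated assumptions: Record, addId and joinById are hypothetical names introduced here, the "model" is a stand-in that sums the features, and the reversed output merely mimics rows coming back in a different order or partitioning.

```scala
// Plain-Scala sketch; every name below is hypothetical and not part of SystemML.
object IdJoinSketch {
  final case class Record(uniqueId: String, features: Vector[Double])

  // Attach a one-based row index, playing the role of the ID column.
  def addId(rows: Seq[Record]): Seq[(Long, Record)] =
    rows.zipWithIndex.map { case (r, i) => (i + 1L, r) }

  // Join predictions back to the records by ID, independent of arrival order.
  def joinById(withId: Seq[(Long, Record)],
               preds: Seq[(Long, Double)]): Map[String, Double] = {
    val predById = preds.toMap
    withId.flatMap { case (id, r) =>
      predById.get(id).map(p => r.uniqueId -> p)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Record("station-a", Vector(1.0, 2.0)),
      Record("station-b", Vector(3.0, 4.0)),
      Record("station-c", Vector(5.0, 6.0)))
    val withId = addId(rows)
    // Stand-in for the model: the "prediction" is the sum of the features.
    // Reversing mimics results returned in a different row order.
    val preds = withId.map { case (id, r) => (id, r.features.sum) }.reverse
    val joined = joinById(withId, preds)
    println(joined("station-a")) // 3.0
    println(joined("station-c")) // 11.0
  }
}
```

Because the join key is the row index assigned before prediction, the string unique id never has to pass through the matrix; it is recovered after the join, whatever order the predictions arrive in.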

Hi Sourav,

There are two possible options here:
1.= If "unique=5Fid" is one-based integer column: In this c= ase, please rename "unique=5Fid" column to ID and use registerInput("X", DF1, true) method.

2. If "unique= =5Fid" is anything else (for example: String), then there is no t= rivial way for SystemML to correlate "string-based unique id" to = row index (which is required to interpret a DataFrame into a matrix). This = means you have to explicitly add the column ID to DF1:
val dataset =3D RDD= ConverterUtilsExt.addIDToDataFrame= (DF1, sqlContext, "ID")

When you get DF5 from GLM-predict.dml, you can use followi= ng two lines of code which guarantees correct mapping:
val DF5 =3D = outNew.getDF(sqlContext, "outPred").withColumnRenamed("C1", "pr= ediction") // Note: there alread= y is a column ID in DF5 which specifies the row index.
val output =3D dataset1.join(pred,
datas= et1.col("ID").equalTo(pred.col("ID")))

Note: once DataFrame is passed= to SystemML via registerInput, SystemML first converts the DataFrame into = binary block (i.e. JavaPairRDD<MatrixIndexes, MatrixBlock>) and execu= tes GLM-predict.dml using the binary block. After execution, the output is = present in MLOutput (https:/= /github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/= sysml/api/MLOutput.java#L89) in binary block format. If user choses to,= he/she may call getDF(...) which does DataFrame to binary block conversion= .

For DataFrame to binary block conversion, see https:= //github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache= /sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L277 = ... ordering specified by zipWithIndex (which is also used by RDDConverterUtilsExt.addIDToDa= taFrame)
For binary block to DataFrame conversion, see https://github.com/apache/incubator-systemml/blob/master/src/main/j= ava/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.= java#L364 ... ordering specified by internal binary block format and he= nce we append an extra column ID to specify this ordering.

Thanks,
Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At = us.ibm.com
http://researcher.watson.ibm.com/researcher/view.ph= p?person=3Dus-npansar

3D"In=Sourav Mazumder ---12/09/2015 06:20:24 PM---Hi Niketan, Thanks again for s= uch a detailed explanation. I see your last point and in

From:
Sourav = Mazumder <sourav.mazumder00@gmail.com>
To: dev@systemml.incubator.ap= ache.org
Date: <= font size=3D"2">12/09/2015 06:20 PM

Subject: Re: Using GLM-predict<= br>





Hi Niketan,

Thanks again for such a detailed = explanation. I see your last point and in
agreement with the same. Also = I got your point on the use of "means" for
gaussian vs other d= istributions.

However, I'm still not convinced about the approach yo= u mentioned for
correlating the unique id. I've already tried a code sim= ilar to what you
sent where I've used the vectorAssembler utility of Spa= rk ML LIb.

Let me try to explain the problem in more detail:

1. Say my original data frame DF1 is distributed across 3 slave nodes in a
Spark cluster. Each has, say, 20 rows, for a total of 60 rows. DF1 also has a
unique identifier column, say unique_id.
2. Now I use your code to create the feature vector from DF1 and pass it
to GLM-predict. GLM-predict in turn returns another data frame (say
DF5) of "means" (in this case, say, the prediction). However, the rows of DF5 may
be distributed across 4 slave nodes, each having, say, 15 rows, for a total of 60 rows.
3. Now if I just add this new data frame (DF5) as additional two columns to
DF1, where is the guarantee that for a specific unique_id of DF1 I'm getting
the right mean/predicted value corresponding to that unique_id ?

Regards,
Sourav



On Wed, Dec 9, 2015 at 4:14 PM, Niketan Pansare <npansar@us.ibm.com> wrote:

> Hi Sourav,
>
> Please see below comments:
>
> >> I was basically hoping for some sort of API where one can pass the
> original
> data frame and from that dataframe can specify the columns to be used as
> feature and the column to be used for label. This model can work well for
> both creating the model and getting the prediction.
> Please use the most recent jar from git. To extract X and Y from your
> dataframe without IDs, use following code:
> import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> val features = Array("lat", "height", "precipitation", "pressure")
> // SystemML will set them for you if the dimensions are unknown
> val Xmc = new MatrixCharacteristics()
> val Ymc = new MatrixCharacteristics()
> val X = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Xmc, features)
> val Y = RDDConverterUtilsExt.dataFrameToBinaryBlock(sc, df, Ymc,
>   Array("temperature"))
>
> If you want to add specific ordering to your DataFrame rows (let's say for
> prediction ... in most cases it is not required), use following method:
> import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> df = RDDConverterUtilsExt.addIDToDataFrame(df, sqlContext, "ID")
>
> >> 1. Yes dependent variables are nothing but labels
> 2. The values of the dependent variable are not 1 to totalNumOfClasses. The
> values can be any double number. For example say in a weather data set you
> have fields like lat, long, height (from sea level), precipitation,
> pressure, temperature. Now one way you can create a model where Temperature
> is the dependent variable and other are features (the hypothesis is
> Temperature is some function of pressure, precipitation, height, latitude
> and longitude.
> Sorry, in this case, please ignore my earlier suggestion of "Prediction =
> rowIndexMax(Prob)" as it applies only to classification.
> In your case, the returned values are "means" of the distribution family
> which was used (See
> http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models).
> If Gaussian distribution was used (dfam=1, vpow=0.0), and if the problem
> was linear and if you expected pointy-hat distribution (i.e. positive
> kurtosis), then you can simply return the mean as predicted label. This is
> because in case of Gaussian distribution, mean is also the mode. In other
> cases, it might not necessarily be true.
>
> You may ask why we are making it so complicated and why not just return
> the predicted labels instead of probability ?
> Well, the problem of labelling is not as simple as it appears and it
> highly depends on the problem setting. Let's consider the problem of
> multi-class classification and my earlier suggestion "Prediction =
> rowIndexMax(Prob)". Also, let the labels be as follows = {cancer, sore
> throat, birth defect, fever, normal}. For a given test example, let's
> say GLM-predict.dml outputs following probability = {cancer: 0.2, sore
> throat: 0.15, birth defect: 0.15, fever: 0.2, normal: 0.3}. Then according
> to "Prediction = rowIndexMax(Prob)", we should output the label "normal"
> and send the patient home ... right ? No. In this case, 20% probability of
> cancer is just way too high for a doctor to send the patient home. In this
> setting, the doctor might then say to the data scientist: I know the
> prevalence of cancer in general public, and based on that domain
> knowledge, I suggest that a cancer probability over "threshold" should always be
> flagged as cancer. Else output the label with highest probability. Using
> this suggestion, the data scientist modifies the DML as follows:
> zeroOneMat = ppred(prob[, cancerColID], threshold, ">")
> prediction = zeroOneMat*cancerColID + (1-zeroOneMat)*rowIndexMax(prob)
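
The same decision rule can be sketched outside DML. The plain-Python version below is only an illustration: the threshold values and the position of "cancer" as column 1 are assumptions, not values from the thread, and the helper names are hypothetical. Labels are one-based, as with rowIndexMax.

```python
def row_index_max(prob_row):
    """One-based index of the largest entry, like rowIndexMax applied to one row."""
    return max(range(len(prob_row)), key=lambda j: prob_row[j]) + 1

def predict(prob_row, cancer_col_id, threshold):
    """Flag cancer when its probability exceeds the threshold, else take the argmax."""
    zero_one = 1 if prob_row[cancer_col_id - 1] > threshold else 0
    return zero_one * cancer_col_id + (1 - zero_one) * row_index_max(prob_row)

labels = ["cancer", "sore throat", "birth defect", "fever", "normal"]
prob = [0.2, 0.15, 0.15, 0.2, 0.3]

# Plain argmax picks the most probable label.
plain = row_index_max(prob)
# With an assumed threshold of 0.1, the 0.2 cancer probability is flagged.
flagged = predict(prob, cancer_col_id=1, threshold=0.1)
print(labels[plain - 1], labels[flagged - 1])
```

With a threshold above 0.2 the rule falls back to the plain argmax, which is the behaviour the two-term DML expression encodes.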
>
> This also shows the usefulness of "Declarative Machine Learning" :)
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
>
> From: Sourav Mazumder <sourav.mazumder00@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 12/09/2015 01:15 PM
> Subject: Re: Using GLM-predict
> ------------------------------
>
>
> Hi Niketan,
>
> Firstly to answer your Qs -
>
> 1. Yes dependent variables are nothing but labels
> 2. The values of the dependent variable are not 1 to totalNumOfClasses. The
> values can be any double number. For example say in a weather data set you
> have fields like lat, long, height (from sea level), precipitation,
> pressure, temperature. Now one way you can create a model where Temperature
> is the dependent variable and other are features (the hypothesis is
> Temperature is some function of pressure, precipitation, height, latitude
> and longitude.
>
> Not sure about the correlation between step 2 and step 3 in your mail. In
> step 3 does one have to pass 'ID' column (created in step 2) to varName
> while calling registerInput(String varName, DataFrame df, containsID) ?
>
> However the unique Id in typical case can be string. Can't that be used as
> is instead ? This means one has to first convert the original unique id to
> integer to create an additional unique id column and then again later on
> that integer unique id has to be mapped back.
>
> I was basically hoping for some sort of API where one can pass the original
> data frame and from that dataframe can specify the columns to be used as
> feature and the column to be used for label. This model can work well for
> both creating the model and getting the prediction.
>
> Regards,
> Sourav
>
> On Wed, Dec 9, 2015 at 12:53 PM, Niketan Pansare <npansar@us.ibm.com>
> wrote:
>
> > Hi Sourav,
> >
> > Couple of questions to make sure we are on same page: does the "dependent
> > variable (double)" represents the class labels ? Are the values of the
> > class labels from 1 to numClasses (i.e. one-based) ?
> >
> > Here are few comments regarding correlating IDs:
> >
> > To represent an unordered collection (i.e. DataFrame) to an ordered
> > collection ("Matrix"), we add special column "ID" which represents
> > *one-based row index*. Please perform following steps:
> > 1. Accept recent changes from
> > https://github.com/apache/incubator-systemml
> > and use the generated jar.
> >
> > 2. Map the unique id in DF1 to int (*1 to number of rows*) and call that
> > column 'ID'.
> >
> > 3. Use the variant of registerInput for both X (both for training and
> > predicting) and Y:
> > registerInput(String varName, DataFrame df, *boolean* containsID)
> >
> > As a side note: instead of separate double columns, you can represent them
> > using VectorUDT and use our converter "JavaPairRDD<MatrixIndexes,
> > MatrixBlock> vectorDataFrameToBinaryBlock(JavaSparkContext sc, DataFrame
> > inputDF, MatrixCharacteristics mcOut, *boolean* containsID, String
> > vectorColumnName)"
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> >
> > From: Sourav Mazumder <sourav.mazumder00@gmail.com>
> > To: dev@systemml.incubator.apache.org
> > Date: 12/09/2015 11:15 AM
> > Subject: Re: Using GLM-predict
> > ------------------------------
>= >
> >
> >
> > Hi Niketan,
> >
> > The code you provided works fine. The use of getMatrixCharacteristics
> > solves the basic execution problem.
> >
> > However, question #3 is probably not yet resolved. Let me explain the use
> > case scenario I'm trying to build.
> >
> > 1. Say I have a data frame (DF1) with a Unique Id (string), a bunch of
> > columns (say 4) which are to be used as features (double), and a column for
> > the dependent variable (double).
> > 2. When I created the model I created a data frame (DF2) from DF1 using
> > only the feature vectors and pass that as X. And the column with dependent
> > value is passed as Y.
> > 3. For calling the GLM-predict I'm using another data frame (DF3) of same
> > structure but with different Unique ID (essentially different
> > records/rows). From that data frame I'm first creating another data frame
> > (DF4) containing the columns representing the features. Then I'm sending
> > DF4 to GLM-predict which has only feature vectors.
> > 4. The response I get from GLM-predict is the 'means'. Then I'm using the
> > inline predict script which returns another data frame (DF5) with ID and
> > Predicted values.
> >
> > The question is how do I correlate the ID I'm getting from DF5 with the
> > Unique ID of the data frame DF3 ?
> >
> > Regards,
> > Sourav
> >
> >
> >
> >
> > On Wed, Dec 9, 2015 at 9:17 AM, Niketan Pansare <npansar@us.ibm.com>
> > wrote:
> >
> > > Hi Sourav,
> > >
> > > 1. In the GLM-predict.dml I could see 'means' is the output variable. In my
> > > understanding it is same as the probability matrix u have mentioned in your
> > > mail (to be used to compute the prediction). Am I right ?
> > > Yes, that's correct.
> > >
> > > 2. From GLM.dml I get the 'betas' as output using
> > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to GLM-predict.dml
> > > as B.
> > >
> > > Can you try this ?
> > > // Get output from GLM
> > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > // This way you don't have to worry about dimensions.
> > > val betaMC = outputs.getMatrixCharacteristics("beta_out")
> > > // -----------------------------------------
> > > val Xin = ... // DataFrame/RDD of values (or even text/csv file) you want to predict
> > > // -----------------------------------------
> > > // Execute GLM-predict
> > > ml.reset()
> > > // Please read
> > > // https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
> > > // dfam Int 1 Distribution family code: 1 = Power, 2 = Binomial
> > > val cmdLineParamsPredict = Map("X" -> " ", "B" -> " ", "dfam" -> "...") // family of distribution ?
> > > ml.registerInput("X", Xin)
> > > ml.registerInput("B_full", beta, betaMC)
> > > ml.registerOutput("means")
> > > val outputsPredict =
> > >   ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > >     cmdLineParamsPredict)
> > > // Read the results from outputsPredict (the original referred to an undefined "out")
> > > val prob = outputsPredict.getBinaryBlockedRDD("means")
> > > val probMC = outputsPredict.getMatrixCharacteristics("means")
> > > // -----------------------------------------
> > > // Get predicted label
> > > ml.reset()
> > > ml.registerInput("Prob", prob, probMC)
> > > ml.registerOutput("Prediction")
> > > val outputsLabels = ml.executeScript("Prob = read(\"temp1\"); "
> > >   + "Prediction = rowIndexMax(Prob); "
> > >   + "write(Prediction, \"tempOut\", \"csv\")")
> > > val pred = outputsLabels.getDF(sqlContext,
> > >   "Prediction").withColumnRenamed("C1", "prediction")
> > > // -----------------------------------------
> > >
> > >
> > > 3. Say I get back prediction matrix as an output (from predictions =
> > > rowIndexMax(means);). Now can I read add that as a column to my original
> > > data frame (the one from which I created the feature vector for the
> > > original model) ? My concern is whether adding back will ensure the right
> > > order so that the key for the feature vector and the predicted value remain
> > > same ? If not how to achieve the same ?
> > > In above example 'pred' is a DataFrame with column 'ID' which provides the
> > > row ID.
> > >
> > > Thanks,
> > >
> > > Niketan Pansare
> > > IBM Almaden Research Center
> > > E-mail: npansar At us.ibm.com
> > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> > >
> > >
> > > From: Sourav Mazumder <sourav.mazumder00@gmail.com>
> > > To: dev@systemml.incubator.apache.org, Niketan Pansare/Almaden/IBM@IBMUS
> > > Date: 12/08/2015 10:53 PM
> > > Subject: Re: Using GLM-predict
> > > ------------------------------
> > >
> > >
> > >
> > > Hi Niketan,
> > >
> > > Thanks again for the detailed inputs.
> > >
> > > Some more follow up Qs -
> > >
> > > 1. In the GLM-predict.dml I could see 'means' is the output variable. In my
> > > understanding it is same as the probability matrix u have mentioned in your
> > > mail (to be used to compute the prediction). Am I right ?
> > >
> > > 2. From GLM.dml I get the 'betas' as output using
> > > outputs.getBinaryBlockedRDD("beta_out"). The same I pass to GLM-predict.dml
> > > as B. For registering B following statements are used
> > > val beta = outputs.getBinaryBlockedRDD("beta_out")
> > > ml.registerInput("B", beta, 1, 4) // I have four feature vectors so I get 4 coefficients
> > >
> > > However, when I execute GLM-predict.dml I get following error.
> > >
> > > val outputs =
> > >   ml.execute("/home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml",
> > >     cmdLineParams)
> > >
> > > 15/12/09 05:32:47 WARN Expression: Metadata file:  .mtd not provided
> > > 15/12/09 05:32:47 ERROR Expression: ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
> > > com.ibm.bi.dml.parser.LanguageException: Invalid Parameters : ERROR: /home/system-ml-0.9.0-SNAPSHOT/algorithms/GLM-predict.dml -- line 117, column 8 -- Missing or incomplete dimension information in read statement:  .mtd
> > >
> > > In line 117 we have following statement : X = read (fileX);
> > >
> > > 3. Say I get back prediction matrix as an output (from predictions =
> > > rowIndexMax(means);). Now can I read add that as a column to my original
> > > data frame (the one from which I created the feature vector for the
> > > original model) ? My concern is whether adding back will ensure the right
> > > order so that the key for the feature vector and the predicted value remain
> > > same ? If not how to achieve the same ?
> > >
> > > Regards,
> > > Sourav
> > >
> > >
> > >
> > >
> > >
> > > On Tue, Dec 8, 2015 at 2:08 PM, Niketan Pansare <npansar@us.ibm.com>
> > > wrote:
> > >
> > > > Hi Sourav,
> > > >
> > > > For some reason, I didn't get your email on "*Tue, 08 Dec 2015 12:56:38
> > > > -0800*
> > > > <https://www.mail-archive.com/search?l=dev@systemml.incubator.apache.org&q=date:20151208>"
> > > > (which I noticed in the archive).
> > > >
> > > > >> Not sure how exactly I can modify the GLM-predict.dml to get some
> > > > prediction to start with.
> > > > There are two options here:
> > > > 1. Modify GLM-predict.dml as suggested by Shirish (better approach with
> > > > respect to the SystemML optimizer) or
> > > >
> > > > 2. Run a new script on the output of GLM-predict. Please see:
> > > > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegressionModel.java#L163
> > > > If you chose to go with option 2, you might also want to read the
> > > > documentation of following two built-in functions:
> > > > a. rowIndexMax (See
> > > > http://apache.github.io/incubator-systemml/dml-language-reference.html#matrix-andor-scalar-comparison-built-in-functions
> > > > )
> > > > b. ppred
> > > >
> > > > >> Can you give me some idea how from here I can calculate the predicted
> > > > value of the label using some value of probability threshold ?
> > > > Very simple way to predict the label given probability matrix:
> > > > Prediction = rowIndexMax(Prob) # predicts the label with highest
> > > > probability. This assumes one-based labels.
> > > >
> > > > Thanks,
> > > >
> > > > Niketan Pansare
> > > > IBM Almaden Research Center
> > > > E-mail: npansar At us.ibm.com
> > > > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>= > > >
> > > >
> > > > From: Shirish Tatikonda <shirish.tatikonda@gmail.com>
> > > > To: dev@systemml.incubator.apache.org
> > > > Date: 12/08/2015 12:49 PM
> > > > Subject: Re: Using GLM-predict
> > > > ------------------------------
> > > >
> > > >
> > > >
> > > > Hi Sourav,
> > > >
> > > > Yes, GLM-predict.dml gives out only the probabilities. You can put a
> > > > threshold on the resulting probabilities to get the actual class labels --
> > > > for example, prob > 0.5 is positive and <=0.5 as negative.
> > > >
> > > > The exact value of threshold typically depends on the data and the
> > > > application. Different thresholds yield different classifiers with
> > > > different performance (precision, recall, etc.). You can find the best
> > > > threshold for the given data set by finding a value that gives the desired
> > > > classifier performance (for example, a threshold that gives roughly equal
> > > > precision and recall). Such an optimization is obviously done during the
> > > > training phase using a held out test set.
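
As a rough illustration of the threshold search described above, the following plain-Python sketch picks, from a few candidate thresholds, the one whose precision and recall are closest on a held-out set. The scores, labels, candidate values, and function names are all made up for the example and are not taken from the thread or from SystemML.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when score > threshold is predicted positive."""
    predicted = [s > threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_threshold(scores, labels, candidates):
    """Candidate with the smallest gap between precision and recall."""
    def gap(t):
        precision, recall = precision_recall(scores, labels, t)
        return abs(precision - recall)
    return min(candidates, key=gap)

# Hypothetical held-out probabilities and true binary labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

best = best_threshold(scores, labels, candidates=[0.3, 0.5, 0.7])
print(best, precision_recall(scores, labels, best))
```

Other criteria (for example a minimum recall for a critical class) can be swapped in by changing the `gap` function.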
> = > > >
> > > > If you wish, you can also modify the DML script to perform this entire
> > > > process.
> > > >
> > > > Shirish
> > > >
> > > >
> > > > On Tue, Dec 8, 2015 at 12:23 PM, Sourav Mazumder <
> > > > sourav.mazumder00@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I have used GLM.dml to create a model using some sample data. It returns to
> > > > > me the matrix of Beta, B.
> > > > >
> > > > > Now I want to use this matrix of Beta on a new set of data points and
> > > > > generate predicted value of the dependent variable/observation.
> > > > >
> > > > > When I checked GLM-predict, I could see that one can pass feature vector
> > > > > for the new data set and also the matrix of beta.
> > > > >
> > > > > But I could not see any way to get the predicted value of the dependent
> > > > > variable/observation. The output parameter only supports matrix of
> > > > > predicted means/probabilities.
> > > > >
> > > > > Is there a way one can get the predicted value of the dependent
> > > > > variable/observation from GLM-predict ?
> > > > >
> > > > > Regards,
> > > > > Sourav
> > > > >

