Date: Tue, 8 Dec 2015 00:29:33 -0800
Subject: Re: Using GLM with Spark
From: Shirish Tatikonda
To: dev@systemml.incubator.apache.org

Hi Sourav,

Just to add to Niketan's response, you can find a utility DML script at [1] that splits a data set into X and Y. This is obviously useful only if you have one unified data set containing both X and Y.

[1] https://github.com/apache/incubator-systemml/blob/master/scripts/utils/splitXY.dml

Shirish

On Dec 7, 2015 11:11 PM, "Niketan Pansare" wrote:

> Hi Sourav,
>
> Your understanding is correct: X and Y can be supplied either as a file or
> as an RDD/DataFrame. Each of these two mechanisms has its own benefits. The
> former mechanism (i.e.
> passing as a file) pushes the reading/reblocking into the optimizer,
> while the latter mechanism allows for preprocessing of the data
> (for example, using Spark SQL).
>
> Two use cases where X and Y are supplied as files on HDFS:
>
> 1. Command-line invocation:
>
> $SPARK_HOME/bin/spark-submit --master .... SystemML.jar -f GLM.dml
>   -nvargs dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.001 tol=0.00000001
>   disp=1.0 moi=100 mii=10 X=INPUT_DIR/X Y=INPUT_DIR/Y
>   B=OUTPUT_DIR/betas fmt=csv O=OUTPUT_DIR/stats Log=OUTPUT_DIR/log
>
> 2. Using MLContext, but without registering X and Y as inputs. Instead, we
> pass the file names as command-line parameters:
>
> val ml = new MLContext(sc)
> val cmdLineParams = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y",
>   "dfam" -> "2", "link" -> "2", ...)
> ml.execute("GLM.dml", cmdLineParams)
>
> As mentioned earlier, X and Y can be provided as an RDD/DataFrame as well:
>
> val ml = new MLContext(sc)
> ml.registerInput("X", xDF)
> ml.registerInput("Y", yDF)
> val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ",
>   "dfam" -> "2", "link" -> "2", ...)
> ml.execute("GLM.dml", cmdLineParams)
>
> One important thing that I must point out is the concept of "ifdef". It is
> explained in this section:
> http://apache.github.io/incubator-systemml/dml-language-reference.html#command-line-arguments
>
> Here is a snippet from the DML script for GLM:
> https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
>
> fileX = $X;
> fileY = $Y;
> fileO = ifdef ($O, " ");
> fmtB = ifdef ($fmt, "text");
> distribution_type = ifdef ($dfam, 1);
>
> The above DML code essentially says that $X and $Y are required parameters
> (a design decision the GLM script writer made), whereas $fmt and $dfam are
> optional, as they are assigned default values when not explicitly provided.
> Both constructs are important tools in the arsenal of a DML script
> writer.
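The required-versus-optional behavior of ifdef described above can be sketched in plain Scala (a hypothetical model of the semantics only, not SystemML code): a parameter guarded by ifdef falls back to its default, while an unguarded one fails if the user did not supply it.

```scala
// Hypothetical Scala model of DML's ifdef default-substitution semantics.
// Command-line nvargs are modeled as a Map of parameter name -> value.
object IfdefSketch {
  // ifdef($key, default): use the supplied value if present, else the default.
  def ifdef(args: Map[String, String], key: String, default: String): String =
    args.getOrElse(key, default)

  def main(argv: Array[String]): Unit = {
    // Only the required parameters are supplied, as on the command line.
    val args = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y")

    val fileX = args("X")                  // unguarded: throws if absent
    val fmtB  = ifdef(args, "fmt", "text") // guarded: defaults to "text"
    val dfam  = ifdef(args, "dfam", "1")   // guarded: defaults to "1"

    println(s"fileX=$fileX fmtB=$fmtB dfam=$dfam")
  }
}
```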
> By not guarding a dollar parameter with an ifdef, the DML script
> writer ensures that the user has to provide its value (in this case, the
> file names for X and Y). This is why you will notice that I have provided
> a space for X, Y, and B in the second MLContext snippet.
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Sourav Mazumder
> To: dev@systemml.incubator.apache.org
> Date: 12/07/2015 07:30 PM
> Subject: Using GLM with Spark
> ------------------------------
>
> Hi,
>
> Trying to use GLM with Spark.
>
> I went through the documentation at
> http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
> and see that inputs like X and Y have to be supplied using a file, and the
> file has to be in HDFS.
>
> Is this understanding correct? Can't X and Y be supplied using a DataFrame
> from a Spark context (as in the linear regression example in
> http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html#train-using-systemml-linear-regression-algorithm
> )?
>
> Regards,
> Sourav
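Pulling the two MLContext variants from Niketan's reply together, a minimal sketch of the DataFrame route follows. It assumes an existing SparkContext `sc` and DataFrames `xDF`/`yDF`, and requires SystemML on the classpath (the MLContext import path varies by release, so it is omitted); it is illustrative rather than directly runnable.

```scala
// Sketch of the DataFrame-based invocation described in the thread.
// Assumes: SparkContext `sc`, DataFrames xDF and yDF, SystemML on classpath.
val ml = new MLContext(sc)
ml.registerInput("X", xDF)
ml.registerInput("Y", yDF)

// $X, $Y, and $B are not ifdef-guarded in GLM.dml, so they must be given;
// passing single spaces tells SystemML to use the registered inputs instead
// of reading from files.
val cmdLineParams = Map(
  "X" -> " ", "Y" -> " ", "B" -> " ",
  "dfam" -> "2",
  "link" -> "2")
val out = ml.execute("GLM.dml", cmdLineParams)
```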