From: Colin Beckingham
Subject: Re: Run times for Spark 1.6.2 compared to 2.1.0?
To: user
Date: Thu, 28 Jul 2016 07:22:38 -0400

On 27/07/16 16:31, Colin Beckingham wrote:
> I have a project which runs fine in both Spark 1.6.2 and 2.1.0. It
> calculates a logistic model using MLlib. I compiled the 2.1 today from
> source and took the version 1 as a precompiled version with Hadoop.
> The odd thing is that on 1.6.2 the project produces an answer in 350
> sec and the 2.1.0 takes 990 sec. Identical code using pyspark. I'm
> wondering if there is something in the setup params for 1.6 and 2.1,
> say number of executors or memory allocation, which might account for
> this? I'm using just the 4 cores of my machine as master and executors.

FWIW I have a bit more information. Watching the jobs as Spark runs, I
can see that when performing the logistic regression in Spark 1.6.2, the
PySpark call "LogisticRegressionWithLBFGS.train()" runs "treeAggregate at
LBFGS.scala:218", but the same call in PySpark with Spark 2.1 runs
"treeAggregate at LogisticRegression.scala:1092".

This last job takes about 3 times longer to run than the LBFGS version,
there are far more of these calls, and the result is considerably less
accurate than the LBFGS one. The rest of the process seems to be pretty
close between the two versions.

So is Spark 2.1 not running an optimized version of the logistic
regression algorithm?
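One way to narrow this down might be to time only the training call and
to pass every tunable explicitly, so that both versions run the same
configuration instead of whatever defaults each release ships with. The
sketch below is only an illustration: the helper names (timed,
train_lbfgs), the variable "points" and the parameter values are
assumptions for the example, not taken from the project discussed here.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start


def train_lbfgs(points, iterations=100, reg_param=0.01, tolerance=1e-4):
    """Train a logistic model with all tunables spelled out.

    `points` is assumed to be an RDD of LabeledPoint. The values used
    here are placeholders, not recommendations; the point is only that
    they are identical on both Spark versions.
    """
    # Imported here so the timing helper can be used without Spark installed.
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS

    return LogisticRegressionWithLBFGS.train(
        points,
        iterations=iterations,
        regParam=reg_param,
        regType="l2",
        intercept=False,
        corrections=10,
        tolerance=tolerance,
    )


# Usage inside a PySpark job (points is a hypothetical RDD of LabeledPoint):
#   model, seconds = timed(train_lbfgs, points)
#   print("train took %.1f s" % seconds)
```

If the timings and the accuracy still differ with identical arguments,
the difference is more likely to come from the code path than from the
settings.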