From: Matthias Boehm
Date: Mon, 23 Apr 2018 20:58:15 -0700
Subject: Re: distributed cholesky on systemml
To: dev@systemml.apache.org
Cc: Qifan Pu

you're welcome. Let me just make some additional remarks for other people who might come across this thread.

First, please don't extrapolate characteristics from an experimental script and unoptimized operations such as cholesky to SystemML in general. Second, for compute-intensive algorithms, it's usually a good idea for performance to leverage native BLAS libraries and our GPU backend, which are still disabled by default. Third, while data-parallel operations indeed translate to individual RDD operations, there are also language constructs like parfor (parallel for loops), where the entire loop with all its iterations, nested loops, and function calls may be mapped to a single Spark job.

Regards,
Matthias
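To make the parfor remark concrete, here is a minimal DML sketch (a hypothetical example, not part of the original messages): a single data-parallel operation compiles to one distributed operation, whereas a parfor loop over independent iterations may be packaged as a whole into a single Spark job.

    # hypothetical DML example illustrating the parfor remark above
    X = rand(rows=1000000, cols=100)

    # a single data-parallel operation -> one distributed matrix multiply
    G = t(X) %*% X

    # a parfor loop over independent iterations; the whole loop (with all
    # iterations) may be compiled into a single Spark job
    R = matrix(0, rows=ncol(X), cols=1)
    parfor (j in 1:ncol(X)) {
      Xj = X[, j]
      R[j, 1] = t(Xj) %*% Xj   # squared column norm, a 1x1 matrix
    }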
On Mon, Apr 23, 2018 at 12:42 AM, Qifan Pu wrote:
> You are right. Switching to Spark mode makes it even slower... It seems that each operator will generate a Spark task converting things into RDD operators.
> Thanks so much for the patience and detailed instructions. I have a much better understanding of the system now.
>
> On Sun, Apr 22, 2018 at 7:47 PM, Matthias Boehm wrote:
>>
>> well, SystemML decides the execution type of each operation based on its worst-case memory estimate and the available driver memory budget to avoid unnecessary overheads for distributed operations. Maybe the size of the matrix that is fed into cholesky is smaller than the intermediates used for data generation? Btw, this is a characteristic we often see for other scripts (e.g., t(X)%*%X in LinearRegDS potentially runs over large matrices whereas the subsequent solve only works over a features-by-features matrix) - this was the reason why distributed operations for builtin functions like solve and cholesky had low priority in the past.
>>
>> If you're interested in understanding the execution plan of your scenarios better, you can use '-explain' or '-explain hops' to see the memory budgets, estimates, and selected execution types. Also, if you want to force all operations to spark, you can do so via '-exec spark'. However, note that this should be done only for debugging because forcing all operations to spark can be very counterproductive for performance.
>>
>> Regards,
>> Matthias
>>
>> On Sun, Apr 22, 2018 at 5:07 PM, Qifan Pu wrote:
>> > and everything before (e.g., I generate the matrix using another DML script) was indeed run by Spark and shows up on the UI.
>> >
>> > On Sun, Apr 22, 2018 at 5:05 PM, Qifan Pu wrote:
>> >>
>> >> Thanks Jeremy and Matthias. When I run the script, the cholesky or the inv is executed completely on the driver, and nothing shows up on the Spark UI. Is that the expected behavior?
>> >>
>> >> On Sun, Apr 22, 2018 at 3:34 PM, Jeremy Nilmeier wrote:
>> >>>
>> >>> Yes, I also spoke with Sasha about this some time last year. Thanks for following up.
>> >>>
>> >>> Cheers, J
>> >>>
>> >>> Jerome Nilmeier, PhD
>> >>> Data Scientist and Engineer
>> >>> IBM Spark Technology Center
>> >>> http://www.spark.tc/
>> >>>
>> >>> ----- Original message -----
>> >>> From: Matthias Boehm
>> >>> To: dev@systemml.apache.org
>> >>> Cc: Qifan Pu, Jeremy Nilmeier
>> >>> Subject: Re: distributed cholesky on systemml
>> >>> Date: Sun, Apr 22, 2018 2:41 PM
>> >>>
>> >>> thanks for the context Jeremy - that helps. I also had an offline conversation with Sasha and he pointed me to a script that does exactly that (iterative invert_lower_triangular) combined with a parfor over independent blocks. We'll merge these scripts soon and I'll reach out individually as necessary. Thanks everybody for now.
>> >>>
>> >>> Regards,
>> >>> Matthias
>> >>>
>> >>> On Sun, Apr 22, 2018 at 12:40 PM, Jeremy Nilmeier wrote:
>> >>> > This may be a duplicate...it was bounced from the dev list.
>> >>> >
>> >>> > I think that scalable triangular inverse will also have similar properties, in that there is a sequential approach if it uses back substitution.
>> >>> >
>> >>> > For most of these algorithms (LU, Cholesky, QR), they are inherently sequential, and the focus of the work is on minimizing interprocess communication during the operations, which may explain why there was only limited interest in pursuing this further.
>> >>> >
>> >>> > I had originally recommended that the recursive algorithms be rewritten as iterative algorithms (and in fact provided an example of the LU in iterative form), which would make the counting of operations more transparent, as well as revealing possible parallelization points.
>> >>> >
>> >>> > Cheers, J
>> >>> > Jerome Nilmeier, PhD
>> >>> > Data Scientist and Engineer
>> >>> > IBM Spark Technology Center
>> >>> > http://www.spark.tc/
>> >>> >
>> >>> > ----- Original message -----
>> >>> > From: Matthias Boehm
>> >>> > To: dev@systemml.apache.org
>> >>> > Cc: Qifan Pu
>> >>> > Subject: Re: distributed cholesky on systemml
>> >>> > Date: Sun, Apr 22, 2018 1:21 AM
>> >>> >
>> >>> > sure no problem - thanks again for catching this issue that was hidden for a while.
>> >>> >
>> >>> > Yes, the same depth-first characteristic applies to the Cholesky function as well. In contrast to U_triangular_inv, however, there are data dependencies between the blocks per level (at least in the current algorithm formulation), which means we cannot use the approach I described for U_triangular_inv.
>> >>> >
>> >>> > L11 = Cholesky(A11, nb)
>> >>> > A22 = ... U_triangular_inv(t(L11))
>> >>> > L22 = Cholesky(A22, nb)
>> >>> >
>> >>> > However, note that there are many fewer calls to Cholesky due to the switch to the builtin cholesky according to the given min block size. For example, in our new test for dimensions 1362 x 1362 and min size of 200, we call Cholesky 15 times but U_triangular_inv 2539 times.
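For context on the three quoted lines above, a simplified DML sketch of such a recursive blocked Cholesky is shown below. This is an illustration only, assuming the standard Schur-complement formulation; the actual staging script may differ in details, and U_triangular_inv refers to the upper-triangular inverse from the same staging scripts.

    # simplified sketch of the recursive blocked scheme (not the staging script itself);
    # U_triangular_inv is assumed to be in scope from the staging scripts
    Cholesky = function(Matrix[double] A, int nb) return (Matrix[double] L) {
      n = nrow(A)
      if (n <= nb) {
        L = cholesky(A)                                  # small block: builtin, single-node
      } else {
        k = as.integer(floor(n / 2))
        A11 = A[1:k, 1:k]
        A21 = A[k+1:n, 1:k]
        A22 = A[k+1:n, k+1:n]
        L11 = Cholesky(A11, nb)                          # recurse on the leading block
        L21 = A21 %*% U_triangular_inv(t(L11))           # distributed matrix multiplies
        L22 = Cholesky(A22 - L21 %*% t(L21), nb)         # recurse on the Schur complement
        L = rbind(cbind(L11, matrix(0, rows=k, cols=n-k)), cbind(L21, L22))
      }
    }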
>> >>> > For sufficiently large min block size this might be ok for Cholesky, because each level also does a number of matrix multiplies that will exploit the available parallelism of your cluster. In that regard, you might want to experiment with different block sizes and driver memory budgets. If I get a chance, I will also run a number of experiments and see if we can rewrite these scripts.
>> >>> >
>> >>> > Regards,
>> >>> > Matthias
>> >>> >
>> >>> > On Sun, Apr 22, 2018 at 12:48 AM, Qifan Pu wrote:
>> >>> >> Matthias,
>> >>> >>
>> >>> >> Thanks so much for taking time to fix. Really appreciate it. Does the same reasoning apply to the cholesky script? The recursive approach also looks inherently sequential.
>> >>> >>
>> >>> >> Best,
>> >>> >> Qifan
>> >>> >>
>> >>> >> On Sat, Apr 21, 2018 at 11:39 PM, Matthias Boehm wrote:
>> >>> >>>
>> >>> >>> just as a quick update: this issue has now been fixed in SystemML master - it was essentially a missing guard for recursive functions when checking for unary size-preserving functions during inter-procedural analysis (IPA).
>> >>> >>>
>> >>> >>> However, while working with this recursive cholesky function I came to the conclusion that it may need some rework. The current top-down, depth-first approach is inherently sequential. This is partially unnecessary because for the used recursive function U_triangular_inv (which is called many more times than cholesky), blocks per level are independent. Therefore, we should look into a bottom-up, breadth-first approach to parallelize over the blocks in each level, which could be done via parfor at script level.
>> >>> >>>
>> >>> >>> Regards,
>> >>> >>> Matthias
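As a rough illustration of this bottom-up idea (not the actual rework), one level of independent diagonal blocks could be processed with a parfor as in the following sketch. It assumes n is a multiple of the block size nb and that U_triangular_inv from the staging scalable_linalg scripts is in scope; off-diagonal blocks would then be combined level by level.

    # illustrative sketch only: invert the independent diagonal blocks of an
    # upper-triangular U in parallel (one level of a bottom-up scheme)
    nb = 200
    n  = 1000
    A  = rand(rows=n, cols=n)
    U  = t(cholesky(t(A) %*% A + diag(matrix(n, rows=n, cols=1))))  # some upper-triangular input
    R  = matrix(0, rows=n, cols=n)
    nblocks = as.integer(n / nb)                                    # assumes n %% nb == 0
    parfor (i in 1:nblocks) {
      lo = (i-1) * nb + 1
      hi = i * nb
      R[lo:hi, lo:hi] = U_triangular_inv(U[lo:hi, lo:hi])
    }
    # the off-diagonal blocks of the inverse would be combined afterwards (not shown)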
>> >>> >>> On Sat, Apr 21, 2018 at 6:59 PM, Matthias Boehm wrote:
>> >>> >>> > thanks for catching this - I just ran a toy example and this seems to be a rewrite issue (there are specific right indexing rewrites that collapse U[1:k,1:k] and U[1:k,k+1:n] into a single access to U which helps for large distributed matrices). As a workaround, you can set "sysml.optlevel" to 1 (instead of the default 2, where 1 disables all rewrites), which worked fine for me. I'll fix this later today. Also, I'll fix the naming from "Choleskey" to "Cholesky". Thanks again.
>> >>> >>> >
>> >>> >>> > Regards,
>> >>> >>> > Matthias
>> >>> >>> >
>> >>> >>> > On Sat, Apr 21, 2018 at 6:28 PM, Qifan Pu wrote:
>> >>> >>> >> Hi Matthias,
>> >>> >>> >>
>> >>> >>> >> Thanks for the fast response and detailed information. This is really helpful.
>> >>> >>> >>
>> >>> >>> >> I just tried to run it, and was tracing down an indexing bug that can be reproduced by simply running the test script for the triangular inverse [1]:
>> >>> >>> >>
>> >>> >>> >> Caused by: org.apache.sysml.runtime.DMLRuntimeException: Invalid values for matrix indexing: [1667:3333,1:1666] must be within matrix dimensions [1000,1000]
>> >>> >>> >>
>> >>> >>> >> Am I missing some configuration here?
>> >>> >>> >>
>> >>> >>> >> [1] https://github.com/apache/systemml/blob/master/scripts/staging/scalable_linalg/test/test_triangular_inv.dml
>> >>> >>> >>
>> >>> >>> >> Best,
>> >>> >>> >> Qifan
>> >>> >>> >>
>> >>> >>> >> On Sat, Apr 21, 2018 at 4:06 PM, Matthias Boehm wrote:
>> >>> >>> >>>
>> >>> >>> >>> Hi Qifan,
>> >>> >>> >>>
>> >>> >>> >>> thanks for your feedback. You're right, the builtin functions cholesky, inverse, eigen, solve, svd, qr, and lu are currently only supported as single-node operations because they're still implemented via Apache Commons Math.
>> >>> >>> >>>
>> >>> >>> >>> However, there is an experimental script for distributed cholesky [1] which uses a recursive approach (with operations that allow for automatic distributed computation) for matrices larger than a user-defined block size. Once blocks become small enough, we again use the builtin cholesky. Graduating this script would require a broader set of experiments (and potential improvements), but it simply did not have the highest priority so far. You might want to give it a try though.
>> >>> >>> >>>
>> >>> >>> >>> Thanks again for your feedback - we'll consider a higher priority for these distributed operations when discussing the roadmap for the next releases.
>> >>> >>> >>>
>> >>> >>> >>> [1] https://github.com/apache/systemml/blob/master/scripts/staging/scalable_linalg/cholesky.dml
>> >>> >>> >>>
>> >>> >>> >>> Regards,
>> >>> >>> >>> Matthias
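For anyone who wants to try the experimental script, a minimal usage sketch follows. The script path and the Cholesky(A, nb) call are taken from this thread; the data generation and block size are made-up examples, and the exact function name and signature should be verified against the script itself.

    # hedged usage sketch for the staging script referenced as [1] above
    source("scripts/staging/scalable_linalg/cholesky.dml") as sl

    n = 10000
    A = rand(rows=n, cols=n, min=-1, max=1)
    S = t(A) %*% A + diag(matrix(n, rows=n, cols=1))   # symmetric positive definite input
    L = sl::Cholesky(S, 1000)                          # recurse down to a 1000x1000 min block size
    print(sum(abs(L %*% t(L) - S)))                    # rough sanity check of the factorization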
>> >>> >>> >>>
>> >>> >>> >>> On Sat, Apr 21, 2018 at 2:15 PM, Qifan Pu wrote:
>> >>> >>> >>> > Hi,
>> >>> >>> >>> >
>> >>> >>> >>> > I would love to do distributed cholesky on a large matrix with SystemML. I found two related jiras (SYSTEMML-1213, SYSTEMML-1163), but AFAIK, this is currently not implemented? I just wanted to check.
>> >>> >>> >>> >
>> >>> >>> >>> > Best,
>> >>> >>> >>> > Qifan