systemml-dev mailing list archives

From "Matthias Boehm" <mbo...@us.ibm.com>
Subject Re: parfor fails
Date Fri, 15 Apr 2016 03:14:46 GMT

Just for completeness: this issue is tracked as
https://issues.apache.org/jira/browse/SYSTEMML-635, and the fix will be
available tomorrow.

Regards,
Matthias



From:	Matthias Boehm/Almaden/IBM@IBMUS
To:	dev@systemml.incubator.apache.org
Cc:	"Ethan Xu" <ethan.yifanxu@gmail.com>
Date:	04/14/2016 07:53 PM
Subject:	Re: parfor fails



Hi Ethan,

Thanks for catching this issue. The parfor script itself is perfectly fine,
but you encountered an interesting runtime bug. Usually, you can find the
actual cause at the bottom of the stack trace or in previous exceptions. I
was able to reproduce this issue when no SystemML config file is provided:
the parfor MR job fails during task setup while trying to parse the
non-existent config file. So the workaround is to put a SystemML-config.xml
into the same directory. Interestingly, the issue did not show up in our
test suite because we always specify a default configuration there (which
was mandatory until recently).

As a side note, we strongly recommend parfor over for loops here because,
thanks to automatic data partitioning, it runs the entire loop in 1 MR job
instead of 2396. However, for the specific example at hand, a data-parallel
formulation (with "s = colSums(x==0)") would be even better, as it allows
for partial aggregation and hence reduces shuffle.
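
A minimal sketch of that data-parallel formulation, assuming the same $X
input as in the original script. The final for loop is only an assumption
about how one might print the per-column counts; it iterates over the small
1 x ncol(x) result rather than over the large input matrix:

x = read($X);

# One aggregate over the whole matrix: (x == 0) is a 0/1 indicator matrix,
# and colSums returns a 1 x ncol(x) row vector of zero counts per column.
s = colSums(x == 0);

# Print from the small result vector (illustrative, not part of the fix).
for(i in 1:ncol(x)){
   print("number of 0's in col " + i + " = " + as.scalar(s[1,i]));
}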

Regards,
Matthias


From: Ethan Xu <ethan.yifanxu@gmail.com>
To: dev@systemml.incubator.apache.org
Date: 04/14/2016 01:34 PM
Subject: parfor fails



Hello,

I have a quick question. The following script fails with this error:

org.apache.sysml.runtime.DMLRuntimeException: PARFOR: Failed to execute
loop in parallel.

Here is the dml script:

x=read($X);

print("number of rows of x = " + nrow(x));
print("number of cols of x = " + ncol(x));

parfor(i in 1:ncol(x), check=0){
   a = x[,i];
   print("number of 0's in col " + i + " = " + sum(a == 0));
}

where X is a 35 million by 2396 numerical matrix (coded and dummy-coded)
on HDFS. The script runs fine with regular 'for' loops.

Could someone explain why this script cannot run in parallel? Is this the
wrong way to write a parfor loop?

Thanks,

Ethan


