systemml-dev mailing list archives

From "Matthias Boehm" <mbo...@us.ibm.com>
Subject Re: 'sample.dml' replaces rows with 0's
Date Fri, 15 Apr 2016 04:41:40 GMT

Well, it looks like an issue of incorrect metadata propagation (wrong
propagation of dimensions through MR pmm instructions). The data itself
looks good if I write a 20% sample to textcell (the format used in our
test suite).
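
One way to tell a metadata problem from a data problem is to compare the
'rows' value declared in an output's .mtd file with the rows actually
present in the data. The following is a minimal sketch, assuming a small CSV
output and its .mtd JSON have already been read into strings; the function
name check_dimensions and the toy values are hypothetical and not part of
SystemML.

```python
import json


def check_dimensions(mtd_text, csv_text):
    """Compare 'rows' declared in .mtd JSON with the rows found in CSV text.

    Hypothetical helper for illustration only; not a SystemML API.
    """
    declared = json.loads(mtd_text)["rows"]
    # Parse each non-blank CSV line into a list of cell strings.
    rows = [line.split(",") for line in csv_text.splitlines() if line.strip()]
    actual = len(rows)
    # Count rows whose cells are all numerically zero.
    all_zero = sum(1 for r in rows if all(float(v) == 0.0 for v in r))
    return {
        "declared_rows": declared,
        "actual_rows": actual,
        "all_zero_rows": all_zero,
        "nonzero_rows": actual - all_zero,
    }


# Toy input (assumed values): 5 rows declared, 2 of them all zeros.
mtd = ('{"data_type": "matrix", "value_type": "double", '
       '"rows": 5, "cols": 2, "format": "csv"}')
data = "1.0,2.0\n0,0\n3.0,4.0\n0,0\n5.0,6.0\n"

report = check_dimensions(mtd, data)
print(report)
```

If declared_rows matches the input size while nonzero_rows matches the
expected sample size, that would be consistent with correct data but
wrong dimensions in the metadata.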

@Shirish: thanks for looking into it. Just FYI, while testing this on an
ultra-sparse scenario, I also encountered a runtime issue with deep copying
of sparse rows (a fix will be available tomorrow), so for now don't worry
about it if you encounter the same issue.

Regards,
Matthias




From:	Shirish Tatikonda <shirish.tatikonda@gmail.com>
To:	dev@systemml.incubator.apache.org
Date:	04/14/2016 08:43 PM
Subject:	Re: 'sample.dml' replaces rows with 0's



Hi Ethan,

I just tried the script on a toy dataset and could reproduce this erroneous
behavior when run in Hadoop mode -- both local and Spark modes work
correctly. I will look into it.

BTW, you forgot to attach the scripts.

Shirish

On Thu, Apr 14, 2016 at 5:02 PM, Ethan Xu <ethan.yifanxu@gmail.com> wrote:

> OK this is interesting:
>
> Scenario 1
> I slightly modified 'sample.dml' to add statements to print dimensions of
> SM, P and iX, and ran it on the same data. The dimensions AND the output
> were correct. That is, subsets '1' and '2' contain roughly 80% and 20% of
> the original data.
>
> Please see attached:
> sample-debug.dml:
> sample.dml with 3 print functions inserted
> train-test-debug_1.mtd
> train-test-debug_2.mtd:
> meta data of outputs. Note 'rows' are correct.
>
>
> Scenario 2
> This is confusing, so I commented out the 'print' statements in
> 'sample.dml' and ran it on the same data, and the output was INCORRECT.
> That is, subsets '1' and '2' contain the same number of rows as the
> original data.
>
> Please see attached:
> sample-debug-noprint.dml:
> 3 print functions were commented out
> train-test-debug-noprint_1.mtd
> train-test-debug-noprint_2.mtd
> meta data of outputs. Note 'rows' are incorrect.
>
> There were no errors in either trial.
>
> Ethan
>
> On Thu, Apr 14, 2016 at 4:37 PM, Ethan Xu <ethan.yifanxu@gmail.com> wrote:
>
>> Hello,
>>
>> I encountered unexpected behavior from 'sample.dml' on a dataset on
>> Hadoop. Instead of splitting the data, it replaced rows of the original
>> data with 0's. Here are the details:
>>
>> I called sample.dml in an attempt to split a 35 million by 2396 numeric
>> matrix into two subsets of 80% and 20%. The two output subsets '1' and
>> '2' both still contain 35 million rows, instead of 80% and 20% of 35
>> million rows.
>>
>> However it looks like 20% of the rows in '1' are replaced with 0's (but
>> not removed). It is as if line 66 of sample.dml
>> (https://github.com/apache/incubator-systemml/blob/master/scripts/utils/sample.dml)
>> that calls removeEmpty() doesn't exist.
>>
>> Here is the submission script:
>>
>> printf "0.8\n0.2" | hadoop fs -put - /path/split-perc.csv
>> echo '{"data_type": "matrix", "value_type":"double", "rows": 2, "cols":
>> 1, "format": "csv"}' | hadoop fs -put - /path/split-perc.csv.mtd
>>
>> ## Split file to training and test sets
>> hadoop jar $sysJar /path.to.systemML/scripts/utils/sample.dml \
>>   -config=$sysConfCust -nvargs X=/path/originalData.csv \
>>   sv=/path/split-perc.csv O=/path/train-test ofmt=csv
>>
>>
>> There were no error messages and all MR jobs were executed successfully.
>> What other information can I provide to diagnose the issue?
>>
>> Thanks,
>>
>> Ethan
>>
>>
>>
>>
>>
>
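
To illustrate the symptom described in the original report (rows replaced
with 0's but not removed), here is a toy sketch of a row split where
unselected rows are zeroed out and then dropped by a separate
remove-empty-rows step. This is not the logic of sample.dml itself; the
function names split_rows and remove_empty_rows, the seed, and the toy
matrix are all hypothetical and serve only to show how the two stages
differ in row count.

```python
import random


def split_rows(matrix, fraction, seed=0):
    """Zero out rows that are not selected; the row count stays unchanged.

    Hypothetical helper: each row is kept with probability `fraction`.
    """
    rng = random.Random(seed)
    keep = [rng.random() < fraction for _ in matrix]
    zeroed = [list(row) if k else [0.0] * len(row)
              for row, k in zip(matrix, keep)]
    return zeroed, keep


def remove_empty_rows(matrix):
    """Drop rows whose cells are all zero (a toy stand-in for removing
    empty rows; note it would also drop genuinely all-zero input rows)."""
    return [row for row in matrix if any(v != 0.0 for v in row)]


# Toy 10 x 2 matrix with no all-zero rows (assumed data).
X = [[float(i + 1), float(i + 2)] for i in range(10)]

zeroed, keep = split_rows(X, 0.8)
compact = remove_empty_rows(zeroed)

print(len(X), len(zeroed), len(compact))
```

Without the second step, the output keeps as many rows as the input, with
the unselected rows appearing as zeros, which resembles what the report
describes.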

