systemml-dev mailing list archives

From Ethan Xu <ethan.yifa...@gmail.com>
Subject Re: 'sample.dml' replaces rows with 0's
Date Fri, 15 Apr 2016 15:57:55 GMT
Another attempt to attach scripts.

On Fri, Apr 15, 2016 at 11:51 AM, Ethan Xu <ethan.yifanxu@gmail.com> wrote:

> Thank you Shirish and Matthias for looking into this issue. I got some
> small updates from more runs.
>
> Shirish, my browser told me that the scripts were attached, so there must
> have been a connection issue. I have attached them again to this email and
> hope they got through this time. I also tested the same scripts on small toy
> data in local mode and they behaved correctly.
>
> Matthias, you mentioned that in your testsuite the metadata was incorrect
> but the dataset itself looked OK. In my case both the metadata and the data
> seem to be incorrect. Here is how this was confirmed:
>
> The output of sample-debug-noprint.dml (attached) contains 4 files:
> "1", "1.mtd" (attached as train-test-debug-noprint-1.mtd), "2", "2.mtd"
> (attached as train-test-debug-noprint-1.mtd).
> The auto generated metadata indicates there are 35478061 rows in "1".
>
>    1. I replaced the automatically generated metadata file of "1" with a
>    generic one (attached as 1-generic.mtd) which does not specify the number
>    of rows.
>    2. I ran a script (attached "countzeros.dml") to find the number of
>    rows, as well as the number of 0's in each column of "1". The script
>    returned that there were 35479057 rows in "1", which was 996 more than
>    what's shown in the metadata (???).
>    3. I ran the same script to count rows and 0's of the original data
>    set on which 'sample-debug-print.dml' was run. The number of rows was
>    35478061.
>    4. I found the difference of the number of 0's (by column) between the
>    original data and "1". The columns that contained no 0's in the
>    original data set had 7099710 zeros in "1", which is roughly 20% of the
>    row count.
>    5. Therefore it still looks like, for some reason,
>    'sample-debug-noprint.dml' randomly replaced 20% of the rows with 0's but
>    didn't remove them. Also, the sizes of the original data and "1" are 178G
>    and 186.3G on HDFS, respectively.
>
> I did use a custom configuration for all the submissions. The
> configuration file is also attached.
>
> Thanks,
>
> Ethan
>
>
>
>
>
>
>
> On Fri, Apr 15, 2016 at 12:41 AM, Matthias Boehm <mboehm@us.ibm.com>
> wrote:
>
>> well, it looks like an issue of incorrect meta data propagation (wrong
>> propagation of dimensions through mr pmm instructions). The data itself
>> looks good if I write a 20% sample to textcell (what is used in our
>> testsuite).
>>
>> @Shirish: thanks for looking into it. Just fyi, while testing this on an
>> ultra-sparse scenario, I also encountered a runtime issue of deep copying
>> sparse rows (fix will be available tomorrow), so for now don't worry about
>> it if you encounter the same issue.
>>
>> Regards,
>> Matthias
>>
>>
>>
>> From: Shirish Tatikonda <shirish.tatikonda@gmail.com>
>> To: dev@systemml.incubator.apache.org
>> Date: 04/14/2016 08:43 PM
>> Subject: Re: 'sample.dml' replaces rows with 0's
>> ------------------------------
>>
>>
>>
>> Hi Ethan,
>>
>> I just tried the script on a toy data and I could reproduce this erroneous
>> behavior when run in Hadoop mode -- both local and Spark modes are good. I
>> will look into it.
>>
>> BTW, you forgot to attach the scripts.
>>
>> Shirish
>>
>> On Thu, Apr 14, 2016 at 5:02 PM, Ethan Xu <ethan.yifanxu@gmail.com>
>> wrote:
>>
>> > OK this is interesting:
>> >
>> > Scenario 1
>> > I slightly modified 'sample.dml' to add statements to print dimensions of
>> > SM, P and iX, and ran it on the same data. The dimensions AND the output
>> > were correct. That is, subset '1' and '2' contain roughly 80% and 20% of
>> > original data.
>> >
>> > Please see attached:
>> > sample-debug.dml:
>> > sample.dml with 3 print functions inserted
>> > train-test-debug_1.mtd
>> > train-test-debug_2.mtd:
>> > meta data of outputs. Note 'rows' are correct.
>> >
>> >
>> > Scenario 2
>> > This is confusing, so I commented out the 'print' statements in
>> > 'sample.dml' and ran it on the same data, and the outputs were INCORRECT.
>> > That is, subsets '1' and '2' contain the same rows as the original data.
>> >
>> > Please see attached:
>> > sample-debug-noprint.dml:
>> > 3 print functions were commented out
>> > train-test-debug-noprint_1.mtd
>> > train-test-debug-noprint_2.mtd
>> > meta data of outputs. Note 'rows' are incorrect.
>> >
>> > There were no errors in either trial.
>> >
>> > Ethan
>> >
>> > On Thu, Apr 14, 2016 at 4:37 PM, Ethan Xu <ethan.yifanxu@gmail.com> wrote:
>> >
>> >> Hello,
>> >>
>> >> I encountered an unexpected behavior from 'sample.dml' on a dataset on
>> >> Hadoop. Instead of splitting the data, it replaced rows of the original
>> >> data with 0's. Here are the details:
>> >>
>> >> I called sample.dml in an attempt to split a 35 million by 2396 numeric
>> >> matrix into two subsets of 80% and 20%. The two output subsets '1' and '2'
>> >> both still contain 35 million rows, instead of 80% and 20% of the 35
>> >> million rows.
>> >>
>> >> However it looks like 20% of the rows in '1' are replaced with 0's (but
>> >> not removed). It is as if line 66 of sample.dml
>> >> (https://github.com/apache/incubator-systemml/blob/master/scripts/utils/sample.dml)
>> >> that calls removeEmpty() doesn't exist.
>> >>
>> >> Here is the submission script:
>> >>
>> >> printf "0.8\n0.2" | hadoop fs -put - /path/split-perc.csv
>> >> echo '{"data_type": "matrix", "value_type":"double", "rows": 2, "cols": 1, "format": "csv"}' | hadoop fs -put - /path/split-perc.csv.mtd
>> >>
>> >> ## Split file to training and test sets
>> >> hadoop jar $sysJar /path.to.systemML/scripts/utils/sample.dml \
>> >>   -config=$sysConfCust -nvargs X=/path/originalData.csv \
>> >>   sv=/path/split-perc.csv O=/path/train-test ofmt=csv
>> >>
>> >>
>> >> There were no error messages and all MR jobs were executed successfully.
>> >> What other information can I provide to diagnose the issue?
>> >>
>> >> Thanks,
>> >>
>> >> Ethan
>> >>
>> >>
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>
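
For reference, the row and zero count check described in steps 2 to 4 of the quoted message can be approximated outside SystemML as a cross-check. This is only a minimal sketch, not the attached "countzeros.dml": the helper name count_zeros is hypothetical, and it assumes plain numeric CSV input with no header line.

```shell
# Count rows and per-column zeros of a numeric CSV stream read from stdin.
# Hypothetical helper, not part of SystemML; assumes no header line.
count_zeros() {
  awk -F',' '
    {
      if (NF > ncols) ncols = NF            # track the widest row seen
      for (i = 1; i <= NF; i++)
        if ($i + 0 == 0) zeros[i]++         # numeric compare, so 0 and 0.0 both count
    }
    END {
      printf "rows=%d", NR
      for (i = 1; i <= ncols; i++) printf " col%d=%d", i, zeros[i] + 0
      printf "\n"
    }'
}

# Toy check: the second row is zero-filled rather than removed.
printf '1.5,2\n0,0.0\n3,4\n' | count_zeros
```

On the toy input this prints "rows=3 col1=1 col2=1", i.e. the zero-filled row is still counted as a row. On the cluster, the same function could presumably be fed from `hadoop fs -cat`, for example `hadoop fs -cat /path/train-test/1 | count_zeros`; the exact output path and whether "1" is a single file or a directory of part files are assumptions here. A single-process scan of a 178G file will be slow, so this is meant for spot checks on a sample rather than as a replacement for the DML script.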
