systemml-dev mailing list archives

From Ethan Xu <ethan.yifa...@gmail.com>
Subject Re: 'sample.dml' replaces rows with 0's
Date Fri, 15 Apr 2016 15:51:19 GMT
Thank you Shirish and Matthias for looking into this issue. I have some
small updates from additional runs.

Shirish, hmm, my browser showed the scripts as attached, so there must have
been a connection issue. I attached them again to this email; hope they get
through this time. I also tested the same scripts on small toy data in
local mode, and they behaved correctly.

Matthias, you mentioned that in your testsuite the metadata was incorrect but
the dataset itself looked OK. In my case, both the metadata and the data seem
to be incorrect. Here is how this was confirmed:

The output of sample-debug-noprint.dml (attached) contains 4 files:
"1", "1.mtd" (attached as train-test-debug-noprint-1.mtd), "2", "2.mtd"
(attached as train-test-debug-noprint-2.mtd).
The auto-generated metadata indicates there are 35478061 rows in "1".

   1. I replaced the automatically generated metadata file of "1" with a
   generic one (attached as 1-generic.mtd) which does not specify the number
   of rows.
   2. I ran a script (attached "countzeros.dml") to find the number of
   rows, as well as the number of 0's in each column of "1". The script
   reported 35479057 rows in "1", which is 996 more than what's shown in
   the metadata (???).
   3. I ran the same script to count rows and 0's of the original data set
   on which 'sample-debug-print.dml' was run. The number of rows was 35478061.
   4. I computed the difference in the number of 0's (by column) between
   the original data and "1". The columns that contained no 0's in the
   original data set had 7099710 zeros in "1", which is roughly 20% of the
   row count.
   5. Therefore it still looks like 'sample-debug-noprint.dml' randomly
   replaced 20% of the rows with 0's but didn't remove them. Also, the sizes
   of the original data and "1" on HDFS are 178G and 186.3G, respectively.
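For reference, the counting logic in step 2 is essentially the following
sketch (variable names and the ppred-based comparison are my assumptions;
the attached countzeros.dml may differ in detail):

```dml
# Sketch of the row/zero-counting logic (assumed names).
# $X is the input matrix path, $O the output path for the per-column counts.
X = read($X);                        # metadata without 'rows' forces a full scan
n = nrow(X);
# ppred(X, 0, "==") yields a 0/1 matrix marking the zero cells
zeros = colSums(ppred(X, 0, "=="));
print("number of rows: " + n);
write(zeros, $O, format="csv");
```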

I did use a custom configuration for all the submissions. The configuration
file is also attached.
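For completeness, the generic metadata file from step 1 simply drops the
dimension fields; modeled on the .mtd shown in my first email below, it is
along these lines (a sketch; the attached 1-generic.mtd may differ slightly):

```json
{"data_type": "matrix", "value_type": "double", "format": "csv"}
```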

Thanks,

Ethan

On Fri, Apr 15, 2016 at 12:41 AM, Matthias Boehm <mboehm@us.ibm.com> wrote:

> well, it looks like an issue of incorrect metadata propagation (wrong
> propagation of dimensions through MR pmm instructions). The data itself
> looks good if I write a 20% sample to textcell (what is used in our
> testsuite).
>
> @Shirish: thanks for looking into it. Just fyi, while testing this on an
> ultra-sparse scenario, I also encountered a runtime issue of deep copying
> sparse rows (fix will be available tomorrow), so for now don't worry about
> it if you encounter the same issue.
>
> Regards,
> Matthias
>
>
>
> From: Shirish Tatikonda <shirish.tatikonda@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 04/14/2016 08:43 PM
> Subject: Re: 'sample.dml' replaces rows with 0's
> ------------------------------
>
>
>
> Hi Ethan,
>
> I just tried the script on toy data and I could reproduce this erroneous
> behavior when run in Hadoop mode -- both local and Spark modes are good. I
> will look into it.
>
> BTW, you forgot to attach the scripts.
>
> Shirish
>
> On Thu, Apr 14, 2016 at 5:02 PM, Ethan Xu <ethan.yifanxu@gmail.com> wrote:
>
> > OK this is interesting:
> >
> > Scenario 1
> > I slightly modified 'sample.dml' to add statements to print dimensions of
> > SM, P and iX, and ran it on the same data. The dimensions AND the output
> > were correct. That is, subset '1' and '2' contain roughly 80% and 20% of
> > original data.
> >
> > Please see attached:
> > sample-debug.dml:
> > sample.dml with 3 print functions inserted
> > train-test-debug_1.mtd,
> > train-test-debug_2.mtd:
> > metadata of outputs. Note 'rows' are correct.
> >
> >
> > Scenario 2
> > This is confusing, so I commented out the 'print' statements in
> > 'sample.dml' and ran it on the same data, and the output was INCORRECT.
> > That is, subsets '1' and '2' contain the same rows as the original data.
> >
> > Please see attached:
> > sample-debug-noprint.dml:
> > 3 print functions were commented out
> > train-test-debug-noprint_1.mtd,
> > train-test-debug-noprint_2.mtd:
> > metadata of outputs. Note 'rows' are incorrect.
> >
> > There were no errors in either trial.
> >
> > Ethan
> >
> > On Thu, Apr 14, 2016 at 4:37 PM, Ethan Xu <ethan.yifanxu@gmail.com>
> wrote:
> >
> >> Hello,
> >>
> >> I encountered an unexpected behavior from 'sample.dml' on a dataset on
> >> Hadoop. Instead of splitting the data, it replaced rows of original data
> >> with 0's. Here are the details:
> >>
> >> I called sample.dml in an attempt to split a 35 million by 2396 numeric
> >> matrix into 80% and 20% subsets. The two output subsets '1' and '2' both
> >> still contain 35 million rows, instead of roughly 80% and 20% of the
> >> rows.
> >>
> >> However, it looks like 20% of the rows in '1' were replaced with 0's (but
> >> not removed). It is as if line 66 of sample.dml (
> >> https://github.com/apache/incubator-systemml/blob/master/scripts/utils/sample.dml
> >> ) that calls removeEmpty() doesn't exist.
> >>
> >> Here is the submission script:
> >>
> >> printf "0.8\n0.2" | hadoop fs -put - /path/split-perc.csv
> >> echo '{"data_type": "matrix", "value_type":"double", "rows": 2, "cols":
> >> 1, "format": "csv"}' | hadoop fs -put - /path/split-perc.csv.mtd
> >>
> >> ## Split file to training and test sets
> >> hadoop jar $sysJar /path.to.systemML/scripts/utils/sample.dml
> >> -config=$sysConfCust -nvargs X=/path/originalData.csv
> >> sv=/path/split-perc.csv O=/path/train-test ofmt=csv
> >>
> >>
> >> There were no error messages, and all MR jobs executed successfully.
> >> What other information can I provide to diagnose the issue?
> >>
> >> Thanks,
> >>
> >> Ethan
> >>
> >>
> >>
> >>
> >>
> >
>
>
>
