systemml-dev mailing list archives

From "Alok Singh" <sing...@us.ibm.com>
Subject Fw: Questions/query about recode / transform in systemML
Date Tue, 24 May 2016 05:23:24 GMT
Hi 

Forwarding this to the dev list, as Matthias suggested.

Alok

----- Forwarded by Alok Singh/San Francisco/IBM on 05/23/2016 10:04 PM -----

From:   Matthias Boehm/Almaden/IBM
To:     Alok Singh/San Francisco/IBM@IBMUS
Cc:     Arvind Surve/San Jose/IBM@IBMUS
Date:   05/23/2016 09:02 PM
Subject:        Re: Questions/query about recode / transform in systemML


Hi Alok,

would you mind posting this question on our dev mailing list so that 
other people can also benefit from it? Thanks.


Regards,
Matthias



From:   Alok Singh/San Francisco/IBM
To:     Matthias Boehm/Almaden/IBM@IBMUS, Arvind Surve/San Jose/IBM@IBMUS
Date:   05/23/2016 07:19 PM
Subject:        Questions/query about recode / transform in systemML



Hi Matthias and Arvind.
 
I have some questions about the internals of SystemML transform, in 
particular how many scans over the data it makes.
 
 
Question 1
 
Let's consider an example dataframe (the first line is the schema):
 
userID, county, state
=====================
1, sanJose, CA
2, santaClara, CA
3, sanJose, CA
4, alameda, CA
5, minnepolis, MN
 
 
We can see that the unique values for county are {sanJose, santaClara, 
alameda, minnepolis} and for state are {CA, MN}.
 
Following the example spec in the docs at 
http://apache.github.io/incubator-systemml/files/dml-language-reference/data.spec.json
 
suppose the user passes a spec file containing
"recode": ["county", "state"]
 
Then the question is how many passes SystemML makes over the dataframe.
In general, I would expect the recode algorithm to be:
 
for column in columns:
   step 1) find the unique values of the column
   step 2) apply the recode values to the column
 
 
So does this mean we need 2*count(columns) passes over the dataframe?
 
If not, how does SystemML internally get by with fewer than 
2*count(columns) passes?
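 
To make the question concrete, here is a minimal Python sketch of a generic recode that needs only two passes in total, no matter how many columns are recoded. It says nothing about what SystemML actually does internally. The function name recode, the first-seen order of ID assignment, and IDs starting at 1 are all assumptions made for illustration.

```python
def recode(rows, columns):
    """Recode the given columns in two passes over rows, independent of len(columns)."""
    # Pass 1: build one value -> ID map per column in a single scan.
    # Assumption: IDs are assigned in first-seen order, starting at 1.
    maps = {c: {} for c in columns}
    for row in rows:
        for c in columns:
            value = row[c].strip()
            if value not in maps[c]:
                maps[c][value] = len(maps[c]) + 1

    # Pass 2: replace every value by its ID, again in a single scan.
    recoded = []
    for row in rows:
        new_row = dict(row)
        for c in columns:
            new_row[c] = maps[c][row[c].strip()]
        recoded.append(new_row)
    return recoded, maps


# The example dataframe from Question 1.
rows = [
    {"userID": "1", "county": "sanJose", "state": "CA"},
    {"userID": "2", "county": "santaClara", "state": "CA"},
    {"userID": "3", "county": "sanJose", "state": "CA"},
    {"userID": "4", "county": "alameda", "state": "CA"},
    {"userID": "5", "county": "minnepolis", "state": "MN"},
]
recoded, maps = recode(rows, ["county", "state"])
```

With this structure the per-column loop sits inside each scan rather than around it, which is why the pass count stays at two.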
 
Question 2
 
Let's consider another dataframe (the first line is the schema):
 
random_string
===========
col1
dsfsdf
xcvxcv
sdf
etc
foo
Dummy
 
Clearly the number of unique values in this dataframe will be almost the 
same as the number of rows.
What if the number of rows is 10 trillion and the number of unique values 
in column random_string is also 10 trillion?
In that case the full set of unique values will not fit on one node, so 
how does SystemML handle it?
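 
One generic way to avoid holding all unique values on a single node is to hash-partition them, so that each node owns only the values that hash to it. The Python sketch below simulates this in one process and is not a description of SystemML's implementation. The helper names (partition_of, build_partitioned_maps, lookup), the use of CRC32 as the hash, and the offset scheme for making IDs globally unique are all assumptions for illustration.

```python
import zlib


def partition_of(value, num_partitions):
    # CRC32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(value.encode("utf-8")) % num_partitions


def build_partitioned_maps(values, num_partitions):
    # Each partition (one per node in a real cluster) keeps only the
    # distinct values that hash to it, with partition-local IDs from 1.
    parts = [dict() for _ in range(num_partitions)]
    for v in values:
        p = parts[partition_of(v, num_partitions)]
        if v not in p:
            p[v] = len(p) + 1

    # Offsets shift the local IDs so that they are unique across partitions.
    offsets = []
    total = 0
    for p in parts:
        offsets.append(total)
        total += len(p)
    return parts, offsets


def lookup(value, parts, offsets):
    # Only the owning partition is consulted to recode a value.
    i = partition_of(value, len(parts))
    return offsets[i] + parts[i][value]


values = ["dsfsdf", "xcvxcv", "sdf", "etc", "foo", "Dummy", "sdf"]
parts, offsets = build_partitioned_maps(values, 3)
codes = [lookup(v, parts, offsets) for v in values]
```

Here every partition stores roughly 1/num_partitions of the unique values, and the only globally shared state is the small list of offsets.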
 
 
Thanks for your input,
Alok



