spark-user mailing list archives

From Anil Langote <anillangote0...@gmail.com>
Subject DataSet is not able to handle 50,000 columns to sum
Date Fri, 11 Nov 2016 17:57:23 GMT
Hi All,

 

I have been working on one use case and haven't been able to come up with a good solution.
I have seen you are very active on the Spark user list, so please share your thoughts on the
implementation. The requirement is below.

 

I have tried using a Dataset and splitting the double array into one column per element, but
that fails once the array grows. If I instead declare the column as an array-of-double data
type in the schema, Spark doesn't allow me to sum it, because sum is supported only on numeric
types. And if I think about storing one file per combination in Parquet, there would be far
too many Parquet files.
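For reference, here is roughly what the failing attempt looks like (a minimal sketch in
spark-shell style; the input path and the raw string column name "doubles" are placeholders):

import org.apache.spark.sql.functions.{col, split}

// Hypothetical load: the attribute columns plus the 50,000 doubles arriving as
// one whitespace-separated string column, converted to a proper array<double>.
val df = spark.read.parquet("/path/to/input")
  .withColumn("DoubleArray", split(col("doubles"), "\\s+").cast("array<double>"))

// This is the step Spark rejects: sum() works only on numeric types,
// not on an array<double> column.
// df.groupBy("Attribute_0", "Attribute_1").sum("DoubleArray")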

 

Input: the input file will look like the table below; in the real data there will be 20
attributes and the double array will have 50,000 elements.

Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray
5 | 3 | 5 | 3 | 0.2938933463658645  0.0437040427073041  0.23002681025029648  0.18003221216680454
3 | 2 | 1 | 3 | 0.5353599620508771  0.026777650111232787  0.31473082754161674  0.2647786522276575
5 | 3 | 5 | 2 | 0.8803063581705307  0.8101324740101096  0.48523937757683544  0.5897714618376072
3 | 2 | 1 | 3 | 0.33960064683141955  0.46537001358164043  0.543428826489435  0.42653939565053034
2 | 2 | 0 | 5 | 0.5108235777360906  0.4368119043922922  0.8651556676944931  0.7451477943975504

 

Now, below are all the possible combinations for the above data set:

1.  Attribute_0, Attribute_1
2.  Attribute_0, Attribute_2
3.  Attribute_0, Attribute_3
4.  Attribute_1, Attribute_2
5.  Attribute_2, Attribute_3
6.  Attribute_1, Attribute_3
7.  Attribute_0, Attribute_1, Attribute_2
8.  Attribute_0, Attribute_1, Attribute_3
9.  Attribute_0, Attribute_2, Attribute_3
10. Attribute_1, Attribute_2, Attribute_3
11. Attribute_1, Attribute_2, Attribute_3, Attribute_4

We have to process all of these combinations on the input data, preferably in parallel, to
get good performance (a sketch of generating them follows this paragraph). Take the first
combination, Attribute_0, Attribute_1: in this iteration the other attributes (Attribute_2,
Attribute_3) are not required; all we need is the Attribute_0 and Attribute_1 columns plus
the double array column. Looking at the data above, there are two keys that occur more than
once, 5_3 and 3_2, and we pick only the keys that occur at least twice; in real data there
will be thousands of such keys.
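Generating the combinations themselves seems easy enough; something like this minimal sketch
is what I have in mind (spark-shell style Scala; attrs would be the 20 attribute columns in
the real data):

// All combinations of 2..n attribute columns: C(4,2) + C(4,3) + C(4,4) = 11
// for the 4-attribute sample, far more for 20 attributes.
val attrs = (0 to 3).map(i => s"Attribute_$i")
val combos = (2 to attrs.length).flatMap(k => attrs.combinations(k).map(_.toList))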

 

When we do a groupBy on the above dataset with the columns Attribute_0 and Attribute_1 we
get two keys, 5_3 and 3_2, and each key ends up with two double arrays (a sketch of this
step follows the example output below).

 

5_3 ==> [0.2938933463658645, 0.0437040427073041, 0.23002681025029648, 0.18003221216680454]
        [0.8803063581705307, 0.8101324740101096, 0.48523937757683544, 0.5897714618376072]

3_2 ==> [0.5353599620508771, 0.026777650111232787, 0.31473082754161674, 0.2647786522276575]
        [0.33960064683141955, 0.46537001358164043, 0.543428826489435, 0.42653939565053034]
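That grouping is roughly the following (a sketch that continues the hypothetical df from
above):

import org.apache.spark.sql.functions.{col, collect_list, size}

// One output row per key, with all of that key's double arrays collected,
// keeping only keys that occur at least twice.
val grouped = df
  .groupBy("Attribute_0", "Attribute_1")
  .agg(collect_list("DoubleArray").as("arrays"))
  .where(size(col("arrays")) >= 2)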

 

Now we have to add these double arrays index-wise to produce a single array (a sketch of
this step follows the result below):

 

5_3 ==>  [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]

3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]
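The index-wise addition could be a small UDF over the collected arrays; a sketch (untested
at 50,000 elements, which is exactly the part I am unsure about):

import org.apache.spark.sql.functions.{col, udf}

// Zip the arrays pairwise and add position by position, reducing however
// many arrays a key has down to one summed array.
val sumArrays = udf { (arrays: Seq[Seq[Double]]) =>
  arrays.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
}
val summed = grouped.withColumn("summed", sumArrays(col("arrays")))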

 

After adding, we have to compute the average, min, max, etc. over these summed vectors and
store the results against the keys.
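For the statistics step, again only a sketch, reusing the hypothetical summed from above:

import org.apache.spark.sql.functions.{col, udf}

// Average, min and max over each summed vector, kept next to its key.
val vecStats = udf { (v: Seq[Double]) => (v.sum / v.length, v.min, v.max) }
val results = summed
  .withColumn("stats", vecStats(col("summed")))
  .select("Attribute_0", "Attribute_1", "stats")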

 

The same process will then be repeated for each of the remaining combinations.

Thank you

Anil Langote

+1-425-633-9747

 

