spark-user mailing list archives

From Anil Langote <anillangote0...@gmail.com>
Subject Re: DataSet is not able to handle 50,000 columns to sum
Date Sat, 12 Nov 2016 03:33:28 GMT
All right, thanks for the inputs. Is there any way Spark can process all the combinations in
parallel in one job?

Is it OK to load the input CSV file into a DataFrame and use flatMap to create key pairs, then
use reduceByKey to sum the double arrays? I believe that would work the same as the agg function
you are suggesting.
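A minimal sketch of that idea, assuming Spark 2.x and that the last CSV column holds the
space-separated doubles (paths and column layout are illustrative; a plain map is enough to
build the key pairs here):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ArraySumSketch").getOrCreate()

    // One combination at a time, e.g. (Attribute_0, Attribute_1) as column indices.
    val keyCols = Seq(0, 1)

    val summed = spark.read.option("header", "true").csv("/path/to/input.csv")
      .rdd
      .map { row =>
        val key    = keyCols.map(i => row.getString(i)).mkString("_")        // e.g. "5_3"
        val values = row.getString(row.length - 1).trim.split("\\s+").map(_.toDouble)
        (key, values)
      }
      .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })          // index-wise sum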

Best Regards,
Anil Langote
+1-425-633-9747

> On Nov 11, 2016, at 7:10 PM, ayan guha <guha.ayan@gmail.com> wrote:
> 
> You can explore grouping sets in SQL and write an aggregate function that does an array-wise sum.
> 
> It will boil down to something like
> 
> SELECT attr1, attr2, ..., yourAgg(val)
> FROM t
> GROUP BY attr1, attr2, ...
> GROUPING SETS ((attr1, attr2), (attr1))
> 
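A rough, hedged sketch of what the array-wise sum aggregate (the yourAgg above) could look like
as a Spark 2.x UserDefinedAggregateFunction; the class name and the fixed array length are
illustrative:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // size = length of the double array (50,000 in the real data); illustrative.
    class ArraySum(size: Int) extends UserDefinedAggregateFunction {
      override def inputSchema: StructType =
        StructType(StructField("values", ArrayType(DoubleType)) :: Nil)
      override def bufferSchema: StructType =
        StructType(StructField("sum", ArrayType(DoubleType)) :: Nil)
      override def dataType: DataType = ArrayType(DoubleType)
      override def deterministic: Boolean = true

      // Start from a zero vector of the expected length.
      override def initialize(buffer: MutableAggregationBuffer): Unit =
        buffer(0) = Seq.fill(size)(0.0)

      // Add one input array into the running sum, index by index.
      override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        val acc = buffer.getSeq[Double](0)
        val in  = input.getSeq[Double](0)
        buffer(0) = acc.zip(in).map { case (a, b) => a + b }
      }

      // Combine two partial sums coming from different partitions.
      override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
        val a = buffer1.getSeq[Double](0)
        val b = buffer2.getSeq[Double](0)
        buffer1(0) = a.zip(b).map { case (x, y) => x + y }
      }

      override def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
    }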
>> On 12 Nov 2016 04:57, "Anil Langote" <anillangote0106@gmail.com> wrote:
>> Hi All,
>> 
>>  
>> 
>> I have been working on one use case and couldn't think of a better solution. I have seen you
>> are very active on the Spark user list, so please share your thoughts on the implementation.
>> Below is the requirement.
>> 
>>  
>> 
>> I have tried using a Dataset by splitting the double array into separate columns, but it
>> fails when the array size grows. When I define the column as a double array type, Spark
>> doesn't let me sum it, because sum works only on numeric types. And if I store the data per
>> combination in Parquet, there will be too many Parquet files.
>> 
>>  
>> 
>> Input: the input file will look like the sample below; in the real data there will be 20
>> attributes and the double array will have 50,000 elements.
>> 
>>  
>> 
>>  
>> 
>> Attribute_0  Attribute_1  Attribute_2  Attribute_3  DoubleArray
>> 5            3            5            3            0.2938933463658645  0.0437040427073041  0.23002681025029648  0.18003221216680454
>> 3            2            1            3            0.5353599620508771  0.026777650111232787  0.31473082754161674  0.2647786522276575
>> 5            3            5            2            0.8803063581705307  0.8101324740101096  0.48523937757683544  0.5897714618376072
>> 3            2            1            3            0.33960064683141955  0.46537001358164043  0.543428826489435  0.42653939565053034
>> 2            2            0            5            0.5108235777360906  0.4368119043922922  0.8651556676944931  0.7451477943975504
>> 
>>  
>> 
>> Now, below are all the possible combinations for the above data set:
>> 
>>  
>> 
>> 1.  Attribute_0, Attribute_1
>> 2.  Attribute_0, Attribute_2
>> 3.  Attribute_0, Attribute_3
>> 4.  Attribute_1, Attribute_2
>> 5.  Attribute_2, Attribute_3
>> 6.  Attribute_1, Attribute_3
>> 7.  Attribute_0, Attribute_1, Attribute_2
>> 8.  Attribute_0, Attribute_1, Attribute_3
>> 9.  Attribute_0, Attribute_2, Attribute_3
>> 10. Attribute_1, Attribute_2, Attribute_3
>> 11. Attribute_0, Attribute_1, Attribute_2, Attribute_3
>> 
>>  
>> 
>> Now we have to process all these combinations on the input data, preferably in parallel, to
>> get good performance.
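As a hedged aside, the combinations above could also be generated from the attribute column
names rather than listed by hand (Scala sketch; names illustrative):

    // All attribute combinations of size 2, 3 and 4 (11 in total for 4 attributes).
    val attrs = Seq("Attribute_0", "Attribute_1", "Attribute_2", "Attribute_3")
    val combos: Seq[Seq[String]] =
      (2 to attrs.length).flatMap(k => attrs.combinations(k).toSeq)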
>> 
>>  
>> 
>> Attribute_0, Attribute_1
>> 
>>  
>> 
>> In this iteration the other attributes (Attribute_2, Attribute_3) are not required; all we
>> need is Attribute_0, Attribute_1 and the double array column. If you look at the data there
>> are two key combinations, 5_3 and 3_2, and we have to pick only the keys that occur at least
>> twice; in the real data we will get thousands of them.
>> 
>>  
>> 
>>  
>> 
>> Attribute_0  Attribute_1  Attribute_2  Attribute_3  DoubleArray
>> 5            3            5            3            0.2938933463658645  0.0437040427073041  0.23002681025029648  0.18003221216680454
>> 3            2            1            3            0.5353599620508771  0.026777650111232787  0.31473082754161674  0.2647786522276575
>> 5            3            5            2            0.8803063581705307  0.8101324740101096  0.48523937757683544  0.5897714618376072
>> 3            2            1            3            0.33960064683141955  0.46537001358164043  0.543428826489435  0.42653939565053034
>> 2            2            0            5            0.5108235777360906  0.4368119043922922  0.8651556676944931  0.7451477943975504
>> 
>>  
>> 
>> When we do the groupBy on the above dataset with columns Attribute_0 and Attribute_1, we get
>> two records with keys 5_3 and 3_2, and each key has two double arrays.
>> 
>>  
>> 
>> 5_3 ==> 0.2938933463658645  0.0437040427073041  0.23002681025029648  0.18003221216680454
>>         & 0.8803063581705307  0.8101324740101096  0.48523937757683544  0.5897714618376072
>> 
>>  
>> 
>> 3_2 ==> 0.5353599620508771  0.026777650111232787  0.31473082754161674  0.2647786522276575
>>         & 0.33960064683141955  0.46537001358164043  0.543428826489435  0.42653939565053034
>> 
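A hedged sketch of this groupBy step, reusing the ArraySum aggregate sketched earlier (it
assumes the DoubleArray column has already been parsed into array<double>; names illustrative):

    import org.apache.spark.sql.functions.col

    // df: the input DataFrame, with DoubleArray already parsed into array<double>.
    val sums = df
      .groupBy(col("Attribute_0"), col("Attribute_1"))
      .agg(new ArraySum(4)(col("DoubleArray")).as("summed"))   // 4 = toy array length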
>>  
>> 
>> Now we have to add these double arrays index-wise to produce one array:
>> 
>>  
>> 
>> 5_3 ==>  [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
>> 
>> 3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]
>> 
>>  
>> 
>> After adding, we have to compute the average, min, max, etc. on these vectors and store the
>> results against the keys.
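A tiny hedged sketch of that last step (which exact statistics are needed is not spelled out in
the thread, so these are illustrative):

    // Simple statistics over one summed vector.
    def vectorStats(v: Seq[Double]): (Double, Double, Double) =
      (v.sum / v.length, v.min, v.max)   // (average, min, max)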
>> 
>>  
>> 
>> The same process will be repeated for the other combinations.
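One possible way to run the per-combination aggregations concurrently within a single Spark
application is to submit each groupBy as its own action from a Future (often together with the
FAIR scheduler); a hedged sketch reusing the combos and ArraySum pieces sketched above, with
illustrative names:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.functions.col

    // Each Future triggers one aggregation; Spark schedules the jobs concurrently.
    val perCombo = combos.map { combo =>
      Future {
        combo.mkString("_") ->
          df.groupBy(combo.map(c => col(c)): _*)
            .agg(new ArraySum(4)(col("DoubleArray")).as("summed"))
            .collect()
      }
    }
    val results = Await.result(Future.sequence(perCombo), Duration.Inf)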
>> 
>> 
>> Thank you
>> 
>> Anil Langote
>> 
>> +1-425-633-9747
>> 
>>  
