spark-user mailing list archives

From Anil Langote <anillangote0...@gmail.com>
Subject Re: Efficient look up in Key Pair RDD
Date Mon, 09 Jan 2017 01:37:19 GMT
Sure, let me explain my requirement. I have an input file which has 25 attributes, and the
last column is an array of doubles (14,500 elements in the original file).
 
Attribute_0  Attribute_1  Attribute_2  Attribute_3  DoubleArray
5            3            5            3            0.2938933463658645  0.0437040427073041  0.23002681025029648  0.18003221216680454
3            2            1            3            0.5353599620508771  0.026777650111232787  0.31473082754161674  0.2647786522276575
5            3            5            2            0.8803063581705307  0.8101324740101096  0.48523937757683544  0.5897714618376072
3            2            1            3            0.33960064683141955  0.46537001358164043  0.543428826489435  0.42653939565053034
2            2            0            5            0.5108235777360906  0.4368119043922922  0.8651556676944931  0.7451477943975504
 
Now I have to compute the element-wise sum of the double arrays for any given combination of
attributes. For example, in the above file we will have the possible combinations listed
below (a small sketch for enumerating them follows the list):
 
1.  Attribute_0, Attribute_1
2.  Attribute_0, Attribute_2
3.  Attribute_0, Attribute_3
4.  Attribute_1, Attribute_2
5.  Attribute_2, Attribute_3
6.  Attribute_1, Attribute_3
7.  Attribute_0, Attribute_1, Attribute_2
8.  Attribute_0, Attribute_1, Attribute_3
9.  Attribute_0, Attribute_2, Attribute_3
10. Attribute_1, Attribute_2, Attribute_3
11. Attribute_0, Attribute_1, Attribute_2, Attribute_3
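
A minimal sketch (plain Scala, using only the four attribute names from the sample above; the
real file has 25 attributes) of how such combinations can be enumerated:

val attributes = Seq("Attribute_0", "Attribute_1", "Attribute_2", "Attribute_3")
// Every combination of two or more attributes, in the order the columns appear.
val allCombinations: Seq[Seq[String]] =
  (2 to attributes.length).flatMap(k => attributes.combinations(k))
allCombinations.foreach(c => println(c.mkString(", ")))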
 
Now if we process the Attribute_0, Attribute_1 combination we want the output below: the key
is Attribute_0_Attribute_1, and the value is the element-wise sum of the DoubleArray column
over all rows sharing that key. We have to process all the combinations above in the same way.
 
5_3 ==>  [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]
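
To make the arithmetic explicit, here is a plain-Scala sketch (variable names are mine) of how
the 5_3 entry is produced from the two sample rows whose Attribute_0_Attribute_1 key is 5_3:

// Rows 1 and 3 of the sample table both have Attribute_0 = 5 and Attribute_1 = 3.
val row1 = Array(0.2938933463658645, 0.0437040427073041, 0.23002681025029648, 0.18003221216680454)
val row3 = Array(0.8803063581705307, 0.8101324740101096, 0.48523937757683544, 0.5897714618376072)
// Element-wise addition; the first element is approximately 1.1741997045363952, matching the 5_3 output above.
val sum_5_3 = row1.zip(row3).map { case (a, b) => a + b }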
 
Solution tried
 
I have created a Parquet file which has this schema, with the last column being an array of
doubles. The Parquet file is 276 GB and has 2.65 M records.
 
I have implemented a UDAF which has:
 
Input schema : array of doubles
Buffer schema : array of doubles 
Return schema : array of doubles
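
For illustration, a minimal sketch of what such a UDAF could look like on Spark 2.x (this is
not my exact code; the class name, the null guard, and the fixed array-size constructor
argument are illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Element-wise sum of an array-of-doubles column; `size` is the array length (14,500 here).
class ArraySum(size: Int) extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(StructField("value", ArrayType(DoubleType)) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("sum", ArrayType(DoubleType)) :: Nil)
  override def dataType: DataType = ArrayType(DoubleType)
  override def deterministic: Boolean = true

  // Start from an all-zero accumulator of the expected length.
  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.fill(size)(0.0)

  // Add one input row's array into the accumulator.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      val acc = buffer.getSeq[Double](0)
      val in  = input.getSeq[Double](0)
      buffer(0) = acc.zip(in).map { case (a, b) => a + b }
    }

  // Combine two partial accumulators.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val a = buffer1.getSeq[Double](0)
    val b = buffer2.getSeq[Double](0)
    buffer1(0) = a.zip(b).map { case (x, y) => x + y }
  }

  override def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
}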
 
I load the data from the Parquet file and then register the UDAF to use with the query below;
note that SUM here is the UDAF:
 
SELECT COUNT(*) AS MATCHES, SUM(DOUBLEARRAY), Attribute_0, Attribute_1
FROM RAW_TABLE
GROUP BY Attribute_0, Attribute_1
HAVING COUNT(*) > 1
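
A sketch of how this is wired up (the path is illustrative, `spark` is an existing
SparkSession, and ArraySum is the UDAF sketched above; here it is registered under the
distinct name SUM_ARRAY rather than SUM, to avoid confusion with the built-in aggregate):

val raw = spark.read.parquet("/path/to/raw_table")  // illustrative path
raw.createOrReplaceTempView("RAW_TABLE")
spark.udf.register("SUM_ARRAY", new ArraySum(14500))

val result = spark.sql(
  """SELECT COUNT(*) AS MATCHES, SUM_ARRAY(DOUBLEARRAY), Attribute_0, Attribute_1
    |FROM RAW_TABLE
    |GROUP BY Attribute_0, Attribute_1
    |HAVING COUNT(*) > 1""".stripMargin)
result.show(false)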
 
This works fine, but it takes 1.2 minutes for one combination, and my use case will have 400
combinations, which means about 8 hours. That does not meet the SLA; we want this to run in
under 1 hour. What is the best way to implement this use case?

Best Regards,
Anil Langote
+1-425-633-9747

> On Jan 8, 2017, at 8:17 PM, Holden Karau <holden@pigscanfly.ca> wrote:
> 
> To start with, caching and having a known partitioner will help a bit; there is also
the IndexedRDD project, but in general Spark might not be the best tool for the job.  Have
you considered having Spark output to something like memcache?
> 
> What's the goal you are trying to accomplish?
> 
>> On Sun, Jan 8, 2017 at 5:04 PM Anil Langote <anillangote0106@gmail.com> wrote:
>> Hi All,
>> 
>> I have a requirement where I want to build a distributed HashMap which holds 10M
key-value pairs and provides very efficient lookups for each key. I tried loading the file
into a JavaPairRDD and calling its lookup method, but it is very slow.
>> 
>> How can I achieve a very fast lookup by a given key?
>> 
>> Thank you
>> Anil Langote 
