spark-user mailing list archives

From: Anil Langote <>
Subject: Re: Efficient look up in Key Pair RDD
Date: Mon, 09 Jan 2017 03:29:14 GMT
Hi Ayan


Thanks a lot for the reply. What is GROUPING SET? I did try GROUP BY with a UDAF, but it doesn't
perform well: one combination takes 1.5 mins, and my use case has 400 combinations, which would
take ~400 mins. I am looking for a solution that scales with the number of combinations.


Thank you

Anil Langote




From: ayan guha <>
Date: Sunday, January 8, 2017 at 10:26 PM
To: Anil Langote <>
Cc: Holden Karau <>, user <>
Subject: Re: Efficient look up in Key Pair RDD


Have you tried something like GROUPING SETS? That seems to be exactly what you are looking for: it computes several GROUP BY column sets in a single pass over the data.
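
For illustration, a rough sketch of what that could look like here. This is hedged: "df", "input", "doubles", and "sum_array" are assumed names for the loaded DataFrame, a temp view, the array column, and an element-wise array-sum UDAF (sketched further down the thread); Spark's GROUPING SETS syntax lists the grouped columns first:

    // Sketch only: one pass over the data computes several attribute
    // combinations at once, instead of one query per combination.
    df.createOrReplaceTempView("input")

    val sums = spark.sql("""
      SELECT Attribute_0, Attribute_1, Attribute_2, sum_array(doubles) AS sums
      FROM input
      GROUP BY Attribute_0, Attribute_1, Attribute_2
      GROUPING SETS ((Attribute_0, Attribute_1),
                     (Attribute_0, Attribute_2),
                     (Attribute_1, Attribute_2))
      HAVING COUNT(*) > 1
    """)

Columns absent from a given grouping set come back as NULL, so rows belonging to different combinations can be told apart (or use the GROUPING() function).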


On Mon, Jan 9, 2017 at 12:37 PM, Anil Langote <> wrote:

Sure, let me explain my requirement. I have an input file with attributes (25 of them), and the
last column is an array of doubles (14,500 elements in the original file). A small sample, with
4 attribute columns and 4 doubles per array:


Attribute_0  Attribute_1  Attribute_2  Attribute_3  doubles
5            3            5            3            0.2938933463658645   0.0437040427073041    0.23002681025029648  0.18003221216680454
3            2            1            3            0.5353599620508771   0.026777650111232787  0.31473082754161674  0.2647786522276575
5            3            5            2            0.8803063581705307   0.8101324740101096    0.48523937757683544  0.5897714618376072
3            2            1            3            0.33960064683141955  0.46537001358164043   0.543428826489435    0.42653939565053034
2            2            0            5            0.5108235777360906   0.4368119043922922    0.8651556676944931   0.7451477943975504

Now I have to compute the element-wise addition of the doubles for any given attribute
combination. For example, with the file above we will have the possible combinations below:


1. Attribute_0, Attribute_1
2. Attribute_0, Attribute_2
3. Attribute_0, Attribute_3
4. Attribute_1, Attribute_2
5. Attribute_2, Attribute_3
6. Attribute_1, Attribute_3
7. Attribute_0, Attribute_1, Attribute_2
8. Attribute_0, Attribute_1, Attribute_3
9. Attribute_0, Attribute_2, Attribute_3
10. Attribute_1, Attribute_2, Attribute_3
11. Attribute_1, Attribute_2, Attribute_3, Attribute_4


Now, if we process the Attribute_0, Attribute_1 combination, we want the output below: the key
is the attribute values joined by "_", and the value is the element-wise sum of the double
arrays in that group. Here 5_3 covers rows 1 and 3 of the sample (e.g. 0.2938933463658645 +
0.8803063581705307 = 1.1741997045363952) and 3_2 covers rows 2 and 4. All the other
combinations have to be processed the same way.


5_3 ==>  [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]

3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]


Solution tried


I have created a parquet file whose schema matches the input, with the last column being the
array of doubles. The parquet file is 276 GB and has 2.65 M records.


I have implemented a UDAF with the following schema:


Input schema : array of doubles

Buffer schema : array of doubles 

Return schema : array of doubles
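
For reference, a minimal sketch of such a UDAF (assuming the UserDefinedAggregateFunction API available in Spark 1.5 through 2.x, with the fixed array length passed at construction; the class and field names are mine):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // Element-wise sum of equal-length arrays of doubles.
    class ArraySum(size: Int) extends UserDefinedAggregateFunction {
      def inputSchema: StructType  = new StructType().add("doubles", ArrayType(DoubleType))
      def bufferSchema: StructType = new StructType().add("sums", ArrayType(DoubleType))
      def dataType: DataType       = ArrayType(DoubleType)
      def deterministic: Boolean   = true

      // Start every group from an all-zero accumulator.
      def initialize(buffer: MutableAggregationBuffer): Unit =
        buffer(0) = Array.fill(size)(0.0)

      private def addInto(acc: Array[Double], in: Seq[Double]): Array[Double] = {
        var i = 0
        while (i < size) { acc(i) += in(i); i += 1 }
        acc
      }

      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0))
          buffer(0) = addInto(buffer.getSeq[Double](0).toArray, input.getSeq[Double](0))

      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = addInto(buffer1.getSeq[Double](0).toArray, buffer2.getSeq[Double](0))

      def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
    }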


I load the data from the parquet file, register the UDAF, and then run the query below. Note
that SUM here is the UDAF, not the built-in; the table and array-column names are placeholders:

SELECT Attribute_0, Attribute_1, SUM(doubles) FROM input GROUP BY Attribute_0, Attribute_1 HAVING COUNT(*) > 1
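
In Scala terms the whole step is roughly this (the path, view name, and registration name are illustrative; registering under a name other than SUM avoids shadowing the built-in):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ArraySum").getOrCreate()

    // Illustrative path; ArraySum is the UDAF sketched above, sized to the array.
    val df = spark.read.parquet("/path/to/input.parquet")
    df.createOrReplaceTempView("input")
    spark.udf.register("sum_array", new ArraySum(14500))

    val result = spark.sql("""
      SELECT Attribute_0, Attribute_1, sum_array(doubles) AS sums
      FROM input
      GROUP BY Attribute_0, Attribute_1
      HAVING COUNT(*) > 1
    """)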


This works fine, but it takes 1.2 mins for one combination; my use case will have 400
combinations, which means ~8 hours, and that does not meet the SLA. We want this below 1 hour.
What is the best way to implement this use case?


Best Regards,

Anil Langote


On Jan 8, 2017, at 8:17 PM, Holden Karau <> wrote:

To start with, caching and having a known partitioner will help a bit; then there is also the
IndexedRDD project, but in general Spark might not be the best tool for the job. Have you
considered having Spark output to something like memcache?
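
A minimal sketch of the caching + known-partitioner idea (the file layout and names are illustrative):

    import org.apache.spark.{HashPartitioner, SparkContext}

    // Build a pair RDD, give it a known partitioner, and cache it.
    def buildIndex(sc: SparkContext, path: String) = {
      val pairs = sc.textFile(path).map { line =>
        val Array(key, value) = line.split("\t", 2)
        (key, value)
      }
      pairs.partitionBy(new HashPartitioner(200)).cache()
    }

    // With a known partitioner, lookup() computes only the single partition
    // that can contain the key instead of scanning the whole RDD.
    // val hits: Seq[String] = buildIndex(sc, "/path/to/pairs.tsv").lookup("someKey")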


What's the goal you are trying to accomplish?


On Sun, Jan 8, 2017 at 5:04 PM Anil Langote <> wrote:

Hi All,


I have a requirement where I want to build a distributed HashMap that holds 10M key-value
pairs and provides very efficient lookups for each key. I tried loading the file into a
JavaPairRDD and calling the lookup method, but it is very slow.


How can I achieve very fast lookups by a given key?


Thank you

Anil Langote 



Best Regards,
Ayan Guha
