Hi guys,

I'm interested in IndexedRDD too.
How many rows in the big table match the small table in each run? If the number of matching rows stays constant, then I think Jem expects the runtime to stay roughly constant as well (i.e. ~0.6 seconds in all cases). That said, I agree with Andrew: the performance isn't bad at all. Without an index, I would expect the join to take much longer.
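
To make that expectation concrete, here is a toy sketch in plain Python. It is only an illustration of the reasoning: a dict stands in for an index, the function name `indexed_join` is made up, and none of this reflects how IndexedRDD is actually implemented or distributed. The point is just that when the join probes an index once per small-table key, the number of probes does not grow with the big table.

```python
# Toy stand-in for an indexed table: a plain dict keyed by integer keys.
# ASSUMPTION: this is not IndexedRDD's implementation, only an illustration.
def indexed_join(small, big_index):
    """Join by probing the index once per small-table key."""
    probes = 0
    joined = {}
    for key, left_value in small.items():
        probes += 1
        right_value = big_index.get(key)
        if right_value is not None:
            joined[key] = (left_value, right_value)
    return joined, probes


# 100 small-table keys, all of which exist in every big table below.
small = {k: f"s{k}" for k in range(0, 1000, 10)}

for big_size in (1_000, 10_000, 100_000):
    big_index = {k: f"b{k}" for k in range(big_size)}
    joined, probes = indexed_join(small, big_index)
    # The probe count tracks the small table, not the big one.
    print(f"big={big_size:>7}  matches={len(joined)}  probes={probes}")
```

In a real cluster there are other costs (shuffling the small side, per-partition overhead), so this only describes the ideal case.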

Can IndexedRDD be sorted by keys as well? 

Best Regards,

Jerry

On Tue, Jan 13, 2015 at 11:06 AM, Andrew Ash <andrew@andrewash.com> wrote:
Hi Jem,

Linear scaling with the size of the big table doesn't seem that surprising to me.  What were you expecting?

I assume you're doing normalRDD.join(indexedRDD).  If you were to replace the indexedRDD with a normal RDD, what times do you get?

On Tue, Jan 13, 2015 at 5:35 AM, Jem Tucker <jem.tucker@gmail.com> wrote:
Hi,
 
I have been playing around with IndexedRDD (https://issues.apache.org/jira/browse/SPARK-2365, https://github.com/amplab/spark-indexedrdd) and have been very impressed with its performance. However, some performance testing has revealed worse-than-expected scaling of join performance*, and I was wondering whether anyone else has experience using it and what they have found.
 
Thanks,
 
Jem
 
*The table below shows some of my results when joining a small RDD to a large IndexedRDD.  Each table consisted of a Long key and a 15-character String value. The join time increases almost linearly with the number of rows in the bigger table.

Small Table Rows    Big Table Rows    Time (s)
50000               10000000          0.6
50000               50000000          0.8
50000               100000000         1.5
50000               150000000         2.1
50000               200000000         2.8
50000               500000000         7.2
50000               1000000000        12.2
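
As a rough check on the "almost linear" description, a straight line can be fitted to the seven measurements above. This is only a sketch: the numbers are copied from the table, the helper name `fit_line` is made up for illustration, and a least-squares fit over seven single runs says nothing about variance between runs.

```python
# Measurements copied from the table above:
# (big table rows, join time in seconds); the small table is always 50,000 rows.
measurements = [
    (10_000_000, 0.6),
    (50_000_000, 0.8),
    (100_000_000, 1.5),
    (150_000_000, 2.1),
    (200_000_000, 2.8),
    (500_000_000, 7.2),
    (1_000_000_000, 12.2),
]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Work in millions of rows so the slope reads as seconds per million rows.
points = [(rows / 1e6, secs) for rows, secs in measurements]
slope, intercept = fit_line(points)

print(f"fixed cost   ~ {intercept:.2f} s")
print(f"growth       ~ {slope * 1000:.1f} ms per million big-table rows")
for rows_m, secs in points:
    fitted = slope * rows_m + intercept
    print(f"{rows_m:>6.0f}M rows: measured {secs:>4.1f} s, fitted {fitted:.1f} s")
```

The intercept gives a rough fixed cost per join and the slope gives the extra time per million rows in the big table, which makes it easier to compare against the same join run on a normal RDD.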