A replication join is what Spark calls a broadcast join. Search Google for that term; there are many examples.
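
For instance, here is a minimal sketch in Scala, assuming Spark 2.x and the DataFrame API. The object name, the helper `replicatedJoin`, and the sample tables and column names are made up for illustration. The `broadcast` function from `org.apache.spark.sql.functions` hints that the small side should be copied to every executor, so the large side does not need to be shuffled.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  // Hypothetical helper: inner-joins `large` to `small` on the column `key`,
  // hinting Spark to replicate `small` to every executor.
  def replicatedJoin(large: DataFrame, small: DataFrame, key: String): DataFrame =
    large.join(broadcast(small), Seq(key))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: a "large" fact table and a small lookup table.
    val orders = Seq((1, "US", 10.0), (2, "IN", 25.0), (3, "US", 7.5))
      .toDF("order_id", "country_code", "amount")
    val countries = Seq(("US", "United States"), ("IN", "India"))
      .toDF("country_code", "country_name")

    val joined = replicatedJoin(orders, countries, "country_code")
    joined.explain() // inspect the physical plan for a broadcast hash join
    joined.show()

    spark.stop()
  }
}
```

Calling `explain()` is one way to check that the join really was planned as a broadcast join. Spark may also pick a broadcast join on its own when it estimates one side to be below `spark.sql.autoBroadcastJoinThreshold`.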

 

A semi join can be done on DataFrames/Datasets by passing "leftsemi" as the third parameter (the join type) to the join function.
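
A minimal sketch in Scala, again assuming Spark 2.x. The join type string for a semi join is "leftsemi"; the object name, the helper `semiJoin`, and the sample data are hypothetical. The right side is first projected down to the distinct join keys, which is one way to do the "filter and projection" step so that it stays small.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SemiJoinSketch {
  // Hypothetical helper: keeps the rows of `left` whose `key` value also
  // appears in `right`. Only the columns of `left` are returned.
  def semiJoin(left: DataFrame, right: DataFrame, key: String): DataFrame = {
    // Project the right side down to the distinct join keys before joining.
    val keysOnly = right.select(key).distinct()
    left.join(keysOnly, Seq(key), "leftsemi")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("semi-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: customers, and orders placed by some of them.
    val customers = Seq((1, "Ann"), (2, "Bob"), (3, "Cid"))
      .toDF("customer_id", "name")
    val orders = Seq((100, 1), (101, 1), (102, 3))
      .toDF("order_id", "customer_id")

    // Customers that have at least one order, each listed once.
    semiJoin(customers, orders, "customer_id").show()

    spark.stop()
  }
}
```

Unlike an inner join, a semi join does not duplicate a left row when it matches several right rows, and no right-side columns appear in the result.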

 

Not sure about the other two.

 


From: Aakash Basu
Sent: Thursday, November 17, 2016 3:17 PM
To: user@spark.apache.org
Subject: Join Query

 

Hi,




Conceptually I understand the Spark joins below, but when it comes to implementation I can't find much information on Google. Please help me with code or pseudocode for these joins using java-spark or scala-spark.

 

Replication Join:

                Given two datasets, one of which is small enough to fit in memory, perform a replicated join using Spark.

Note: I need a program that shows why this case fits a replication join.

 

Semi-Join:

                Given a huge dataset, do a semi-join using Spark. Note that with a semi-join, one dataset must be filtered and projected so that it fits into the cache.

Note: I need a program that shows why this case fits a semi-join.

 

 

Composite Join:

                Given a dataset that is still too big to fit in memory after filtering, perform a composite join on pre-sorted and pre-partitioned data using Spark.

Note: I need a program that shows why this case fits a composite join.

 

 

Repartition join:

                Join two datasets by performing a repartition join in Spark.

Note: I need a program that shows why this case fits a repartition join.






Thanks,

Aakash.