spark-user mailing list archives

From Shreya Agarwal <shrey...@microsoft.com>
Subject RE: Join Query
Date Mon, 21 Nov 2016 05:15:02 GMT

A replication (replicated) join is what Spark calls a broadcast join. Search for that term on Google; there are many examples.
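
A minimal Scala sketch of a broadcast join, assuming Spark 2.x with the DataFrame API. The table names, column names, and sample rows are hypothetical and only for illustration. The broadcast() function from org.apache.spark.sql.functions hints that the small side should be copied to every executor, and explain() prints the physical plan, which is one way to check that a broadcast join was actually chosen.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {

  // Joins a large table against a small lookup table, hinting that the
  // small side should be broadcast instead of shuffled.
  def joinWithLookup(large: DataFrame, small: DataFrame): DataFrame =
    large.join(broadcast(small), Seq("country_code"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastJoinSketch")
      .master("local[*]") // assumption: local run for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: `orders` stands in for the large dataset,
    // `countries` for the one small enough to fit in memory.
    val orders = Seq((1, "US", 10.0), (2, "IN", 20.0), (3, "US", 5.0))
      .toDF("order_id", "country_code", "amount")
    val countries = Seq(("US", "United States"), ("IN", "India"))
      .toDF("country_code", "country_name")

    val joined = joinWithLookup(orders, countries)

    // The physical plan should mention a broadcast hash join
    // if the hint was honored.
    joined.explain()
    joined.show()

    spark.stop()
  }
}
```

Spark may also pick a broadcast join on its own when it estimates one side to be smaller than spark.sql.autoBroadcastJoinThreshold; the explicit hint just makes the intent visible.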

A semi join can be done on DataFrames/Datasets by passing "leftsemi" as the join type (the
third parameter) to the join function.
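
A minimal Scala sketch of a left semi join, assuming Spark 2.x, where the join type string is "leftsemi". The customers/orders tables and their columns are hypothetical. Following the description in the question, the second dataset is first projected down to its key column and de-duplicated so that it is as small as possible before the join.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SemiJoinSketch {

  // Keeps only the customers that have at least one order.
  // A left semi join returns columns from the left side only.
  def customersWithOrders(customers: DataFrame, orders: DataFrame): DataFrame = {
    // Project and de-duplicate the right side so only the join keys remain.
    val orderKeys = orders.select("customer_id").distinct()
    customers.join(
      orderKeys,
      customers("customer_id") === orderKeys("customer_id"),
      "leftsemi") // join type passed as the third parameter
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SemiJoinSketch")
      .master("local[*]") // assumption: local run for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val customers = Seq((1, "Ann"), (2, "Bob"), (3, "Cid"))
      .toDF("customer_id", "name")
    val orders = Seq((100, 1), (101, 1), (102, 3))
      .toDF("order_id", "customer_id")

    val result = customersWithOrders(customers, orders)
    result.explain()
    result.show() // customers 1 and 3 remain; customer 2 has no orders

    spark.stop()
  }
}
```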

Not sure about the other two.
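
For the repartition join, one possible reading is an ordinary shuffle join: both datasets are hash-partitioned on the join key so that matching rows land in the same partition, then joined. The Scala sketch below assumes Spark 2.x and that interpretation; the table and column names are hypothetical. Setting spark.sql.autoBroadcastJoinThreshold to -1 disables automatic broadcasting so that the small sample data does not turn into a broadcast join.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RepartitionJoinSketch {

  // Hash-partitions both sides on the join key, then joins them.
  // The explicit repartition is for illustration; Spark inserts the
  // required shuffle itself when it plans a shuffle-based join.
  def repartitionJoin(left: DataFrame, right: DataFrame, key: String): DataFrame = {
    val l = left.repartition(col(key))
    val r = right.repartition(col(key))
    l.join(r, Seq(key))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RepartitionJoinSketch")
      .master("local[*]") // assumption: local run for illustration only
      .getOrCreate()
    import spark.implicits._

    // Disable automatic broadcast joins so the shuffle join is visible.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Hypothetical sample data.
    val users = Seq((1, "Ann"), (2, "Bob")).toDF("user_id", "name")
    val events = Seq((1, "login"), (1, "click"), (2, "login"))
      .toDF("user_id", "event")

    val joined = repartitionJoin(users, events, "user_id")

    // The plan should show a hash-partitioned exchange on user_id
    // feeding a shuffle-based join rather than a broadcast.
    joined.explain()
    joined.show()

    spark.stop()
  }
}
```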


From: Aakash Basu <aakash.spark.raj@gmail.com>
Sent: Thursday, November 17, 2016 3:17 PM
To: user@spark.apache.org
Subject: Join Query

Hi,



Conceptually I understand the Spark joins below, but I cannot find much information on
Google about how to implement them. Please help me with code or pseudocode for these joins
using Java or Scala Spark.

Replication Join:
Given two datasets, where one is small enough to fit into memory, perform a replicated
join using Spark.
Note: I need a program that shows this case fits a replication join.

Semi-Join:
Given a huge dataset, do a semi-join using Spark. Note that with a semi-join, one dataset
must be filtered and projected so that it fits into the cache.
Note: I need a program that shows this case fits a semi-join.


Composite Join:
Given a dataset that is still too big after filtering and cannot fit into memory, perform
a composite join on pre-sorted and pre-partitioned data using Spark.
Note: I need a program that shows this case fits a composite join.


Repartition join:
Join two datasets by performing a repartition join in Spark.
Note: I need a program that shows this case fits a repartition join.





Thanks,
Aakash.
