spark-user mailing list archives

From titli batali <titlibat...@gmail.com>
Subject Broadcast Join and Inner Join giving different result on same DataFrame
Date Fri, 30 Dec 2016 15:45:20 GMT
Hi,

I have two DataFrames that share a common column, Product_Id, on which I
need to perform a join.

    // readCSVToDataFrame takes (sqlCtx: SQLContext, path: String, schema: StructType)
    val transactionDF = readCSVToDataFrame(sqlCtx, pathToReadTransactions,
      transactionSchema)
    val productDF = readCSVToDataFrame(sqlCtx, pathToReadProduct, productSchema)

Since the transaction data is very large but the product data is small, I
would ideally do a broadcast join in which I broadcast productDF.

    val productBroadcastDF = broadcast(productDF)
    val broadcastJoin = transactionDF.join(productBroadcastDF, "productId")

Or simply, a plain inner join,

    val innerJoin = transactionDF.join(productDF, "productId")

should give the same result as above.

But if I use the plain inner join I get a DataFrame with the joined values,
whereas if I use the broadcast join I get an empty DataFrame with empty
values. I am not able to explain this behavior. Ideally both should give the
same result.
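
A minimal sketch of how the two joins could be compared side by side is
given below. It is not the original job: it assumes Spark 2.x with a local
SparkSession, and the sample rows and the column names other than productId
are hypothetical stand-ins for the CSV data. The automatic broadcast
threshold is disabled so that only the explicit broadcast() hint produces a
broadcast join, which makes the two physical plans differ.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object CompareJoins {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compare-joins")
      .master("local[*]")
      // Disable automatic broadcasting so the plain join is not silently
      // turned into a broadcast join for these tiny sample tables.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the CSV-backed DataFrames.
    val transactionDF = Seq((1, "p1", 10.0), (2, "p2", 20.0), (3, "p1", 5.0))
      .toDF("transactionId", "productId", "amount")
    val productDF = Seq(("p1", "Widget"), ("p2", "Gadget"))
      .toDF("productId", "productName")

    val innerJoin = transactionDF.join(productDF, "productId")
    val broadcastJoin = transactionDF.join(broadcast(productDF), "productId")

    // Print the physical plans to see which join strategy each one uses.
    innerJoin.explain()
    broadcastJoin.explain()

    // Compare row counts of the two results.
    println(s"inner join rows:     ${innerJoin.count()}")
    println(s"broadcast join rows: ${broadcastJoin.count()}")

    // Show rows that appear in one result but not in the other.
    innerJoin.except(broadcastJoin).show()
    broadcastJoin.except(innerJoin).show()

    spark.stop()
  }
}
```

If the counts differ on the real data, the plans printed by explain() and the
rows returned by except() should help narrow down where the two joins diverge.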

What could have gone wrong? Has anyone faced a similar issue?


Thanks,
Prateek
